Grok released by xAI

Grok is an artificial intelligence modeled after "The Hitchhiker's Guide to the Galaxy". It is designed to answer almost any question and even to offer problem-solving suggestions! Grok is designed to answer with a sense of humor and has a rebellious streak, so please don't use it if you dislike humor! Grok's unique and fundamental advantage is its real-time knowledge of the world through the X platform. It will also answer questions that most other AI systems refuse to answer. Grok is still in early testing; it is the best we could achieve with two months of training, so please expect it to improve rapidly with your help. Thank you, xAI Team.

Why We Built Grok

At xAI, we aim to create artificial intelligence tools that assist humans in their pursuit of understanding and knowledge.

By creating and improving Grok, our goals are:

To gather feedback and ensure that we are building AI tools that benefit all of humanity: We believe it is crucial to design AI tools that are useful to people of all backgrounds and political views. We also aim to empower our users with AI tools while complying with the law. Our goal with Grok is to explore and demonstrate this approach in the open.
To empower research and innovation: We hope Grok becomes a powerful research assistant for anyone, helping them quickly access relevant information, process data, and come up with new ideas. Our ultimate goal is to have our artificial intelligence tools assist in the pursuit of understanding.

The Journey to Grok-1

Grok-1 is the engine that powers Grok and is an advanced LLM (Large Language Model) that we have developed over the past four months. Grok-1 has gone through multiple iterations during this time.

After announcing xAI, we trained a prototype LLM (Grok-0) with 33 billion parameters. This early model approached the capabilities of LLaMA 2 (70B) on standard LM benchmarks while using only half the training resources. Over the past two months, we have made significant progress in reasoning and coding capabilities, leading to Grok-1, an advanced language model that achieves 63.2% on the HumanEval coding task and 73% on MMLU.

To understand the improvements in capabilities with Grok-1, we conducted a series of evaluations using standard machine learning benchmarks that measure mathematical and reasoning abilities.

GSM8k: Middle school math word problems (Cobbe et al. 2021), using chain-of-thought prompting.

MMLU: Multidisciplinary multiple-choice questions (Hendrycks et al. 2021), with 5 in-context examples (5-shot).

HumanEval: Python code completion tasks (Chen et al. 2021) evaluated for pass@1 zero-shot.

MATH: Middle and high school math problems written in LaTeX (Hendrycks et al. 2021) with fixed 4-shot prompts.
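
For readers unfamiliar with the pass@1 metric mentioned for HumanEval, the following is a minimal sketch of how pass@k is commonly estimated, following the unbiased estimator described by Chen et al. 2021. The function names and the sample data are illustrative assumptions; this is not xAI's evaluation harness, and the post does not say how many samples were drawn per problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that pass the unit tests
    k: the k in pass@k
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that at least one of k completions drawn without
    # replacement from the n samples is correct.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed: list[bool]) -> float:
    """With a single sample per problem, pass@1 is the fraction solved."""
    if not passed:
        raise ValueError("no results to score")
    return sum(passed) / len(passed)


if __name__ == "__main__":
    # Hypothetical results for four problems, one completion each.
    print(benchmark_pass_at_1([True, False, True, True]))
    # Hypothetical: 10 samples for one problem, 3 of them correct.
    print(pass_at_k(n=10, c=3, k=1))
```

With one sample per problem the two functions agree: `pass_at_k(1, c, 1)` is 1.0 for a solved problem and 0.0 otherwise, and the benchmark score is the mean over problems.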

On these benchmarks, Grok-1 showed strong results, surpassing all other models in its compute class, including ChatGPT-3.5 and Inflection-1. It is surpassed only by models such as GPT-4 that were trained with much larger amounts of data and compute. This demonstrates the rapid progress we are making at xAI in training LLMs with exceptional efficiency.

Because these benchmarks can be found on the web, we cannot rule out that our models were inadvertently trained on them. We therefore hand-graded our model (as well as Claude-2 and GPT-4) on the 2023 Hungarian National High School Math Exam, which was published at the end of May, after we collected our dataset. Grok passed the exam with a C (59%), Claude-2 earned the same grade with 55%, and GPT-4 received a B with 68%. All models were evaluated at a temperature of 0.1 with the same prompt. Note that we made no effort to tune for this evaluation. The experiment serves as a "real-world" test on a dataset that our model was never explicitly tuned for.

Human-graded evaluation	Grok-0	GPT-3.5	Claude 2	Grok-1	GPT-4
Hungarian National High School Math Exam (May 2023)	37% (1-shot)	41% (1-shot)	55% (1-shot)	59% (1-shot)	68% (1-shot)
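
The exam evaluation above fixes the sampling temperature at 0.1. As a rough illustration of what that setting does, the sketch below applies standard temperature scaling to a softmax over next-token logits. The function name and the example logits are assumptions made for illustration; the post does not describe the actual decoding code.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw logits into sampling probabilities at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # hypothetical logits for three candidate tokens
    for t in (1.0, 0.1):
        print(t, softmax_with_temperature(logits, t))
```

At a temperature of 0.1 almost all probability mass lands on the highest-scoring token, so sampling is close to greedy decoding and repeated runs are nearly deterministic, which makes hand-graded comparisons between models easier to reproduce.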

We provide a summary of the important technical details of Grok-1 in our model card.

xAI Engineering Design

At the frontier of deep learning research, reliable infrastructure must be built with the same care as datasets and learning algorithms. To create Grok, we built a custom training and inference stack based on Kubernetes, Rust, and JAX.

LLM training runs like a freight train: if one car derails, the entire train is dragged off the tracks, and it is hard to get it back on again. GPUs can fail in many ways: manufacturing defects, loose connections, incorrect configuration, degraded memory chips, occasional random bit flips, and more. During training, we synchronize computations across thousands of GPUs for months at a time, and at that scale all of these failure modes become frequent. To overcome these challenges, we developed a custom distributed system that identifies each type of failure immediately and handles it automatically. At xAI, maximizing useful compute per watt is a key focus of our efforts. Over the past few months, our infrastructure has allowed us to minimize downtime and maintain high Model FLOPs Utilization (MFU) even with unreliable hardware.
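
To make the MFU metric concrete, here is a small sketch of how it is commonly computed: achieved training FLOPs per second divided by the theoretical peak of the hardware. It assumes the widely used approximation of about 6 FLOPs per parameter per token for a dense transformer (forward plus backward pass, ignoring attention FLOPs). The function name and every number in the example are hypothetical and are not xAI's figures.

```python
def model_flops_utilization(
    params: float,
    tokens_per_second: float,
    num_accelerators: int,
    peak_flops_per_accelerator: float,
) -> float:
    """Estimate MFU as achieved FLOPs/s over peak FLOPs/s.

    Assumes ~6 FLOPs per parameter per training token, which is an
    approximation for dense transformers and ignores attention cost.
    """
    if num_accelerators <= 0 or peak_flops_per_accelerator <= 0:
        raise ValueError("hardware figures must be positive")
    achieved = 6.0 * params * tokens_per_second
    peak = num_accelerators * peak_flops_per_accelerator
    return achieved / peak


if __name__ == "__main__":
    # Hypothetical run: a 1B-parameter model processing 1M tokens/s
    # on 64 accelerators rated at 3e14 FLOPs/s each.
    mfu = model_flops_utilization(1e9, 1e6, 64, 3e14)
    print(f"MFU = {mfu:.1%}")
```

Because downtime drives tokens per second to zero while the hardware keeps drawing power, every hour spent recovering from a failed GPU lowers the MFU averaged over the run, which is why fast automated failure handling matters for this metric.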

Rust has proven to be an ideal choice for building scalable, reliable, and maintainable infrastructure. It offers high performance and a rich ecosystem, and it prevents most of the bugs typically found in distributed systems. Given the small size of our team, infrastructure reliability is crucial; otherwise, maintenance would starve innovation. Rust gives us confidence that any code modification or refactoring is likely to produce working programs that will run for months with minimal supervision.

We are now preparing for the next leap in our model's capabilities, which will require reliable coordination of training runs across tens of thousands of accelerators, running internet-scale data pipelines, and integrating new types of functionality and tools into Grok. If this sounds exciting to you, please apply to join our team.

xAI Research

We provide Grok with search tools and real-time access to information, but like all LLMs based on next-token prediction, our model can still generate false or contradictory information. We believe that achieving reliable reasoning is the most important research direction to address the current limitations of the system. Here, we would like to highlight some promising research directions that excite us at xAI:

Scalable tool-assisted supervision: Human feedback is crucial. However, providing consistent and accurate feedback can be challenging, especially when dealing with lengthy code or complex reasoning steps. AI can assist in scalable supervision by looking up references from different sources, using external tools to validate intermediate steps, and seeking human feedback when necessary. Our goal is to make the most effective use of our AI tutors' time with the help of our model as an AI assistant.
Integration with formal verification for safety, reliability, and foundations: To create AI systems with deep thinking capabilities, we plan to cultivate reasoning abilities in less ambiguous and more verifiable contexts. This allows us to evaluate our systems without the need for human feedback or interaction with the real world. One major goal of this approach is to provide formal guarantees for code correctness, particularly in the context of AI safety.
Long-context understanding and retrieval: Training models to efficiently discover useful knowledge in a particular context is at the core of building truly intelligent systems. We are researching methods that can discover and retrieve information whenever it is needed.
Adversarial robustness: Adversarial examples have shown that optimizers can easily exploit vulnerabilities in AI systems, leading to serious errors both during training and at serving time. These vulnerabilities are long-standing weaknesses of deep learning models. We are particularly focused on improving the robustness of LLMs, reward models, and monitoring systems.
Multimodal capabilities: Currently, Grok does not have other senses such as vision and audio. To better assist users, we aim to equip Grok with these different senses, enabling a wider range of applications, including real-time interaction and assistance.

We believe that AI has tremendous potential to provide significant scientific and economic value to society, and therefore, we will strive to develop reliable safeguards to prevent catastrophic malicious use. We believe that every effort should be made to ensure that AI remains a positive force.

If you want to contribute to our mission, please apply to join our team.