Public debates are emerging over AI benchmarks and how AI labs report them. Recently, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. In response, Igor Babuschkin, a co-founder of xAI, defended the company’s position.
The truth appears to lie somewhere between these conflicting claims. A post on xAI’s blog presented a graph of Grok 3’s performance on AIME 2025, a set of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark, but it remains a commonly used yardstick for a model’s math abilities.
The graph xAI released showed two versions of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best available model, o3-mini-high, on AIME 2025. But OpenAI employees were quick to point out that the graph omitted o3-mini-high’s score at “cons@64.”
“Cons@64,” short for “consensus@64,” gives a model 64 attempts at each problem and takes the most frequently generated answer as its final response. As you might expect, this tends to boost benchmark scores considerably, so omitting it from a graph can create the false impression that one model surpasses another when it doesn’t.
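To make the scoring rule concrete, here is a minimal sketch of consensus@k voting. The `sample` callable standing in for a model query is a placeholder assumption, not either lab’s actual evaluation harness:

```python
from collections import Counter
from typing import Callable

def consensus_at_k(sample: Callable[[str], str], problem: str, k: int = 64) -> str:
    """Query the model k times on the same problem and return the
    most frequently generated answer (majority vote)."""
    answers = [sample(problem) for _ in range(k)]
    # most_common(1) returns the single (answer, count) pair with the
    # highest count; ties break by first appearance in the list.
    return Counter(answers).most_common(1)[0][0]

# Usage with a dummy sampler that answers "42" about two-thirds of the time:
if __name__ == "__main__":
    import random
    dummy = lambda _problem: random.choice(["42", "42", "17"])
    print(consensus_at_k(dummy, "What is 6 x 7?"))  # almost surely "42"
```

The accuracy reported at cons@64 is then the fraction of problems whose majority answer is correct, which is why it can diverge so sharply from a single-sample score.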
Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s initial scores on AIME 2025 were lower than o3-mini-high’s. Grok 3 Reasoning Beta also trailed slightly behind OpenAI’s o1 model set to “medium” computing. Despite these results, xAI has been promoting Grok 3 as the “world’s smartest AI.”
Babuschkin contended on X that OpenAI has previously released similarly misleading benchmark charts. A more neutral party in the debate compiled a graph showing how various models perform at cons@64.
AI researcher Nathan Lambert highlighted in a post that a crucial metric remains unknown: the computational (and monetary) cost it took each model to achieve its best score. Cons@64, for instance, requires 64 generations per problem, multiplying inference cost accordingly. That gap underscores how little most AI benchmarks convey about models’ limitations, and about their strengths.
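To see why that cost dimension matters, here is a back-of-the-envelope sketch. The per-answer token count and price below are illustrative assumptions, not figures published by either lab:

```python
# Rough cost of scoring AIME 2025 at cons@64 versus a single attempt.
# All numbers are illustrative assumptions, not published figures.
PROBLEMS = 15                     # an AIME exam has 15 questions
SAMPLES_PER_PROBLEM = 64          # cons@64 draws 64 answers per question
TOKENS_PER_ANSWER = 4_000         # assumed average output length
PRICE_PER_MILLION_TOKENS = 5.00   # assumed USD price for output tokens

single_attempt = PROBLEMS * TOKENS_PER_ANSWER * PRICE_PER_MILLION_TOKENS / 1e6
cons_at_64 = single_attempt * SAMPLES_PER_PROBLEM

print(f"single attempt: ${single_attempt:.2f}")  # $0.30 under these assumptions
print(f"cons@64:        ${cons_at_64:.2f}")      # $19.20, a flat 64x multiplier
```

Whatever the real numbers, the multiplier is fixed: a cons@64 score always costs 64 times as much compute as the single-attempt score shown alongside it.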