Did xAI Fabricate Grok 3’s Benchmark Results?

Public debates are emerging regarding AI benchmarks and their presentation by AI laboratories. Recently, an employee at OpenAI accused Elon Musk’s AI enterprise, xAI, of disseminating misleading benchmark results for its newest AI model, Grok 3. In response, Igor Babushkin, a co-founder of xAI, defended the company’s position.

The actual situation appears to lie somewhere between these conflicting claims. A post on xAI’s blog presented a graph illustrating Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational exam. Although some experts have raised concerns about AIME’s validity as an AI benchmark, it remains a frequently used measure to assess a model’s mathematical abilities.

The graph released by xAI depicted two versions of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperforming OpenAI’s top model, o3-mini-high, on AIME 2025. However, OpenAI employees were quick to point out that xAI’s graph omitted o3-mini-high’s score at “cons@64.”

“Cons@64,” short for “consensus@64,” gives a model 64 attempts at each problem and takes the most frequently generated answer as its final response. This approach tends to boost models’ benchmark scores considerably, so omitting it from a graph can create the false impression that one model surpasses another when it does not.
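For illustration, here is a minimal sketch of how a cons@k score could be computed. The names `sample_fn` and `problems` are hypothetical stand-ins for a model-querying function and a list of (question, answer) pairs; this is not any lab’s actual evaluation harness.

```python
from collections import Counter

def consensus_answer(attempts):
    """Return the most frequently generated answer among the attempts."""
    counts = Counter(attempts)
    # The answer produced most often across all attempts wins.
    answer, _ = counts.most_common(1)[0]
    return answer

def cons_at_k(sample_fn, problems, k=64):
    """Score a model at cons@k: sample k answers per problem, grade the majority answer."""
    correct = 0
    for question, gold in problems:
        attempts = [sample_fn(question) for _ in range(k)]
        if consensus_answer(attempts) == gold:
            correct += 1
    return correct / len(problems)
```

At k=1 this reduces to grading a single sample per problem, which is presumably what the scores without the cons@64 label reflect; the gap between the two settings is what the dispute over xAI’s graph turns on.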

Grok 3 Reasoning Beta and Grok 3 mini Reasoning’s first-attempt scores on AIME 2025 were lower than o3-mini-high’s. Grok 3 Reasoning Beta also trailed slightly behind OpenAI’s o1 model with its compute setting at “medium.” Despite these comparisons, xAI has been promoting Grok 3 as the “world’s smartest AI.”

Babushkin contended on X that OpenAI has previously published similarly misleading benchmark charts. An independent party in the debate compiled a more balanced graph showing each model’s performance at cons@64.

AI researcher Nathan Lambert noted in a post that a crucial metric remains unknown: the computational and monetary cost each model incurred to achieve its best score. That gap underscores how little most AI benchmarks convey about models’ limitations as well as their strengths.
