Earlier this week, Meta drew criticism for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on the crowdsourced benchmark LM Arena. The incident prompted LM Arena's maintainers to apologize, change their policies, and score the unmodified, vanilla Maverick.
As of Friday, that unmodified model, "Llama-4-Maverick-17B-128E-Instruct," ranked below models including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro, many of which have been available for months.
After the experimental model's use came to light, the release version of Llama 4 was added to LM Arena, where it landed in 32nd place, as one Twitter user noted.
The lower ranking is partly explained by the experimental variant, "Llama-4-Maverick-03-26-Experimental," which Meta said was optimized for conversationality. Those optimizations evidently played well on LM Arena, where human raters compare model outputs and choose which they prefer.
For a number of reasons, LM Arena has never been regarded as the most reliable measure of an AI model's performance. And, as experts note, tuning a model specifically to a benchmark makes it harder to predict how that model will perform in other contexts.
In a statement to TechCrunch, a Meta spokesperson explained that the company experiments with various custom variants of its models.
"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LMArena," the spokesperson said. They added that Meta has released an open-source version and looks forward to seeing how developers customize Llama 4 for their own use cases, and to hearing their feedback.