On Saturday, Meta released Maverick, one of its new flagship AI models, which ranks second on LM Arena, a benchmark in which human raters compare the outputs of different models and pick the one they prefer. But the version of Maverick that Meta submitted to LM Arena appears to differ from the version available to developers.
As several AI researchers pointed out on X, Meta's announcement describes the Maverick on LM Arena as an "experimental chat version." A chart on the official Llama website, meanwhile, shows that Meta's LM Arena testing was conducted with a version of Llama 4 Maverick optimized for conversational interactions.
LM Arena has never been the most reliable measure of an AI model's performance. Even so, AI companies generally have not customized their models specifically to score better on it, or at least have not admitted to doing so. Tailoring a model to a benchmark and then releasing a different version to the public makes it hard for developers to predict how the model will actually perform in practice. Ideally, benchmarks, for all their shortcomings, should give a picture of a single model's strengths and weaknesses across a range of tasks.
Researchers on X have observed stark behavioral differences between the publicly available Maverick and the version hosted on LM Arena: the LM Arena version reportedly uses far more emoji and gives much lengthier responses.
Meta and Chatbot Arena, the organization that maintains LM Arena, have been contacted for comment on the discrepancy.