Wednesday, April 16, 2025

Debates over AI benchmarking have reached Pokémon


Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.
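
To make the size of that gap concrete, here is a minimal sketch in Python that records each score together with the setup that produced it. The `Result` type, the `harness` field, and the "default" label for the non-scaffold run are assumptions made for illustration; only the two percentages and the "custom scaffold" wording come from the reported figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str
    harness: str   # hypothetical label for the evaluation setup
    score: float   # accuracy, in percent


def gap_in_points(a: Result, b: Result) -> float:
    """Return b.score - a.score in percentage points for one benchmark."""
    if a.benchmark != b.benchmark:
        raise ValueError("results come from different benchmarks")
    return round(b.score - a.score, 1)


def comparable(a: Result, b: Result) -> bool:
    """Treat two scores as like-for-like only if benchmark and harness match."""
    return a.benchmark == b.benchmark and a.harness == b.harness


# "default" is an assumed name for the run without the custom scaffold.
default_run = Result("Claude 3.7 Sonnet", "SWE-bench Verified", "default", 62.3)
scaffold_run = Result("Claude 3.7 Sonnet", "SWE-bench Verified", "custom scaffold", 70.3)

print(gap_in_points(default_run, scaffold_run))  # 8.0
print(comparable(default_run, scaffold_run))     # False
```

The same model on the same benchmark moves by 8 percentage points depending on the setup, which is why a score reported without its harness is hard to compare against another.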

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks — Pokémon included — are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.
