Debates over AI benchmarking have reached Pokémon

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town

119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x

— Jush (@Jush21e8) April 10, 2025

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks — Pokémon included — are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.
