Anthropic, which built Claude, the LLM I find most useful, tests each of its models on Pokemon Red (I was a Blue player myself). Earlier models weren’t able to do much, but the latest version, using “extended thinking” (aka reasoning, the trend all the AI providers are after), is on a roll.

[Image: Anthropic Pokemon Red]

This is more meaningful to me than most benchmarks, and I’m only half-joking. I remember Misty’s badge being hard to get!

You can watch the AI play here: https://www.twitch.tv/claudeplayspokemon