16 Oct 2025
Leading AI models underwent comprehensive testing on popular video games like Tetris, Super Mario, and Sokoban, challenging them beyond conventional benchmarks. The experiment uncovered the emergence of genuine planning capabilities in some models and demonstrated impressive cross-game knowledge transfer, suggesting that games can reveal strengths and weaknesses that standard benchmarks overlook.

Major AI models were put to the test on games such as Tetris, Super Mario, and Sokoban instead of the usual academic benchmarks, revealing surprising differences between how they score on paper and how they actually reason, plan, and adapt in play.
Llama 4, despite its strong performance on traditional benchmarks, struggled significantly with gameplay across various titles.
Earlier AI models, when playing Tetris, consistently left many gaps and barely formed lines, leading to rapid collapses.
OpenAI’s o4-mini demonstrated improved performance in Tetris, holding longer than previous models, but still failed to clear a single line.
DeepSeek R1 started with a promising attempt in Tetris, successfully forming a line, but this initial success quickly unraveled.
Claude 4 Opus mainly outlasted the other models rather than achieving outright wins, with its score reflecting only the pieces it placed before game over.
OpenAI's o3-pro exhibited remarkable foresight in Tetris, consistently clearing line after line and appearing to plan ahead, and it still had not failed by the time the experiment concluded.
GPT 4o did not perform well in Super Mario, failing to secure any notable achievements in the game.
Claude 3.5 in Super Mario displayed moments of apparent intelligence, such as finding a hidden block, but then inexplicably plunged into an abyss.
Claude 3.7 showed better performance in Super Mario, crushing Goombas and bravely leaping over pits, even uncovering a star, but ultimately succumbed to disaster just before the finish line, resembling human-like mistakes after a strong run.
OpenAI’s o3 consistently emerged as the best overall performer, significantly excelling in games like Super Mario, Sokoban, and Candy Crush, demonstrating a quantum leap in capability compared to other models.
Gemini 2.5 Flash successfully completed the first level of Sokoban but struggled significantly with the second, making critical mistakes early on.
OpenAI’s o3 showcased advanced planning in Sokoban, successfully navigating the complex second level by understanding spatial constraints and allowing the level to almost solve itself, though it stalled after level 5.
The enhanced o3-pro successfully completed all 6 levels in Sokoban, demonstrating its advanced problem-solving abilities.
The AI models executed moves very slowly during these gaming tasks because these tasks are not their primary intended function.
Researchers developed a 'harness': at each step the current game state is converted into a textual representation and fed to the AI, whose reply is parsed as its next action, enabling the models to play a variety of games including Ace Attorney.
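As a rough illustration of how such a harness might work (not the researchers' actual implementation), the sketch below renders a toy Sokoban board as text, prompts a model for a move, and applies the parsed reply. The `query_model` stub is a hypothetical stand-in for a real LLM call.

```python
# Minimal sketch of a game "harness": the game state is rendered as text,
# sent to the model as a prompt, and the model's reply is parsed into a move.
# query_model is a stub; a real harness would call an LLM API here instead.

from typing import List

# A tiny Sokoban-like level: '#' wall, '@' player, '$' box, '.' goal, ' ' floor.
LEVEL: List[List[str]] = [
    list("#####"),
    list("#@$.#"),
    list("#####"),
]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def render(board: List[List[str]]) -> str:
    """Serialize the board into the textual form shown to the model."""
    return "\n".join("".join(row) for row in board)


def query_model(prompt: str) -> str:
    """Stand-in for an LLM call; always pushes right in this toy example."""
    return "right"


def find_player(board: List[List[str]]):
    for r, row in enumerate(board):
        for c, cell in enumerate(row):
            if cell == "@":
                return r, c


def apply_move(board: List[List[str]], move: str) -> None:
    """Apply a move, pushing a box if one is directly ahead.

    Goal-marker restoration is omitted to keep the sketch short.
    """
    dr, dc = MOVES[move]
    r, c = find_player(board)
    nr, nc = r + dr, c + dc
    if board[nr][nc] == "$" and board[nr + dr][nc + dc] in (" ", "."):
        board[nr + dr][nc + dc] = "$"  # push the box forward
        board[nr][nc] = "@"
        board[r][c] = " "
    elif board[nr][nc] in (" ", "."):
        board[nr][nc] = "@"
        board[r][c] = " "


def play(steps: int = 3) -> None:
    board = [row[:] for row in LEVEL]
    for step in range(steps):
        prompt = (
            "You are playing Sokoban. Push the box ($) onto the goal (.).\n"
            f"Board:\n{render(board)}\n"
            "Reply with one of: up, down, left, right."
        )
        move = query_model(prompt).strip().lower()
        if move in MOVES:
            apply_move(board, move)
        print(f"step {step}: model chose {move}\n{render(board)}\n")


if __name__ == "__main__":
    play()
```

The same loop generalizes to other games by swapping the renderer and the move parser, which is why one harness can drive titles as different as Tetris and Ace Attorney.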
For the first time, genuine planning and strategic thinking are beginning to emerge in large AI models during gameplay, an absolutely incredible finding despite the slow execution.
Games provide an incredibly rich and challenging testbed for evaluating core AI capabilities, demanding long-term planning and adaptation in ways few other benchmarks can, thereby revealing true strengths and weaknesses.
After training on Sokoban, AIs improved their spatial reasoning skills and performed up to 8% better in the previously unseen Tetris, demonstrating effective knowledge reuse between different game types.
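To make the transfer figure concrete, here is a minimal, hypothetical sketch of how such a cross-game gain might be computed: compare a model's Tetris scores before and after Sokoban training. The score values below are illustrative placeholders, not data from the experiment.

```python
# Hypothetical sketch of quantifying cross-game transfer: compare Tetris
# scores of a baseline model against the same model after Sokoban training.
# The numbers are made-up placeholders, not results from the experiment.

baseline_scores = [40, 45, 38, 42]   # Tetris scores without Sokoban training
transfer_scores = [44, 48, 41, 45]   # Tetris scores after Sokoban training

baseline_mean = sum(baseline_scores) / len(baseline_scores)
transfer_mean = sum(transfer_scores) / len(transfer_scores)

# Relative improvement attributable to knowledge reuse (here about 7.9%).
gain = (transfer_mean - baseline_mean) / baseline_mean
print(f"Transfer gain: {gain:.1%}")
```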
| AI Model | Game Tested | Performance/Insight |
|---|---|---|
| Llama 4 | General Gameplay | Struggled with gameplay despite strong benchmark performance. |
| OpenAI o4-mini | Tetris | Held longer than previous models but failed to clear any lines. |
| DeepSeek R1 | Tetris | Initially formed a line but its promising start quickly unraveled. |
| Claude 4 Opus | General Gameplay | Primarily outcompeted others by losing later, not by winning; points reflect pieces placed. |
| OpenAI o3-pro | Tetris, Sokoban | Demonstrated genuine planning, consistently cleared lines in Tetris, and finished all 6 Sokoban levels. |
| GPT 4o | Super Mario | Did not perform well. |
| Claude 3.5 | Super Mario | Appeared smart (found hidden block) but made inexplicable mistakes (dove into abyss). |
| Claude 3.7 | Super Mario | Improved play (crushed Goombas, leapt pits, found star) but failed before finishing, akin to human error. |
| OpenAI o3 | Super Mario, Sokoban, Candy Crush | Consistently the best overall performer, showing a quantum leap in capability. |
| Gemini 2.5 Flash | Sokoban | Finished first level but struggled significantly with the second due to critical early mistakes. |
| AIs (General) | Various Games | Executed moves very slowly through the textual 'harness'; gaming is not their primary intended function. |
| AIs (General) | Sokoban -> Tetris | Improved spatial reasoning and performed up to 8% better in unseen Tetris after Sokoban training, showing knowledge transfer. |
