Major AI Models Tested on Video Games: Performance and Strategic Insights

By Two Minute Papers
What a time to be alive!

16 Oct 2025

Leading AI models were tested on popular video games such as Tetris, Super Mario, and Sokoban rather than on conventional benchmarks. The experiment found signs of genuine planning in some models and measurable cross-game knowledge transfer: training on Sokoban improved play in the previously unseen Tetris by up to 8%.

Key Points Summary

Introduction to AI Gaming Tests
Major AI models were tested on games such as Tetris, Super Mario, and Sokoban instead of the usual academic benchmarks, exposing strengths and weaknesses that benchmark scores alone do not show.
Llama 4 Performance Overview
Llama 4, while known for its strong performance in traditional benchmarks, unfortunately struggled significantly with gameplay across various titles.
Tetris Performance of Previous Models
Earlier AI models, when playing Tetris, consistently left many gaps and barely formed lines, leading to rapid collapses.
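One way to make "many gaps" and "barely formed lines" concrete is to count holes and complete rows on a board snapshot. The sketch below is only an illustration: the board encoding (`X` for filled, `.` for empty) and the function names `count_holes` and `count_full_lines` are assumptions, not the metric used in the experiment.

```python
from typing import List

# Assumed encoding: each row is a string, top row first;
# 'X' is a filled cell and '.' is an empty cell.
Board = List[str]


def count_full_lines(board: Board) -> int:
    """Number of rows that are completely filled and would be cleared."""
    return sum(1 for row in board if row and all(ch == "X" for ch in row))


def count_holes(board: Board) -> int:
    """Number of empty cells that have a filled cell somewhere above them."""
    holes = 0
    width = len(board[0]) if board else 0
    for col in range(width):
        seen_block = False
        for row in board:  # scan the column from top to bottom
            if row[col] == "X":
                seen_block = True
            elif seen_block:
                holes += 1  # empty cell buried under a block
    return holes


EXAMPLE_BOARD = [
    "....",
    "X..X",
    "X.XX",
    "XXXX",
    ".XXX",
]

if __name__ == "__main__":
    print("holes:", count_holes(EXAMPLE_BOARD))
    print("full lines:", count_full_lines(EXAMPLE_BOARD))
```

A player that stacks pieces carelessly drives the hole count up while the full-line count stays near zero, which matches the behavior described for the earlier models.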
OpenAI o4-mini in Tetris
OpenAI’s o4-mini lasted longer in Tetris than previous models but still failed to clear a single line.
DeepSeek R1 in Tetris
DeepSeek R1 started with a promising attempt in Tetris, successfully forming a line, but this initial success quickly unraveled.
Claude 4 Opus General Performance
Claude 4 Opus outscored other AIs mainly by losing later rather than by winning outright: its points reflect only the number of pieces placed before game over.
OpenAI o3-pro's Strategic Planning in Tetris
OpenAI's o3-pro exhibited remarkable foresight in Tetris, clearing line after line and appearing to plan ahead. It had still not failed by the time the experiment ended.
GPT-4o in Super Mario
GPT-4o did poorly in Super Mario and achieved nothing notable in the game.
Claude 3.5 in Super Mario
Claude 3.5 in Super Mario displayed moments of apparent intelligence, such as finding a hidden block, but then inexplicably plunged into an abyss.
Claude 3.7 in Super Mario
Claude 3.7 played Super Mario better, crushing Goombas, leaping over pits, and even uncovering a star. It still met disaster just before the finish line, the kind of mistake a human makes after a strong run.
Overall Best Performer: OpenAI o3
OpenAI’s o3 consistently emerged as the best overall performer, significantly excelling in games like Super Mario, Sokoban, and Candy Crush, demonstrating a quantum leap in capability compared to other models.
Gemini 2.5 Flash in Sokoban
Gemini 2.5 Flash successfully completed the first level of Sokoban but struggled significantly with the second, making critical mistakes early on.
OpenAI o3's Planning in Sokoban
OpenAI’s o3 showcased advanced planning in Sokoban, successfully navigating the complex second level by understanding spatial constraints and allowing the level to almost solve itself, though it stalled after level 5.
OpenAI o3-pro's Sokoban Mastery
The enhanced o3-pro successfully completed all 6 levels in Sokoban, demonstrating its advanced problem-solving abilities.
AI Execution Speed in Gaming
The AI models executed moves very slowly, since playing games is not what they were primarily designed to do.
The 'Harness' for Textual Game Representation
Researchers developed a 'harness,' which is a textual representation of the game fed to the AIs at each step to prompt their next action, enabling them to play various games including Ace Attorney.
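As a rough illustration of the harness idea, the sketch below renders a tiny Sokoban-style level as text, hands that text to a policy at every step, and applies the move the policy returns. Everything in it is an assumption made for illustration: the level symbols, the names `State`, `parse`, `render`, `step`, and `run_episode`, the prompt wording, and the `scripted_policy` that stands in for a call to a model. The researchers' actual harness is not described in this summary.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, Tuple

Pos = Tuple[int, int]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
# Assumed symbols: '#' wall, '@' player, '$' box, '.' goal, '*' box on goal.
LEVEL = "#####\n#@$.#\n#####"


@dataclass(frozen=True)
class State:
    walls: FrozenSet[Pos]
    goals: FrozenSet[Pos]
    boxes: FrozenSet[Pos]
    player: Pos
    height: int
    width: int


def parse(level: str) -> State:
    """Build a State from a text level (player-on-goal is not handled)."""
    rows = level.splitlines()
    cells = {(r, c): ch for r, row in enumerate(rows) for c, ch in enumerate(row)}
    pick = lambda sym: frozenset(p for p, ch in cells.items() if ch == sym)
    player = next(p for p, ch in cells.items() if ch == "@")
    return State(pick("#"), pick("."), pick("$"), player, len(rows), max(map(len, rows)))


def render(s: State) -> str:
    """Turn the game state into the text the model would read."""
    lines = []
    for r in range(s.height):
        row = ""
        for c in range(s.width):
            p = (r, c)
            if p in s.walls:
                row += "#"
            elif p == s.player:
                row += "@"
            elif p in s.boxes:
                row += "*" if p in s.goals else "$"
            else:
                row += "." if p in s.goals else " "
        lines.append(row)
    return "\n".join(lines)


def step(s: State, action: str) -> State:
    """Apply one move; illegal moves and unknown actions leave the state unchanged."""
    dr, dc = MOVES.get(action, (0, 0))
    target = (s.player[0] + dr, s.player[1] + dc)
    if target in s.walls:
        return s
    boxes = s.boxes
    if target in boxes:
        beyond = (target[0] + dr, target[1] + dc)
        if beyond in s.walls or beyond in boxes:
            return s  # the box is blocked, so the push fails
        boxes = (boxes - {target}) | {beyond}
    return replace(s, boxes=boxes, player=target)


def run_episode(s: State, policy: Callable[[str], str], max_steps: int = 50) -> Tuple[State, int]:
    """Harness loop: show the board as text, ask for a move, apply it, repeat."""
    steps = 0
    while steps < max_steps and s.boxes != s.goals:
        prompt = f"Board:\n{render(s)}\nReply with one of: {', '.join(MOVES)}"
        s = step(s, policy(prompt))
        steps += 1
    return s, steps


def scripted_policy(prompt: str) -> str:
    """Stand-in for a model call; a real harness would send the prompt to an AI."""
    return "right"


if __name__ == "__main__":
    final, steps = run_episode(parse(LEVEL), scripted_policy)
    print(render(final))
    print("steps taken:", steps)
```

Because the policy sees only text and answers with a single word per step, the same loop shape could wrap other games by swapping `render` and `step`, and one model query per move would also make play slow.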
Key Insight: Emergence of Genuine Planning
For the first time, genuine planning and strategic thinking are beginning to emerge in large AI models during gameplay, an absolutely incredible finding despite the slow execution.
Key Insight: Games as Rich Benchmarks
Games provide an incredibly rich and challenging testbed for evaluating core AI capabilities, demanding long-term planning and adaptation in ways few other benchmarks can, thereby revealing true strengths and weaknesses.
Key Insight: Cross-Game Knowledge Transfer
After training on Sokoban, AIs improved their spatial reasoning skills and performed up to 8% better in the previously unseen Tetris, demonstrating effective knowledge reuse between different game types.

After training on Sokoban, the AIs improve their spatial reasoning skills, and when they play the previously unseen Tetris, they do better, up to 8% better, just from reusing their knowledge learned in Sokoban.

Under Details

AI Model	Game Tested	Performance/Insight
Llama 4	General Gameplay	Struggled with gameplay despite strong benchmark performance.
OpenAI o4-mini	Tetris	Held longer than previous models but failed to clear any lines.
DeepSeek R1	Tetris	Initially formed a line but its promising start quickly unraveled.
Claude 4 Opus	General Gameplay	Primarily outcompeted others by losing later, not by winning; points reflect pieces placed.
OpenAI o3-pro	Tetris, Sokoban	Demonstrated genuine planning, consistently cleared lines in Tetris, and finished all 6 Sokoban levels.
GPT-4o	Super Mario	Did not perform well.
Claude 3.5	Super Mario	Appeared smart (found hidden block) but made inexplicable mistakes (dove into abyss).
Claude 3.7	Super Mario	Improved play (crushed Goombas, leapt pits, found star) but failed before finishing, akin to human error.
OpenAI o3	Super Mario, Sokoban, Candy Crush	Consistently the best overall performer, showing a quantum leap in capability.
Gemini 2.5 Flash	Sokoban	Finished first level but struggled significantly with the second due to critical early mistakes.
AIs (General)	Various Games	Execution was very slow due to the textual 'harness' mechanism; tasks are not their primary design.
AIs (General)	Sokoban -> Tetris	Improved spatial reasoning and performed up to 8% better in unseen Tetris after Sokoban training, showing knowledge transfer.

Related Tags

Gaming

Impressive

OpenAI

Tetris

Sokoban

SuperMario
