Major AI Models Tested on Video Games: Performance and Strategic Insights

Leading AI models were tested on popular video games such as Tetris, Super Mario, and Sokoban rather than on conventional benchmarks. The experiment found signs of genuine planning in some models and measurable cross-game knowledge transfer, with Sokoban training improving play in Tetris.

Key Points Summary

  • Introduction to AI Gaming Tests

    Major AI models were evaluated on games such as Tetris, Super Mario, and Sokoban instead of the usual academic benchmarks, which exposed strengths and weaknesses those benchmarks do not show.

  • Llama 4 Performance Overview

    Llama 4, despite its strong results on traditional benchmarks, struggled with gameplay across the tested titles.

  • Tetris Performance of Previous Models

    Earlier AI models, when playing Tetris, consistently left many gaps and barely formed lines, leading to rapid collapses.

  • OpenAI o4-mini in Tetris

    OpenAI’s o4-mini lasted longer in Tetris than previous models but still failed to clear a single line.

  • DeepSeek R1 in Tetris

    DeepSeek R1 started with a promising attempt in Tetris, successfully forming a line, but this initial success quickly unraveled.

  • Claude 4 Opus General Performance

    Claude 4 Opus outscored other models mainly by losing later rather than by winning outright: its points reflect only the pieces placed before game over.

  • OpenAI o3-pro's Strategic Planning in Tetris

    OpenAI's o3-pro appeared to plan ahead in Tetris, clearing line after line, and it had still not failed when the experiment ended.

  • GPT-4o in Super Mario

    GPT-4o performed poorly in Super Mario and made no notable progress in the game.

  • Claude 3.5 in Super Mario

    Claude 3.5 in Super Mario displayed moments of apparent intelligence, such as finding a hidden block, but then inexplicably plunged into an abyss.

  • Claude 3.7 in Super Mario

    Claude 3.7 played Super Mario better, crushing Goombas, leaping over pits, and even uncovering a star, but it failed just before the finish line, much as a human player might slip after a strong run.

  • Overall Best Performer: OpenAI o3

    OpenAI’s o3 was consistently the best overall performer, excelling in Super Mario, Sokoban, and Candy Crush and showing a large jump in capability over the other models.

  • Gemini 2.5 Flash in Sokoban

    Gemini 2.5 Flash successfully completed the first level of Sokoban but struggled significantly with the second, making critical mistakes early on.

  • OpenAI o3's Planning in Sokoban

    OpenAI’s o3 showcased advanced planning in Sokoban, successfully navigating the complex second level by understanding spatial constraints and allowing the level to almost solve itself, though it stalled after level 5.

  • OpenAI o3-pro's Sokoban Mastery

    The enhanced o3-pro successfully completed all 6 levels in Sokoban, demonstrating its advanced problem-solving abilities.

  • AI Execution Speed in Gaming

    The models executed moves very slowly, since playing games is not what they were primarily designed for.

  • The 'Harness' for Textual Game Representation

    Researchers built a 'harness': a textual representation of the game that is fed to the model at each step to prompt its next action. The same approach let the models play a variety of games, including Ace Attorney.

  • Key Insight: Emergence of Genuine Planning

    For the first time, genuine planning and strategic thinking are beginning to emerge in large AI models during gameplay, even though execution remains slow.

  • Key Insight: Games as Rich Benchmarks

    Games provide an incredibly rich and challenging testbed for evaluating core AI capabilities, demanding long-term planning and adaptation in ways few other benchmarks can, thereby revealing true strengths and weaknesses.

  • Key Insight: Cross-Game Knowledge Transfer

    After training on Sokoban, AIs improved their spatial reasoning skills and performed up to 8% better in the previously unseen Tetris, demonstrating effective knowledge reuse between different game types.

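The harness idea described above can be sketched in a few lines. This is only an illustration, not the researchers' implementation: the `Sokoban` class, the board legend, the prompt wording, and the `agent` callable (a stand-in for a call to a language model) are all assumptions made for this example. At each step the board is rendered as text, passed to the agent, and the reply is parsed as a move.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Legend assumed for this sketch: '#' wall, '@' player, '$' box, '.' goal,
# '*' box on a goal, '+' player on a goal, ' ' floor.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class Sokoban:
    walls: set
    goals: set
    boxes: set
    player: tuple
    height: int
    width: int

    @classmethod
    def from_text(cls, text: str) -> "Sokoban":
        """Parse a wall-enclosed level drawn with the legend above."""
        rows = text.strip("\n").split("\n")
        walls, goals, boxes, player = set(), set(), set(), (0, 0)
        for r, row in enumerate(rows):
            for c, ch in enumerate(row):
                if ch == "#":
                    walls.add((r, c))
                if ch in ".*+":
                    goals.add((r, c))
                if ch in "$*":
                    boxes.add((r, c))
                if ch in "@+":
                    player = (r, c)
        return cls(walls, goals, boxes, player, len(rows), max(map(len, rows)))

    def render(self) -> str:
        """Return the board as text, the form a harness would show the model."""
        lines = []
        for r in range(self.height):
            cells = []
            for c in range(self.width):
                pos = (r, c)
                if pos in self.walls:
                    cells.append("#")
                elif pos in self.boxes:
                    cells.append("*" if pos in self.goals else "$")
                elif pos == self.player:
                    cells.append("+" if pos in self.goals else "@")
                else:
                    cells.append("." if pos in self.goals else " ")
            lines.append("".join(cells).rstrip())
        return "\n".join(lines)

    def step(self, move: str) -> bool:
        """Apply one move; return False if it is unrecognized or blocked."""
        if move not in MOVES:
            return False
        dr, dc = MOVES[move]
        target = (self.player[0] + dr, self.player[1] + dc)
        if target in self.walls:
            return False
        if target in self.boxes:
            beyond = (target[0] + dr, target[1] + dc)
            if beyond in self.walls or beyond in self.boxes:
                return False  # a box cannot be pushed into a wall or another box
            self.boxes.remove(target)
            self.boxes.add(beyond)
        self.player = target
        return True

    def solved(self) -> bool:
        return self.boxes == self.goals


def build_prompt(game: Sokoban) -> str:
    """Hypothetical prompt: the text board plus an instruction to pick a move."""
    return (
        "You are playing Sokoban. # wall, @ player, $ box, . goal.\n"
        f"{game.render()}\n"
        "Reply with one move: up, down, left, or right."
    )


def run_episode(game: Sokoban, agent: Callable[[str], str],
                max_steps: int = 50) -> Optional[int]:
    """Query the agent once per step; return steps used, or None if unsolved."""
    for steps in range(1, max_steps + 1):
        reply = agent(build_prompt(game))
        game.step(reply.strip().lower())  # blocked or invalid replies change nothing
        if game.solved():
            return steps
    return None
```

For example, `run_episode(Sokoban.from_text("#####\n#@$.#\n#####"), lambda prompt: "right")` solves this one-push level in a single step. In an actual test the lambda would be replaced by a call to the model under evaluation, and one model query per move is a cost that any harness of this shape pays.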

Details

AI Model | Game Tested | Performance/Insight
Llama 4 | General Gameplay | Struggled with gameplay despite strong benchmark performance.
OpenAI o4-mini | Tetris | Lasted longer than previous models but failed to clear any lines.
DeepSeek R1 | Tetris | Initially formed a line, but its promising start quickly unraveled.
Claude 4 Opus | General Gameplay | Outscored others mainly by losing later, not by winning; points reflect pieces placed.
OpenAI o3-pro | Tetris, Sokoban | Demonstrated genuine planning, consistently cleared lines in Tetris, and finished all 6 Sokoban levels.
GPT-4o | Super Mario | Did not perform well.
Claude 3.5 | Super Mario | Appeared smart (found a hidden block) but made inexplicable mistakes (plunged into an abyss).
Claude 3.7 | Super Mario | Improved play (crushed Goombas, leapt over pits, found a star) but failed just before the finish, akin to human error.
OpenAI o3 | Super Mario, Sokoban, Candy Crush | Consistently the best overall performer, with a large jump in capability over the other models.
Gemini 2.5 Flash | Sokoban | Finished the first level but struggled with the second after critical early mistakes.
AIs (General) | Various Games | Execution was very slow; these tasks are not what the models were primarily designed for.
AIs (General) | Sokoban -> Tetris | Improved spatial reasoning and performed up to 8% better in unseen Tetris after Sokoban training, showing knowledge transfer.

Tags

AI
Gaming
Impressive
OpenAI
Tetris
Sokoban
SuperMario