As traditional methods of measuring generative AI performance fall short, developers are now turning to creative alternatives — and one surprising playground is Minecraft. Known for its iconic blocky world, the best-selling video game has become an experimental arena where AI models are tested and ranked based on their building skills.
At the heart of this innovative effort is MC-Bench (Minecraft Benchmark) — a website designed to let users compare AI creations inside Minecraft. Built by a small team of volunteers, MC-Bench allows users to vote on which AI-generated build best matches a specific prompt, such as crafting a snowman or designing a tropical beach hut. Only after casting their vote can users see which AI model created each build.
Interestingly, the project was started by Adi Singh, a high school senior, who believes Minecraft’s universal appeal makes it the perfect tool for measuring AI progress. “People instinctively understand Minecraft,” Singh explained. “They’re familiar with the style, the look, and the vibe — which makes it easier to see how much AI models have improved.”
Currently, eight contributors work on the project as volunteers. Major AI players, including Anthropic, Google, OpenAI, and Alibaba, have provided some resources, such as free access to their models, but they have no formal ties to MC-Bench.
For now, the benchmark focuses on simpler builds designed to reflect how far AI has come since the GPT-3 era. However, Singh has bigger plans. “We could scale to complex, long-form builds requiring goal-oriented tasks,” he shared. “Video games might become the safest and most controllable environments to test agentic reasoning — making them ideal for pushing AI limits without real-world risks.”
Minecraft isn’t the first game to serve as a testbed for AI. Titles like Pokémon Red, Street Fighter, and even Pictionary have previously been used to challenge AI models. This trend stems from the difficulty of creating meaningful benchmarks that truly reflect a model’s capabilities. Standardized tests, while useful, can overstate a model’s ability: much of their content overlaps with the data the models were trained on, so strong scores don’t always translate into broadly useful skills.
For example, GPT-4 scores in the 88th percentile on the LSAT, yet it struggles with tasks as simple as counting the ‘R’s in the word “strawberry.” Similarly, Anthropic’s Claude 3.7 Sonnet scored 62.3% on SWE-bench Verified, a software engineering benchmark, yet plays Pokémon worse than most young children.
What sets MC-Bench apart is its simplicity and broad appeal. The challenge is technically a programming task, since each model must write code that generates its Minecraft build, but most voters never look at that code. They judge the results on visual appeal alone, which keeps the benchmark accessible and lets the project collect comparison data from a much wider pool of people.
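MC-Bench’s actual harness isn’t described here, so purely as an illustration, the sketch below uses the open-source mcpi Python library (a common scripting API for Minecraft servers running the RaspberryJuice plugin) to show the kind of block-placement code a model might emit for a prompt like “build a snowman.” The library choice and the build layout are assumptions, not MC-Bench’s real interface.

```python
# Illustrative sketch only: not MC-Bench's actual code. Uses the mcpi API
# against a local Minecraft server (RaspberryJuice plugin, default port 4711).
from mcpi.minecraft import Minecraft
from mcpi import block

mc = Minecraft.create()            # connect to the local Minecraft server
pos = mc.player.getTilePos()       # anchor the build a few blocks from the player
x, y, z = pos.x + 5, pos.y, pos.z

# Bottom tier: a 3x3x3 cube of snow blocks standing in for the base "sphere"
mc.setBlocks(x - 1, y, z - 1, x + 1, y + 2, z + 1, block.SNOW_BLOCK.id)
# Middle tier: a narrower two-block column
mc.setBlocks(x, y + 3, z, x, y + 4, z, block.SNOW_BLOCK.id)
# Head: a single snow block on top
mc.setBlock(x, y + 5, z, block.SNOW_BLOCK.id)
```

Voters would only ever see the rendered result of a script like this, never the script itself.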
Whether this Minecraft-based leaderboard offers real insight into AI model performance is still open for debate. Singh, however, is confident in its potential. “The leaderboard aligns pretty well with my own experience using these models, which isn’t something I can say about most text-based benchmarks,” he said. “I think MC-Bench might actually help companies figure out if their models are improving in meaningful ways.”