The Arc Prize Foundation, co-founded by prominent AI researcher François Chollet, has launched a new benchmark designed to test the general intelligence of today’s most advanced AI models, and so far, the results are humbling. In a blog post Monday, the nonprofit unveiled ARC-AGI-2, a new version of its intelligence test aimed at pushing AI systems beyond memorization and brute-force computing. The early leaderboard shows that even top-tier models from OpenAI and DeepSeek are struggling: OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3%, while large models like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash hover around 1% accuracy.
The ARC-AGI tests challenge models with visual pattern puzzles — grids of colored squares — designed to measure reasoning and adaptability. Instead of relying on training data or sheer computational power, models must interpret entirely new problems on the fly.
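For readers curious what these puzzles look like in data form, ARC tasks are distributed as JSON-style structures: a few demonstration input/output grid pairs plus a test input, where each grid is a 2D array of integers 0 through 9 standing in for colors. The sketch below is illustrative only; the toy task and the `solve` function are invented for this example, not taken from the benchmark itself.

```python
# A minimal sketch of the ARC task format: demonstration pairs plus a
# test input, with grids encoded as 2D lists of integers 0-9 (colors).
# This toy task's hidden rule is "reflect the grid top-to-bottom".
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [1, 0]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 0]]},  # a solver must infer and apply the rule
    ],
}

def solve(grid):
    """Apply the rule inferred from the demonstrations: flip rows top-to-bottom."""
    return grid[::-1]

prediction = solve(task["test"][0]["input"])
print(prediction)  # [[0, 0], [3, 0]]
```

The point of the benchmark is that nothing in a model's training data contains this exact rule; it must be induced from the two demonstrations alone.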
According to the Arc Prize Foundation, over 400 human participants took the ARC-AGI-2 test to establish a baseline. Human panels averaged 60% accuracy — far surpassing AI performance. François Chollet described ARC-AGI-2 as a more rigorous measure of AI’s true intelligence compared to the original ARC-AGI-1, which was released five years ago.
One of the critical changes in ARC-AGI-2 is the introduction of efficiency as a core metric. Models are now tested not only on their ability to solve the task but also on how cost-effective they are. The goal is to push beyond the brute force methods that undermined the first benchmark. As co-founder Greg Kamradt explained, “The real question isn’t just ‘Can AI solve this?’ but ‘At what efficiency or cost?’ That’s the true mark of intelligence.”
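The foundation has not published its scoring code, but the idea of tracking cost alongside accuracy can be sketched with simple arithmetic. Everything below is a hypothetical illustration: the function name, fields, and task counts are assumptions, with only the headline figures (75.7% at roughly $200 per task) drawn from this article.

```python
# Hypothetical illustration: a cost-aware leaderboard entry that records
# both accuracy and average compute cost per task. The structure and
# numbers of attempts are invented, not the foundation's actual code.
def leaderboard_entry(model, correct, attempted, total_cost_usd):
    accuracy = correct / attempted
    cost_per_task = total_cost_usd / attempted
    return {"model": model, "accuracy": accuracy, "cost_per_task": cost_per_task}

# Using the article's figures for o3 on ARC-AGI-1 (75.7% at ~$200/task),
# with an assumed 1,000-task run purely for the arithmetic:
o3 = leaderboard_entry("o3 (ARC-AGI-1)", 757, 1000, 200 * 1000)
print(o3["accuracy"], o3["cost_per_task"])  # 0.757 200.0
```

Under this framing, two models with identical accuracy can rank very differently once cost per task enters the picture, which is exactly the shift Kamradt describes.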
ARC-AGI-1, which went unbeaten for years, was finally cracked in December 2024, when OpenAI’s o3 model reached 75.7% accuracy, but only after burning through roughly $200 worth of computing power per task. On ARC-AGI-2, that same o3 model scored a disappointing 4%, a sign that raw compute alone is no longer enough.
The launch of ARC-AGI-2 comes as pressure mounts in the AI industry for more meaningful benchmarks. Many researchers argue that current tests fail to capture the essence of artificial general intelligence (AGI), particularly when it comes to reasoning and creativity. Hugging Face co-founder Thomas Wolf recently noted the industry’s need for unsaturated metrics to properly measure AGI traits.
To drive progress, the Arc Prize Foundation also announced the Arc Prize 2025 challenge, daring developers to build models that reach 85% accuracy on ARC-AGI-2 while keeping compute costs at just $0.42 per task — a dramatic reduction from current AI spending.
As Chollet emphasized, ARC-AGI-2 represents a fundamental shift: intelligence isn’t just about solving problems — it’s about solving them efficiently. And based on early results, most of today’s leading AI models still have a long way to go.