Imagine this: You’re sitting in a room, staring at a test so challenging that even the brightest minds would need to bring their A-game. Now, replace the human test-taker with an artificial intelligence system. Sounds like science fiction? It’s not. Welcome to Humanity’s Last Exam (HLE), a benchmark designed to push AI to its limits by testing its ability to reason at an expert human level across diverse disciplines.
But here’s the twist: Just as HLE was making waves, OpenAI introduced Deep Research, a new AI agent capable of tackling complex, multi-step research tasks. Could tools like Deep Research help AI systems rise to the challenge posed by benchmarks like HLE? Or does the gap between human expertise and machine reasoning remain too vast to bridge?
In this article, we’ll explore what makes HLE a groundbreaking benchmark, examine its implications for AI research, and discuss how innovations like Deep Research might reshape the landscape.
Understanding Humanity’s Last Exam: The Most Complex AI Reasoning Test
At its core, HLE is unlike any other AI benchmark. With roughly 3,000 multiple-choice and short-answer questions spanning more than 100 academic subjects, it tests AI’s ability to reason across disciplines as varied as quantum mechanics, moral philosophy, and art history. These aren’t simple trivia questions: each has a precise, verifiable answer, but reaching it demands creativity, abstract thinking, and multi-step logical reasoning.
Here are a few examples illustrating the breadth of topics HLE covers:
- Ethics and Technology: “A city council is debating whether to implement facial recognition software for public safety. Discuss the ethical, legal, and societal implications of this decision.”
- Interdisciplinary Problem-Solving: “Analyze the relationship between deforestation rates and global food supply chains, and propose solutions to mitigate negative impacts.”
- Multi-Modal Reasoning: “Examine the following graph showing carbon emissions over time and explain how economic policies influenced the trends observed.”
Many questions are multi-modal, requiring AI systems to process text, images, diagrams, or charts simultaneously. For instance, an AI might be asked to interpret a painting and explain its historical significance while cross-referencing textual descriptions of the artist’s life. This mirrors real-world problem-solving, where humans seamlessly integrate different types of information. However, AI struggles with these tasks because it lacks the holistic understanding and adaptability that come naturally to humans.
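Benchmarks of this kind typically store each item as a structured record (question text, optional image attachments, subject tag, and a gold answer) and grade model output by normalized exact match. Here is a minimal sketch of that shape; the field names and schema are hypothetical, not HLE's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkItem:
    # Hypothetical schema for illustration; the real HLE format may differ.
    question: str
    answer: str                                       # gold answer for grading
    subject: str = "general"
    image_paths: list = field(default_factory=list)   # multi-modal attachments

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting quirks don't count as errors."""
    return " ".join(text.lower().split())

def grade(item: BenchmarkItem, model_output: str) -> bool:
    """Exact-match grading after normalization."""
    return normalize(model_output) == normalize(item.answer)

item = BenchmarkItem(question="What is the SI unit of force?",
                     answer="newton", subject="physics")
print(grade(item, " Newton "))  # True
```

Normalized exact match keeps grading automatic and objective, which is what makes a 3,000-question benchmark tractable to score at all.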
The benchmark was collaboratively developed by Scale AI and the Center for AI Safety (CAIS), ensuring its rigor and relevance. Its goal isn’t just to evaluate performance but to probe the very edges of AI’s cognitive capabilities.
How Are AI Systems Performing on HLE?
Despite recent advancements in large language models and multi-modal architectures, leading AI systems have struggled to achieve high accuracy scores on HLE. Here’s a breakdown of how some of the top models performed:
| Model | Accuracy (%) |
|---|---|
| GPT-4o | 3.3 |
| Grok-2 | 3.8 |
| Claude 3.5 Sonnet | 4.3 |
| Gemini Thinking | 6.2 |
| OpenAI o1 | 9.1 |
| DeepSeek-R1* | 9.4 |
| OpenAI o3-mini (medium)* | 10.5 |
| OpenAI o3-mini (high)* | 13.0 |
*Models marked with an asterisk (\*) are not multi-modal and were evaluated on text-only subsets.*
While these scores represent incremental progress, they also highlight the immense difficulty of replicating human-level reasoning.
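The accuracy figures in the table are simple ratios: correctly answered questions divided by total questions, expressed as a percentage. A toy illustration with invented answers (this is not the real evaluation harness):

```python
def accuracy(predictions, gold):
    """Percentage of predictions that exactly match the gold answers (case-insensitive)."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

# Invented example: the model gets 2 of 3 questions right.
gold  = ["newton", "1867", "mitochondria"]
preds = ["Newton", "1869", "mitochondria"]
print(f"{accuracy(preds, gold):.1f}%")  # 66.7%
```

Seen through this metric, even the best score above means the model answers fewer than one question in seven correctly.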
OpenAI’s Deep Research: Revolutionizing Multi-Agent AI Problem Solving
OpenAI’s Deep Research has made headlines with its groundbreaking performance on HLE, achieving a new high score of 25.3%—a significant leap from GPT-4o’s previous score of just 3.3%. This achievement marks a pivotal moment in AI research, showcasing the potential of multi-agent systems to tackle complex, expert-level reasoning tasks across diverse disciplines.
1. What It Gets Right
Deep Research’s score of 25.3% represents a substantial improvement over existing models, particularly in areas like chemistry, humanities and social sciences, and mathematics. Here’s a closer look at its accomplishments:
- Chemistry: Deep Research demonstrated an ability to solve intricate chemical equations, analyze molecular structures, and predict reaction outcomes—a testament to its enhanced capacity for logical reasoning in scientific domains.
- Humanities and Social Sciences: The model excelled in interpreting historical texts, analyzing philosophical arguments, and addressing ethical dilemmas. For example, it provided nuanced responses to questions about the societal implications of emerging technologies like facial recognition and gene editing.
- Mathematics: Deep Research showed marked improvement in solving abstract mathematical problems, including those requiring creative leaps beyond algorithmic computation.
One of Deep Research’s standout features is its human-like approach to problem-solving. When faced with unfamiliar or highly specialized topics, it effectively seeks out relevant information, mimicking the way humans consult external resources to fill knowledge gaps. This capability sets it apart from traditional models that rely solely on pre-trained data.
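OpenAI has not published Deep Research's internals, but the lookup behavior described above resembles a retrieve-then-answer loop: try pre-trained knowledge first, and only consult an external source when a gap appears. A schematic sketch, where `search` and the word-overlap retriever are stand-in stubs of my own invention:

```python
def search(query: str, corpus: dict) -> str:
    """Stand-in retriever: return the corpus entry whose key shares the most words with the query."""
    def overlap(key: str) -> int:
        return len(set(key.lower().split()) & set(query.lower().split()))
    return corpus[max(corpus, key=overlap)]

def answer_with_lookup(question: str, known: dict, corpus: dict) -> str:
    """If the 'model' lacks the fact, consult an external source first, mimicking human lookup."""
    if question in known:                  # covered by pre-trained knowledge
        return known[question]
    context = search(question, corpus)     # fill the gap from outside
    return f"Based on retrieved context: {context}"

corpus = {"melting point of gallium": "Gallium melts at 29.76 °C."}
print(answer_with_lookup("melting point of gallium", known={}, corpus=corpus))
```

The design choice this caricatures is the important one: a model that can decide *when* to look something up behaves very differently from one limited to whatever its weights already encode.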
2. Where It Falls Short
Despite its impressive performance, Deep Research still falls far short of human-level expertise. A score of 25.3% underscores the immense difficulty of replicating the depth and breadth of human reasoning. Key challenges include:
- Ambiguity and Open-Ended Questions: Deep Research struggles with questions that involve subjective judgment or require creative, out-of-the-box thinking. For instance, ethical dilemmas that demand balancing competing values often leave the system faltering.
- Multi-Modal Reasoning: While it performs better than previous models in integrating text, images, and diagrams, it still encounters difficulties with highly abstract or symbolic content—a hallmark of many HLE questions.
- Interdisciplinary Synthesis: Although Deep Research shows promise in combining insights from different fields, its ability to seamlessly integrate knowledge across vastly different domains remains limited.
3. How Does It Compare to Other Models?
The difference between Deep Research and earlier models like GPT-4o is stark. While GPT-4o is optimized for generating text-based responses, it often falters when faced with tasks requiring sustained, multi-step reasoning. Deep Research, on the other hand, leverages a collaborative framework where multiple agents work together to solve problems iteratively. This makes it particularly effective for tasks like scientific research, policy analysis, and data synthesis.
However, this collaborative approach also introduces new challenges. Coordinating multiple agents requires significant computational resources, and errors in one stage of the process can cascade, leading to inaccurate conclusions. Additionally, while Deep Research outperforms GPT-4o in structured, data-driven tasks, it still lags in areas requiring deep contextual understanding, such as art history or moral philosophy.
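The coordination pattern described here can be caricatured as a propose-critique-revise loop: one agent drafts, another flags problems, a third patches the draft, and the cycle repeats until the critic is satisfied. The sketch below uses plain functions as stub agents; the real system is undoubtedly far more elaborate:

```python
def proposer(task: str) -> str:
    """Agent 1: drafts an initial answer (stub)."""
    return f"draft answer to: {task}"

def critic(draft: str) -> list:
    """Agent 2: flags problems (stub: complains if no evidence is cited)."""
    return [] if "evidence:" in draft else ["missing supporting evidence"]

def reviser(draft: str, issues: list) -> str:
    """Agent 3: patches the draft to address the flagged issues (stub)."""
    return draft + " evidence: [source]"

def solve(task: str, max_rounds: int = 3) -> str:
    draft = proposer(task)
    for _ in range(max_rounds):            # iterate until the critic is satisfied
        issues = critic(draft)
        if not issues:
            break
        draft = reviser(draft, issues)     # an error that slips past the critic cascades downstream
    return draft

print(solve("summarize deforestation impacts"))
```

Even in this toy form, the cascading-error risk is visible: every later stage trusts the output of the one before it, so a flawed draft the critic misses is a flaw in the final answer.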
Why Does Humanity’s Last Exam Matter?
HLE exists to answer a critical question: Can AI genuinely understand the world as deeply as humans do? According to Dan Hendrycks, co-founder of CAIS, the goal of HLE is “to test the limits of AI knowledge at the frontiers of human expertise.”
The implications of HLE extend beyond academia. By identifying areas where AI falls short, it provides a roadmap for innovation. For example:
- Poor performance on ethical dilemmas suggests the need to enhance AI’s capacity for moral reasoning.
- Struggles with multi-modal questions point to opportunities for improving integration of visual and linguistic processing.
Moreover, HLE influences discussions about AI safety and alignment. If AI systems can’t reliably demonstrate expert-level reasoning, how can we trust them to make decisions affecting millions of lives? By pushing AI to confront its limitations, HLE encourages transparency, accountability, and continuous improvement.
Real-World Applications: Why Passing HLE Matters
If AI systems could pass HLE, the implications for fields like healthcare, law, and policymaking would be transformative. Here are a few specific examples:
- Healthcare: Imagine an AI system capable of synthesizing insights from genetics, psychology, and sociology to develop personalized treatment plans for patients with rare diseases. Passing HLE would mean AI could handle the interdisciplinary reasoning required for such tasks.
- Law: An AI that passes HLE could analyze case law, statutes, and ethical considerations to provide nuanced legal advice—something current systems struggle with due to their inability to grasp context and nuance.
- Policymaking: Policymakers often face complex, multi-faceted problems like climate change or income inequality. An AI that excels at HLE could help craft evidence-based policies by integrating data from economics, sociology, and environmental science.
These scenarios illustrate why passing HLE isn’t just an academic exercise—it’s a step toward creating AI systems that can truly augment human decision-making in critical areas.
Future Directions: Bridging the Gap
To address the challenges identified by HLE, researchers are pursuing several promising directions:
- Hybrid Models: Combining symbolic reasoning with neural networks to improve abstract thinking and logical reasoning.
- Multi-Agent Systems: Building collaborative AI systems like Deep Research to tackle complex, multi-step tasks.
- Ethical Frameworks: Developing frameworks to guide AI decision-making in morally ambiguous situations.
- Interdisciplinary Training: Training AI on datasets that span multiple domains to enhance cross-disciplinary reasoning.
These initiatives hold the potential to narrow the gap between human expertise and machine reasoning—but significant challenges remain.
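The hybrid-model idea from the list above can be illustrated as a generate-and-verify loop: a "neural" component proposes plausible candidates cheaply, and a symbolic checker accepts only those that exactly satisfy a hard constraint. Both components below are deliberately trivial stubs, not a real neuro-symbolic system:

```python
def propose_candidates(problem):
    """'Neural' stub: fast, fallible guesses in a plausible range."""
    a, b = problem
    guess = a + b
    return [guess - 1, guess, guess + 1]   # noisy candidates around a rough estimate

def symbolic_check(problem, candidate) -> bool:
    """Symbolic stub: verify the candidate exactly against the constraint."""
    a, b = problem
    return candidate == a + b

def hybrid_solve(problem):
    for c in propose_candidates(problem):  # cheap generation
        if symbolic_check(problem, c):     # exact verification
            return c
    return None

print(hybrid_solve((7, 9)))  # 16
```

The appeal of the hybrid pattern is exactly this division of labor: the generator supplies breadth and intuition, while the verifier supplies the rigor that pure pattern-matching lacks.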
Looking Ahead: What’s Next for AI?
Humanity’s Last Exam pushes the boundaries of AI, revealing both its strengths and limitations. By providing a rigorous test of expert-level reasoning, HLE highlights the work still needed to create AI systems that can match human intellect. Innovations like Deep Research offer glimpses of progress, but the journey is far from over.
Will we see a day when AI achieves parity with human experts across all domains? Or will there always be dimensions of human thought—creativity, empathy, intuition—that remain uniquely ours? As AI continues to evolve, HLE stands as both a challenge and a promise: a challenge to build machines that can truly think like humans, and a promise to unlock new possibilities along the way.
Conclusion
The verdict, for now, is sobering: even the most capable systems, Deep Research included, answer only a fraction of HLE correctly, and the benchmark’s real value lies in making that gap precisely measurable.
As we look to the future, one question lingers: If AI ever does pass Humanity’s Last Exam, will we still recognize ourselves in the reflection?