OpenAI’s DeepResearch Achieves Landmark 26.6% Score on ‘Humanity’s Last Exam’

Artificial intelligence is making significant strides toward human-level expertise. OpenAI’s autonomous AI agent, DeepResearch, has set a new record by scoring 26.6% on ‘Humanity’s Last Exam’, a benchmark designed to test AI on expert-level questions across multiple disciplines.

A Groundbreaking Milestone in AI Research

Powered by OpenAI’s frontier o3 model, DeepResearch synthesizes large amounts of information and works through complex, multi-step problems in five to thirty minutes.

The benchmark comprises 3,000+ questions covering topics as diverse as rocket science, mathematics, humanities, and analytic philosophy.

Previously, OpenAI’s o1 model and DeepSeek’s R1 model led the leaderboard, but each scored only about 9% on the exam.

DeepResearch’s near three-fold improvement demonstrates AI’s accelerated progress, especially in fields like chemistry, social sciences, and mathematics.

Comparing AI Performance on ‘Humanity’s Last Exam’

AI Model	Score on ‘Humanity’s Last Exam’
OpenAI’s DeepResearch	26.6%
OpenAI’s o1	approx. 9%
DeepSeek’s R1	approx. 9%
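
For readers who want to check the "near three-fold" figure, the arithmetic can be reproduced in a few lines of Python. This is an illustrative sketch only: the variable and function names are invented for this example, and the 9% baseline is the approximate figure reported for o1 and R1, so the result is itself approximate.

```python
# Scores as reported in this article (percent of the exam answered correctly).
# The 9.0 values are approximations; the article gives "approximately 9%".
scores = {
    "DeepResearch": 26.6,
    "o1": 9.0,
    "R1": 9.0,
}


def improvement_factor(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    if old <= 0:
        raise ValueError("baseline score must be positive")
    return new / old


# Compare against the better of the two previous leaders.
baseline = max(scores["o1"], scores["R1"])
factor = improvement_factor(scores["DeepResearch"], baseline)
gain_points = scores["DeepResearch"] - baseline

print(f"{factor:.2f}x the previous best, +{gain_points:.1f} percentage points")
# prints: 2.96x the previous best, +17.6 percentage points
```

A ratio of roughly 2.96 is what the "near three-fold" description refers to; in absolute terms the gain is about 17.6 percentage points.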

Surpassing Previous AI Benchmarks

Frank Downing, Director of Research at Ark Invest, highlighted that OpenAI’s DeepResearch also secured a record-breaking score on GAIA – a test designed for AI assistants to tackle real-world queries that are simple for humans but challenging for AI.

DeepResearch’s superior analytical and research abilities have positioned it as a dominant player in the field, outperforming Google’s latest AI model launched in December.

Even so, Downing suggests that these results might soon appear trivial if AI continues to progress at this pace.

He predicts that ‘Humanity’s Last Exam’ could be fully mastered within 12 months, signifying a pivotal moment where AI surpasses expert-level knowledge and reasoning.

What is ‘Humanity’s Last Exam’?

The benchmark was developed by Dan Hendrycks, Director of the Center for AI Safety, in collaboration with Scale AI and other experts.

Inspired by a conversation with Elon Musk, Hendrycks sought to create an exam more rigorous than previous AI tests, such as the Massive Multitask Language Understanding (MMLU) exam.

Musk criticized MMLU for being “undergrad level” and proposed a test requiring expertise at the level of world-class professionals.

The resulting ‘Humanity’s Last Exam’ is designed as the ultimate closed-ended academic test, incorporating questions submitted by renowned college professors, prize-winning mathematicians, and field specialists.

Hendrycks emphasized that the test focuses on high-level mathematical reasoning, a critical skill applicable across various academic disciplines.

He stated that once AI models achieve a 50%+ score, it will mark the dawn of artificial general intelligence (AGI)—a state where machines possess human-like cognitive abilities.

The Implications of Reaching AGI

The rapid improvement of DeepResearch has sparked discussions on the imminence of AGI. OpenAI’s CEO, Sam Altman, expressed confidence in the organization’s ability to construct AGI based on existing research and advancements.

Google DeepMind’s CEO, Demis Hassabis, echoed similar sentiments, predicting that AGI could emerge within five years.

Speaking at the AI Action Summit in Paris, Hassabis urged global leaders to prepare for the societal and economic impact of this technological evolution.

Conclusion

OpenAI’s DeepResearch has redefined the landscape of AI-driven knowledge synthesis, achieving an unprecedented 26.6% score on ‘Humanity’s Last Exam.’

As AI rapidly advances, the possibility of AGI surpassing human intellect is no longer a distant theory but a prospect that could materialize within the next decade.

Whether this development will bring breakthroughs in scientific discovery or present ethical and existential challenges remains to be seen. One thing is certain: the future of AI is closer than ever.

FAQs

1. What is OpenAI’s DeepResearch?

DeepResearch is an advanced autonomous AI agent developed by OpenAI, designed to synthesize complex information and conduct high-level research across multiple disciplines.

2. What is ‘Humanity’s Last Exam’?

‘Humanity’s Last Exam’ is a benchmark test designed to assess AI’s ability to answer expert-level questions. It comprises 3,000+ questions across a wide range of disciplines and was developed by Dan Hendrycks and his team.

3. How did DeepResearch perform compared to other AI models?

DeepResearch achieved a 26.6% score, significantly outperforming OpenAI’s o1 and DeepSeek’s R1 models, both of which scored approximately 9%.