Humanity's Last Exam (HLE)

★ New · assess · AI / ML · open-source · CC-BY-4.0

What It Does

Humanity’s Last Exam (HLE) is a multi-modal benchmark of 2,500 expert-level academic questions, intended as the hardest broad-coverage closed-ended evaluation for AI systems. It was created through a global collaborative effort by nearly 1,000 subject-matter experts (mostly professors and researchers) affiliated with over 500 institutions across 50 countries, and it was explicitly designed to resist benchmark saturation: any question that an AI system could answer during curation was removed.

Published in Nature (volume 649, pp. 1139-1146, 2026), HLE covers over 100 subjects spanning mathematics, the humanities, the natural sciences, and specialized professional domains. Questions mix multiple-choice and short-answer formats: they can be graded automatically, yet they demand deep domain expertise that simple internet retrieval cannot supply.
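
For orientation, a minimal loading sketch, assuming the dataset is distributed on Hugging Face under the cais/hle identifier and exposes question, answer_type, and image fields; verify the names against the dataset card before relying on them:

```python
# Minimal sketch: load HLE and inspect the question mix.
# Assumes the Hugging Face release "cais/hle"; the field names
# (answer_type, image) follow the public dataset card and should
# be verified before use.
from collections import Counter

from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")

# Tally formats (multiple-choice vs. exact-match short answer)
# and count rows that carry an image payload (multi-modal items).
formats = Counter(row["answer_type"] for row in hle)
with_images = sum(1 for row in hle if row.get("image"))

print(f"{len(hle)} questions, formats: {dict(formats)}, "
      f"{with_images} with images")
```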

Key Features

  • 2,500 questions across 100+ academic subjects, curated by ~1,000 domain experts from 500+ institutions in 50 countries
  • Multi-modal: includes questions requiring image, diagram, and mathematical notation interpretation
  • Adversarial filtering: questions answerable by any AI system during curation were removed
  • Mixed format: multiple-choice and short-answer with unambiguous, verifiable solutions (a minimal grading sketch follows this list)
  • Published in Nature (2026), providing peer-reviewed scientific credibility
  • Leaderboard hosted by Scale AI and the Center for AI Safety (CAIS)
  • Designed to measure the gap between current AI and comprehensive expert-level knowledge
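
Because every item has a single verifiable answer, a deterministic grader covers much of the format. A minimal sketch, assuming the two answer types multipleChoice and exactMatch from the public schema; the official pipeline additionally relies on an LLM judge to accept paraphrased short answers, which this sketch omits:

```python
# Minimal deterministic grader for HLE-style items.
# Assumes two answer types, "multipleChoice" and "exactMatch";
# the official pipeline also uses an LLM judge to handle
# paraphrased short answers, which this sketch omits.
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().strip().split())

def grade(response: str, gold: str, answer_type: str) -> bool:
    if answer_type == "multipleChoice":
        # Compare only the chosen option letter, e.g. "B".
        return normalize(response)[:1] == normalize(gold)[:1]
    # Short answer: exact match after normalization.
    return normalize(response) == normalize(gold)

assert grade("B", "b", "multipleChoice")
assert grade(" 42 ", "42", "exactMatch")
```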

Use Cases

  • Frontier model evaluation: Discriminating between state-of-the-art models where MMLU and MMLU-Pro are saturated (frontier models score 40-50% on HLE vs. 90%+ on MMLU)
  • AI progress tracking: Monitoring how quickly models close the gap to expert human performance across diverse domains (see the per-subject aggregation sketch after this list)
  • Safety threshold monitoring: Serving as an indicator of when AI systems achieve comprehensive expert-level knowledge
  • Research on evaluation methodology: Studying how expert-curated benchmarks resist saturation compared to crowdsourced ones
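
For progress tracking, per-subject breakdowns are more informative than a single headline score. A sketch, assuming a hypothetical results list of (subject, correct) pairs produced by a grading pass like the one above:

```python
# Sketch: aggregate graded results by subject so progress toward
# expert-level coverage is visible per domain, not just overall.
# `results` is a hypothetical list of (subject, correct) pairs.
from collections import defaultdict

def per_subject_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    buckets: dict[str, list[bool]] = defaultdict(list)
    for subject, correct in results:
        buckets[subject].append(correct)
    return {s: sum(v) / len(v) for s, v in buckets.items()}

print(per_subject_accuracy([
    ("Mathematics", True),
    ("Mathematics", False),
    ("Chemistry", True),
]))  # {'Mathematics': 0.5, 'Chemistry': 1.0}
```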

Adoption Level Analysis

Small teams (<20 engineers): Accessible — the dataset is publicly available. However, running evaluations on frontier models is expensive, and the benchmark is designed for the most capable models. Small teams working with smaller models will see very low scores with limited discriminating value.

Medium orgs (20-200 engineers): Useful as part of an evaluation suite for teams building or fine-tuning large models. Provides genuine differentiation where MMLU cannot.

Enterprise (200+ engineers): Highly relevant for frontier AI labs evaluating new model releases. Cited by OpenAI, Anthropic, and Google DeepMind in capability assessments. Referenced by policymakers assessing AI progress.

Alternatives

| Alternative | Key Difference | Prefer when… |
| --- | --- | --- |
| MMLU / MMLU-Pro | Broader but easier; saturated at frontier | You need a quick baseline, not frontier discrimination |
| GPQA (Diamond) | Graduate-level science only, smaller | You need focused evaluation of scientific reasoning |
| FrontierMath (Epoch AI) | Pure mathematics, extremely hard | You need mathematical reasoning evaluation specifically |
| HCAST (METR) | Agentic software tasks, not knowledge Q&A | You need to evaluate autonomous task completion ability |
| ATLAS | Multidisciplinary frontier scientific reasoning | You need scientific reasoning across multiple fields |

Evidence & Sources

  • Humanity’s Last Exam, Nature 649, 1139-1146 (2026): the peer-reviewed publication describing the benchmark’s design and baseline results.

Notes & Caveats

  • Will eventually saturate too: Despite being designed as “the last exam,” HLE will inevitably saturate as models improve. Early results showed GPT-4o at 2.7% and o1 at 8%, but by early 2026 frontier models (Gemini 3.1 Pro, Claude Opus 4.6) reached 40-50%. At this rate, saturation within 1-2 years is plausible. The name is aspirational, not prophetic.
  • Closed-ended format limitation: Like MMLU, HLE uses questions with single correct answers. This format cannot assess open-ended reasoning, creative problem-solving, or multi-step autonomous work.
  • Expert curation is expensive to repeat: The scale of the expert curation effort (1,000 contributors across 500 institutions) makes it difficult to create successor benchmarks at the same quality level. This is a one-shot effort, not a repeatable process.
  • Potential data contamination over time: As HLE questions circulate (the dataset is public), the risk of training data contamination increases. Unlike HCAST, which verifies task completion programmatically, HLE relies on knowledge-based questions that are more susceptible to memorization.
  • Multi-modal questions require specific model capabilities: Not all models support image/diagram interpretation, making direct comparison across model families uneven (see the filtering sketch after this list).
  • Adversarial filtering creates a moving target: Because questions that models could answer during curation were removed, the benchmark is calibrated to a specific moment in AI capability. This is a feature (it guarantees difficulty) but also means the benchmark’s absolute difficulty level is historically contingent.
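
One mitigation for uneven multi-modal support is to report text-only and multi-modal subsets separately. A sketch, again assuming the cais/hle Hugging Face release with an image field that is empty for text-only items:

```python
# Sketch: split HLE into text-only and multi-modal subsets so
# models without vision support are compared on a common pool.
# Assumes the "cais/hle" release and an "image" field that is
# empty/None for text-only questions (verify against the card).
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")

text_only = hle.filter(lambda row: not row.get("image"))
multi_modal = hle.filter(lambda row: bool(row.get("image")))

print(f"text-only: {len(text_only)}, multi-modal: {len(multi_modal)}")
```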