What It Does
MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating large language models' knowledge and reasoning across 57 academic subjects, including STEM, humanities, social sciences, and professional domains. Created by Dan Hendrycks et al. and published at ICLR 2021, it consists of 15,908 multiple-choice questions (four answer choices each) drawn from freely available practice exams and academic materials.
MMLU became the de facto standard for comparing LLM capabilities between 2021 and 2024, appearing in virtually every model release announcement. It is now effectively saturated: frontier models score above 90%, and a documented 6.5% question error rate caps the maximum meaningful score at approximately 93-94%.
Key Features
- 15,908 multiple-choice questions across 57 subjects spanning elementary through professional difficulty
- Subjects include abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, computer security, formal logic, global facts, jurisprudence, machine learning, moral scenarios, philosophy, virology, and more
- Zero-shot and few-shot evaluation protocols
- Standardized splits: dev (5 examples per subject, used as few-shot exemplars), validation (1,531 questions), and test (14,042 questions)
- Hosted on Hugging Face for easy programmatic access
- Widely integrated into evaluation harnesses (lm-evaluation-harness, Inspect AI, etc.)
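The few-shot protocol above is straightforward to reproduce. A minimal sketch of building a k-shot MMLU prompt follows; in practice the rows come from Hugging Face (e.g. `datasets.load_dataset("cais/mmlu", "abstract_algebra")`), but the records below are hypothetical stand-ins with the same field names (`question`, `choices`, `answer`), and the header line follows the format used in the original paper:

```python
LETTERS = "ABCD"

def format_question(row, with_answer=True):
    """Render one MMLU-shaped row in the standard question/choices/answer layout."""
    lines = [row["question"]]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"])]
    lines.append("Answer: " + (LETTERS[row["answer"]] if with_answer else ""))
    return "\n".join(lines)

def build_prompt(dev_rows, test_row, subject):
    """k-shot prompt: k solved dev examples, then the unanswered test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n"
    shots = "\n\n".join(format_question(r) for r in dev_rows)
    return header + "\n" + shots + "\n\n" + format_question(test_row, with_answer=False)

# Hypothetical rows for illustration only (not actual MMLU items)
dev = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}]
test = {"question": "What is 3 x 3?", "choices": ["6", "9", "12", "8"], "answer": 1}
print(build_prompt(dev, test, "elementary mathematics"))
```

The model's answer is typically scored by comparing its first generated letter (or the highest-likelihood choice token) against the gold letter.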
Use Cases
- Historical comparison: Tracking LLM progress from GPT-3 (43.9%) through GPT-4 (86.4%) to current models (90%+)
- Baseline capability check: Quick sanity test that a model has broad knowledge coverage
- Research reference: Citing as a standard metric in academic papers (though increasingly supplemented by harder benchmarks)
- NOT recommended for: Distinguishing frontier models from each other, evaluating reasoning depth, or assessing real-world task completion ability
Adoption Level Analysis
Small teams (<20 engineers): Trivially easy to run — the dataset is freely available on Hugging Face, and evaluation scripts exist in every major framework. Useful as a quick baseline check but provides no discriminating power for frontier models.
Medium orgs (20-200 engineers): Included in standard evaluation suites by default. Teams should supplement with harder benchmarks (MMLU-Pro, HLE, domain-specific evals) for meaningful differentiation.
Enterprise (200+ engineers): Still reported by major labs for historical continuity, but no serious evaluation relies on MMLU alone. Labs running frontier model evaluations have moved to MMLU-Pro, GPQA, HLE, and task-based benchmarks like HCAST.
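A back-of-envelope calculation clarifies why MMLU has lost discriminating power at the top. The numbers below are assumptions drawn from this document: a 14,042-question test split, two models at 90% and 91%, and the ~93.5% ceiling implied by the 6.5% error rate:

```python
import math

N = 14042            # assumed MMLU test-split size
p_a = 0.90           # hypothetical frontier-model accuracy

# 95% binomial half-width on a single measured score
half_width = 1.96 * math.sqrt(p_a * (1 - p_a) / N)
print(f"95% CI half-width at 90%: +/-{half_width:.3%}")   # roughly +/-0.5 pp

# Headroom left before the error-rate ceiling
ceiling = 1 - 0.065
print(f"Remaining headroom from 90%: {ceiling - p_a:.1%}")  # ~3.5 pp
```

The statistical noise per run is small, but the remaining headroom is only a few points, on the same order as the 4-5 point prompt-format variance noted under Notes & Caveats, so score gaps between frontier models are dominated by measurement artifacts rather than capability.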
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| MMLU-Pro | 10 answer choices, harder questions, less saturation | You need knowledge evaluation with more headroom (but also approaching saturation) |
| Humanity’s Last Exam (HLE) | Expert-curated, 2,500 questions, frontier models score 40-50% | You need a benchmark that still discriminates between frontier models |
| GPQA | Graduate-level science Q&A with diamond-hard subset | You need evaluation of deep domain expertise |
| ARC (AI2 Reasoning Challenge) | Grade-school science reasoning | You need a simpler reasoning benchmark for weaker models |
| HCAST (METR) | Agentic software tasks with human calibration | You need to evaluate autonomous task completion, not knowledge |
Evidence & Sources
- Are We Done with MMLU? (arXiv: 2406.04127) — MMLU-Redux study documenting 6.5% error rate
- Errors in the MMLU (Daniel Erenrich, Medium)
- MMLU Wikipedia
- Benchmark Saturation: AI Evaluation Metrics and Ceiling Effects (Brenndoerfer)
- Mapping global dynamics of benchmark creation and saturation in AI (Nature Communications)
Notes & Caveats
- Saturated since 2024: Frontier models score 88-93%, making MMLU unable to discriminate between them. Further “improvements” increasingly reflect memorization of incorrect ground-truth labels rather than genuine capability gains.
- 6.5% error rate: The MMLU-Redux study found that 6.5% of questions have errors (wrong answers, ambiguous questions, multiple correct answers). Some subsets are far worse: 57% of Virology questions were flagged as erroneous.
- Prompt sensitivity: Model scores can vary 4-5 percentage points depending on prompt format. GPT-4o showed a 13 percentage point variance on MMLU-Pro across different measurement sources.
- Data contamination: As one of the most widely used benchmarks, MMLU questions have likely been seen by many models during pretraining. This makes score comparisons across model generations unreliable.
- Knowledge vs. capability: MMLU measures factual recall and basic reasoning within a multiple-choice format. It says nothing about a model’s ability to complete tasks, follow complex instructions, or produce extended outputs.
- Successor treadmill: MMLU-Pro was created to address saturation but is itself approaching saturation (frontier models at ~90%). MMLU-ProX extends to 29 languages. This pattern of replacement benchmarks saturating within 1-2 years appears structural.
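The interaction between the error rate and memorization can be made concrete with a toy model (an illustration assumed here, not from the MMLU-Redux paper): let e be the fraction of erroneous questions; a perfect reasoner disagrees with the bad keys, while a key-memorizing model reproduces them.

```python
e = 0.065  # fraction of questions with erroneous ground-truth labels

def observed_score(clean_acc, key_match_rate):
    """Accuracy against the official key: correct on clean items, plus
    'credit' for reproducing incorrect keys on the erroneous ones."""
    return clean_acc * (1 - e) + key_match_rate * e

print(observed_score(1.0, 0.0))   # perfect reasoner caps out near 0.935
print(observed_score(1.0, 1.0))   # memorizing bad keys too scores 1.0
```

This is why scores above the ~93-94% ceiling are suspect: the only way past it is to agree with labels that are wrong.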