What It Does
MMLU (Massive Multitask Language Understanding) is a benchmark for evaluating large language models' knowledge and reasoning across 57 academic subjects, including STEM, humanities, social sciences, and professional domains. Created by Dan Hendrycks et al. and published at ICLR 2021, it consists of 15,908 multiple-choice questions (four answer choices each) drawn from freely available practice exams and academic materials.
MMLU became the de facto standard for comparing LLM capabilities between 2021 and 2024, appearing in virtually every model release announcement. It is now effectively saturated: frontier models score above 90%, and a documented 6.5% question error rate caps the maximum meaningful score at approximately 93-94%.
Key Features
- 15,908 multiple-choice questions across 57 subjects spanning elementary through professional difficulty
- Subjects include abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, computer security, formal logic, global facts, jurisprudence, machine learning, moral scenarios, philosophy, virology, and more
- Zero-shot and few-shot evaluation protocols
- Standardized splits: dev (5 examples per subject, used as few-shot exemplars), validation (1,531 questions), and test (14,042 questions)
- Hosted on Hugging Face for easy programmatic access
- Widely integrated into evaluation harnesses (lm-evaluation-harness, Inspect AI, etc.)
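The few-shot protocol above is straightforward to reproduce. A minimal sketch of building a k-shot MMLU prompt follows; in practice the rows come from Hugging Face (e.g. `datasets.load_dataset("cais/mmlu", "abstract_algebra")`), but the records below are hypothetical stand-ins with the same field names (`question`, `choices`, `answer`), and the header line follows the format used in the original paper:

```python
LETTERS = "ABCD"

def format_question(row, with_answer=True):
    """Render one MMLU-shaped row in the standard question/choices/answer layout."""
    lines = [row["question"]]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"])]
    lines.append("Answer: " + (LETTERS[row["answer"]] if with_answer else ""))
    return "\n".join(lines)

def build_prompt(dev_rows, test_row, subject):
    """k-shot prompt: k solved dev examples, then the unanswered test item."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n"
    shots = "\n\n".join(format_question(r) for r in dev_rows)
    return header + "\n" + shots + "\n\n" + format_question(test_row, with_answer=False)

# Hypothetical rows for illustration only (not actual MMLU items)
dev = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": 1}]
test = {"question": "What is 3 x 3?", "choices": ["6", "9", "12", "8"], "answer": 1}
print(build_prompt(dev, test, "elementary mathematics"))
```

The model's answer is typically scored by comparing its first generated letter (or the highest-likelihood choice token) against the gold letter.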
Use Cases
- Historical comparison: Tracking LLM progress from GPT-3 (43.9%) through GPT-4 (86.4%) to current models (90%+)
- Baseline capability check: Quick sanity test that a model has broad knowledge coverage
- Research reference: Citing as a standard metric in academic papers (though increasingly supplemented by harder benchmarks)
- NOT recommended for: Distinguishing frontier models from each other, evaluating reasoning depth, or assessing real-world task completion ability
Adoption Level Analysis
Small teams (<20 engineers): Trivially easy to run — the dataset is freely available on Hugging Face, and evaluation scripts exist in every major framework. Useful as a quick baseline check but provides no discriminating power for frontier models.
Medium orgs (20-200 engineers): Included in standard evaluation suites by default. Teams should supplement with harder benchmarks (MMLU-Pro, HLE, domain-specific evals) for meaningful differentiation.
Enterprise (200+ engineers): Still reported by major labs for historical continuity, but no serious evaluation relies on MMLU alone. Labs running frontier model evaluations have moved to MMLU-Pro, GPQA, HLE, and task-based benchmarks like HCAST.
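A back-of-envelope calculation clarifies why MMLU has lost discriminating power at the top. The numbers below are assumptions drawn from this document: a 14,042-question test split, two models at 90% and 91%, and the ~93.5% ceiling implied by the 6.5% error rate:

```python
import math

N = 14042            # assumed MMLU test-split size
p_a = 0.90           # hypothetical frontier-model accuracy

# 95% binomial half-width on a single measured score
half_width = 1.96 * math.sqrt(p_a * (1 - p_a) / N)
print(f"95% CI half-width at 90%: +/-{half_width:.3%}")   # roughly +/-0.5 pp

# Headroom left before the error-rate ceiling
ceiling = 1 - 0.065
print(f"Remaining headroom from 90%: {ceiling - p_a:.1%}")  # ~3.5 pp
```

The statistical noise per run is small, but the remaining headroom is only a few points, on the same order as the 4-5 point prompt-format variance noted under Notes & Caveats, so score gaps between frontier models are dominated by measurement artifacts rather than capability.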
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| MMLU-Pro | 10 answer choices, harder questions, less saturation | You need knowledge evaluation with more headroom (but also approaching saturation) |
| Humanity’s Last Exam (HLE) | Expert-curated, 2,500 questions, frontier models score 40-50% | You need a benchmark that still discriminates between frontier models |
| GPQA | Graduate-level science Q&A with diamond-hard subset | You need evaluation of deep domain expertise |
| ARC (AI2 Reasoning Challenge) | Grade-school science reasoning | You need a simpler reasoning benchmark for weaker models |
| HCAST (METR) | Agentic software tasks with human calibration | You need to evaluate autonomous task completion, not knowledge |
Evidence & Sources
- Are We Done with MMLU? (arXiv: 2406.04127) — MMLU-Redux study documenting 6.5% error rate
- Errors in the MMLU (Daniel Erenrich, Medium)
- MMLU Wikipedia
- Benchmark Saturation: AI Evaluation Metrics and Ceiling Effects (Brenndoerfer)
- Mapping global dynamics of benchmark creation and saturation in AI (Nature Communications)
Notes & Caveats
- Saturated since 2024: Frontier models score 88-93%, making MMLU unable to discriminate between them. Further “improvements” increasingly reflect memorization of incorrect ground-truth labels rather than genuine capability gains.
- 6.5% error rate: The MMLU-Redux study found that 6.5% of questions have errors (wrong answers, ambiguous questions, multiple correct answers). Some subsets are far worse: 57% of Virology questions were flagged as erroneous.
- Prompt sensitivity: Model scores can vary 4-5 percentage points depending on prompt format. GPT-4o showed a 13 percentage point variance on MMLU-Pro across different measurement sources.
- Data contamination: As one of the most widely used benchmarks, MMLU questions have likely been seen by many models during pretraining. This makes score comparisons across model generations unreliable.
- Knowledge vs. capability: MMLU measures factual recall and basic reasoning within a multiple-choice format. It says nothing about a model’s ability to complete tasks, follow complex instructions, or produce extended outputs.
- Successor treadmill: MMLU-Pro was created to address saturation but is itself approaching saturation (frontier models at ~90%). MMLU-ProX extends to 29 languages. This pattern of replacement benchmarks saturating within 1-2 years appears structural.
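The interaction between the error rate and memorization can be made concrete with a toy model (an illustration assumed here, not from the MMLU-Redux paper): let e be the fraction of erroneous questions; a perfect reasoner disagrees with the bad keys, while a key-memorizing model reproduces them.

```python
e = 0.065  # fraction of questions with erroneous ground-truth labels

def observed_score(clean_acc, key_match_rate):
    """Accuracy against the official key: correct on clean items, plus
    'credit' for reproducing incorrect keys on the erroneous ones."""
    return clean_acc * (1 - e) + key_match_rate * e

print(observed_score(1.0, 0.0))   # perfect reasoner caps out near 0.935
print(observed_score(1.0, 1.0))   # memorizing bad keys too scores 1.0
```

This is why scores above the ~93-94% ceiling are suspect: the only way past it is to agree with labels that are wrong.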