Why It's Getting Harder to Measure AI Performance

Source: Understanding AI | Author: Timothy B. Lee | Published: 2026-04-02 | Category: research | Credibility: high

Executive Summary

  • AI benchmarks are hitting saturation at an accelerating rate: MMLU, once the gold standard, is now effectively useless for distinguishing frontier models due to a ~6.5% question error rate capping meaningful scores at ~93%.
  • METR’s Time Horizons benchmark, the most widely cited metric for AI progress, now produces confidence intervals spanning an order of magnitude (5 to 66 hours for Claude Opus 4.6; a quick conversion check follows this summary), undermining its precision as a measurement instrument.
  • The fundamental challenge is not just to create harder tests: real-world work involves ambiguity, stakeholder interaction, and evolving goals that isolated, well-defined benchmark tasks cannot capture, and this gap will widen as models take on longer-duration work.
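
As a quick check on the headline interval (the minute figures are METR’s, quoted in the analysis below):

```python
# Convert METR's reported TH1.0 interval for Claude Opus 4.6 from minutes to hours
# and check the claimed order-of-magnitude span.
ci_minutes = (319, 3949)
lo_h, hi_h = (m / 60 for m in ci_minutes)
print(f"{lo_h:.1f} to {hi_h:.1f} hours")                            # 5.3 to 65.8 hours
print(f"upper/lower ratio ≈ {ci_minutes[1] / ci_minutes[0]:.1f}x")  # ≈ 12.4x
```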

Critical Analysis

Claim: “METR’s time horizon for Claude Opus 4.6 ranges from 5 hours to 66 hours, indicating extreme measurement noise”

  • Evidence quality: benchmark (METR’s own published data with statistical methodology)
  • Assessment: This is accurate and well-documented. METR’s TH1.0 confidence interval for Opus 4.6 is [319, 3949] minutes, roughly 5.3 to 65.8 hours, against a point estimate of 718 minutes (~12 hours). The logistic regression model extrapolates beyond the maximum calibrated task difficulty (~8 hours, from RE-Bench), which inherently produces wide confidence bands; a sketch of why follows this claim’s analysis. METR acknowledged this in the TH1.1 update (January 2026), which expanded the task suite by 34% and reduced the upper confidence bound multiplier from 4.4x to 2.3x, but did not eliminate the problem.
  • Counter-argument: METR researchers would argue that the trend line (7-month doubling) is more informative than any single point estimate, and that TH1.1 has meaningfully tightened intervals. The article somewhat undersells the improvements METR has made to address this exact problem. Additionally, wide confidence intervals at the frontier are expected in any measurement regime pushing the boundaries of its instrument — this does not invalidate the metric, only its precision at extremes.

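To make the extrapolation problem concrete, here is a minimal sketch, not METR’s actual pipeline, of how a 50%-success time horizon is typically read off a logistic fit of success against log task length, and why bootstrap intervals balloon once the estimated horizon lies beyond the longest calibrated tasks. The task lengths, outcomes, and bootstrap settings are invented for illustration.

```python
# Illustrative time-horizon estimate in the spirit of METR's methodology.
# All task durations and success outcomes here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical calibration suite: human task lengths up to ~8 hours (480 min),
# with success probability that declines as tasks get longer.
task_minutes = np.tile([2, 4, 8, 15, 30, 60, 120, 240, 480], 6).astype(float)
log_len = np.log2(task_minutes)
p_true = 1.0 / (1.0 + np.exp(0.8 * (log_len - np.log2(700))))  # hidden "true" curve
success = (rng.random(task_minutes.size) < p_true).astype(int)

def horizon_minutes(x, y):
    """Task length at which fitted P(success) crosses 50%, or None if degenerate."""
    clf = LogisticRegression(C=1e6).fit(x.reshape(-1, 1), y)
    w, b = clf.coef_[0, 0], clf.intercept_[0]
    if w >= 0:                 # success did not decline with length in this sample
        return None
    return 2.0 ** (-b / w)     # sigmoid(w*x + b) = 0.5  =>  x = -b / w

point = horizon_minutes(log_len, success)

# Bootstrap over tasks: resample, refit, recompute the horizon.
boot = []
for _ in range(1000):
    idx = rng.integers(0, success.size, success.size)
    if success[idx].min() == success[idx].max():
        continue               # resample contains a single class; skip it
    h = horizon_minutes(log_len[idx], success[idx])
    if h is not None:
        boot.append(h)

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"horizon ≈ {point:.0f} min, 95% CI ≈ [{lo:.0f}, {hi:.0f}] min")
# Because the estimated horizon exceeds the longest calibrated task (480 min),
# small changes in the fitted slope move the extrapolated 50% crossing a lot,
# so the interval spans a wide multiple of the point estimate.
```
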
Claim: “MMLU saturated around 93% accuracy because approximately 6.5% of questions contain errors”

  • Evidence quality: peer-reviewed (MMLU-Redux study, published 2024)
  • Assessment: This is well-supported. The MMLU-Redux study manually reviewed 3,000 questions across 30 subsets and found a 6.5% overall error rate, with some subsets, such as Virology, having 57% erroneous questions. This effectively caps the maximum meaningful score at ~93-94% (the ceiling arithmetic is sketched after this claim’s analysis). GPT-4.1 scored 90.2% in 2025, and further gains increasingly reflect memorizing the flawed answer keys rather than demonstrating genuine capability. The article correctly identifies this as an illustration of the broader benchmark lifecycle problem.
  • Counter-argument: MMLU-Pro was created specifically to address this saturation, with harder questions and 10 answer choices instead of 4. However, MMLU-Pro itself is now approaching saturation with frontier models scoring ~90%. The “benchmark treadmill” problem — where each successor benchmark saturates within 1-2 years — suggests the issue is structural, not just about question quality.

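The ceiling arithmetic sketched below treats all flawed questions alike and assumes an otherwise-perfect model; the 6.5% figure is from MMLU-Redux, the rest is a simplification.

```python
# Back-of-the-envelope ceiling on MMLU scores given flawed questions.
error_rate = 0.065   # fraction of questions with a wrong or unanswerable official key
n_choices = 4        # standard MMLU multiple-choice format

# A model that is perfect on the clean 93.5% and cannot recover broken keys at all:
strict_ceiling = 1 - error_rate
# If it can still match a broken key by chance (a 1-in-4 guess):
chance_ceiling = (1 - error_rate) + error_rate / n_choices

print(f"strict ceiling ≈ {strict_ceiling:.1%}")   # ≈ 93.5%
print(f"chance ceiling ≈ {chance_ceiling:.1%}")   # ≈ 95.1%
# Scores in the low-to-mid 90s sit near this ceiling; gains above ~93-94%
# increasingly reflect matching bad keys rather than new capability.
```
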
Claim: “Traditional benchmarks fail to capture real workplace complexity because assignments interconnect, require stakeholder interaction, and involve evolving goals”

  • Evidence quality: anecdotal / expert opinion
  • Assessment: This is the article’s most important and least well-evidenced claim. It is conceptually strong: real software engineering involves ambiguous requirements, organizational context, and iterative feedback loops that no benchmark captures. METR’s own August 2025 finding, that a model achieving 38% success on algorithmic benchmark tasks produced zero mergeable PRs, is strong indirect evidence. However, the article does not cite specific studies on the gap between benchmark performance and real-world utility, because such studies barely exist. This is a known unknown, not an empirically validated position.
  • Counter-argument: The argument that “benchmarks can never capture real work” proves too much — it would also invalidate all standardized testing of humans (SATs, professional certifications). What benchmarks measure is useful, even if it is not the full picture. The more productive framing is that benchmarks measure a necessary but insufficient subset of capabilities, and the gap between benchmark and real-world performance needs to be characterized, not merely lamented.

Claim: “Creating tasks requiring 40+ hours of human effort faces economic and practical constraints that may prevent extending METR’s benchmark”

  • Evidence quality: anecdotal (based on interview with METR researcher Joel Becker)
  • Assessment: Plausible and economically sound. At a minimum of $50/hour, a single 40-hour human baseline attempt costs $2,000+. Because METR’s methodology requires multiple human attempts per task for calibration (563 attempts across 189 tasks for HCAST), scaling to 40+ hour tasks would cost millions; a rough cost model follows this claim’s analysis. Recruiting domain experts willing to spend a full work week on a single test task is also qualitatively different from recruiting for 1-4 hour tasks. The article correctly identifies this as a structural scaling problem.
  • Counter-argument: METR’s TH1.1 update already doubled the number of 8+ hour tasks (from 14 to 31), showing incremental extension is possible. The real question is whether the evaluation community will shift to real-world deployment metrics (e.g., code review acceptance rates, customer support resolution rates) rather than trying to build ever-longer synthetic tasks. The article does not adequately explore this alternative.

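A rough cost model makes the scaling concern concrete; the hourly rate and HCAST attempt counts come from the assessment above, while rerunning an HCAST-sized suite at 40-hour task lengths is a hypothetical used purely for illustration.

```python
# Rough cost of human baselines, using figures quoted in the assessment above.
hourly_rate = 50                  # minimum $/hour for baseline workers
hcast_attempts = 563              # human attempts METR collected for HCAST
hcast_tasks = 189                 # tasks in HCAST
attempts_per_task = hcast_attempts / hcast_tasks   # ~3 attempts per task

def suite_cost(task_hours, n_tasks):
    """Approximate labor cost of baselining a suite at a given task length."""
    return task_hours * hourly_rate * attempts_per_task * n_tasks

print(f"one 40-hour attempt:           ${40 * hourly_rate:,.0f}")              # $2,000
print(f"HCAST-sized suite at 40 hours: ${suite_cost(40, hcast_tasks):,.0f}")   # ~$1.1M
# Domain-expert rates well above $50/hour, or a larger task suite, push this
# into multiple millions of dollars, consistent with the scaling concern above.
```
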
Credibility Assessment

  • Author background: Timothy B. Lee holds a Master’s in Computer Science from Princeton (studied under Ed Felten). Former reporter at Ars Technica, Vox, and the Washington Post. Co-creator of RECAP (PACER document liberation tool). Founded Understanding AI in 2023 after 18 months writing Full Stack Economics. He has a strong technical background and a track record of clear, accurate technology reporting.
  • Publication bias: Independent Substack newsletter. No venture funding, no corporate parent. Lee’s incentive is subscriber growth, which favors clarity and accuracy over hype. He interviews primary sources (Joel Becker, David Rein from METR) and cites specific data.
  • Verdict: high — Lee is an established independent journalist with a computer science background, citing primary sources and published data. The article is measured in its claims and explicit about uncertainty. The main limitation is that the “benchmarks can’t capture real work” thesis, while plausible, is presented more as received wisdom than as an empirically tested hypothesis.

Entities Extracted

Entity                        Type          Catalog Entry
METR                          vendor        link
HCAST                         open-source   link
MMLU                          open-source   link
Humanity’s Last Exam (HLE)    open-source   link
Benchmark Saturation          pattern       link