What It Does
HCAST (Human-Calibrated Autonomy Software Tasks) is METR’s primary benchmark for measuring frontier AI models’ capacity to complete software tasks autonomously. It consists of 189 tasks grouped into 78 families across four domains: machine learning engineering, cybersecurity, software engineering, and general reasoning. Tasks range from 1 minute to 8+ hours of human completion time.
The “human-calibrated” aspect is the key differentiator: 140 skilled domain experts made 563 attempts to complete the tasks, providing grounded human baselines. This allows METR to report a model’s “time horizon” — the task duration (measured by human completion time) at which an AI agent achieves 50% success probability — rather than simply reporting pass/fail rates.
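The time-horizon computation amounts to a logistic regression of agent success against the log of human completion time, then solving for the task length where predicted success is 50%. A minimal sketch with invented attempt data (the records, names, and numbers below are illustrative assumptions, not METR’s actual data or code):

```python
import math

# Hypothetical per-attempt records: (human completion time in minutes,
# agent success as 0/1). Invented for illustration.
attempts = [(1, 1), (4, 1), (15, 1), (30, 1), (60, 0), (120, 1), (240, 0), (480, 0)]

def fit_logistic(data, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a - b * log2(t)) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for t, y in data:
            x = math.log2(t)
            p = 1 / (1 + math.exp(-(a - b * x)))
            grad_a += y - p
            grad_b += (y - p) * -x
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b

a, b = fit_logistic(attempts)
# The 50% horizon solves sigmoid(a - b * log2(t)) = 0.5, i.e. log2(t) = a / b.
horizon_minutes = 2 ** (a / b)
```

With success probability modeled as a declining function of log task length, the fitted crossover point is the model’s time horizon in minutes.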
Key Features
- 189 tasks across 78 families spanning ML, cybersecurity, software engineering, and reasoning
- Human-calibrated baselines from 140 skilled domain experts (563 total attempts)
- Task difficulty measured in human completion time (1 minute to 8+ hours)
- Logistic regression model for computing 50% time horizon per AI model
- Portable task definitions using the METR Task Standard
- Used in official pre-deployment evaluations of models from OpenAI (o3, GPT-5, GPT-5.1), Anthropic (Claude 3.7), and DeepSeek (R1, V3)
- Time-horizon tracking showing a ~7-month doubling time over six years
- Publicly available task subset via GitHub
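The doubling-time figure comes from fitting an exponential trend to time-horizon estimates over time: equivalently, a least-squares line through log2(horizon) versus date, whose slope is doublings per year. A minimal sketch with invented data points (not METR’s actual measurements):

```python
import math

# Invented (year, 50% time horizon in minutes) points, for illustration only.
trend = [(2019.0, 0.5), (2021.0, 4.0), (2023.0, 30.0), (2025.0, 240.0)]

years = [yr for yr, _ in trend]
log_horizons = [math.log2(h) for _, h in trend]  # doublings are linear in log2
n = len(trend)
mean_x = sum(years) / n
mean_y = sum(log_horizons) / n
# Least-squares slope: doublings of the time horizon per year.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, log_horizons)) \
    / sum((x - mean_x) ** 2 for x in years)
doubling_time_months = 12 / slope
```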
Use Cases
- Pre-deployment safety assessment: Measuring whether a new model has dangerous autonomous capabilities before release
- AI progress tracking: Using time horizon as a consistent metric across model generations
- Safety threshold monitoring: Detecting when AI agents approach capability levels requiring additional mitigations
- Research on evaluation methodology: Studying the relationship between benchmark performance and real-world capability
Adoption Level Analysis
Small teams (<20 engineers): Limited direct utility. HCAST is designed for evaluating frontier models, which small teams typically do not develop. The public task subset could be used for educational purposes.
Medium orgs (20-200 engineers): Relevant if you are building AI agents and want to benchmark against a credible standard. The METR Task Standard format is reusable for custom evaluations.
Enterprise (200+ engineers): Primary audience. Frontier AI labs use HCAST for pre-deployment evaluations. Governments and regulators reference HCAST time horizon data in policy discussions.
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| SWE-bench | Focused on real GitHub issues from popular repos | You need evaluation on actual open-source codebases |
| RE-Bench (METR) | ML research engineering specifically, with expert baselines | You need AI R&D capability assessment specifically |
| GPQA | Graduate-level Q&A, not agentic tasks | You need knowledge/reasoning evaluation, not autonomous task completion |
| FrontierMath (Epoch AI) | Extremely hard math problems | You need mathematical reasoning benchmarks |
Evidence & Sources
- arXiv: HCAST (2503.17354)
- METR: Measuring AI Ability to Complete Long Tasks (blog)
- Epoch AI: METR Time Horizons tracking
- MIT Technology Review: This is the most misunderstood graph in AI
- Are We There Yet? Evaluating METR’s Eval (Empiricrafting)
Notes & Caveats
- Coding-centric: Despite four stated domains, the benchmark is overwhelmingly software-engineering tasks. The July 2025 cross-domain analysis was a first attempt at diversification.
- Human baseline inflation: Critics note that repo maintainers are 5-18x faster than METR’s baseline testers, meaning reported time horizons may overstate practical AI capability.
- Algorithmic vs. holistic gap: METR’s own August 2025 research found that 38% algorithmic success on tests yields 0% mergeable PRs, suggesting the benchmark overstates real-world utility.
- Wide confidence intervals: The TH1.0 interval for Claude Opus 4.6 was [319, 3949] minutes (~5-66 hours), a 12x range. TH1.1 (January 2026) improved this by expanding the suite to 228 tasks and doubling 8+ hour tasks (14 to 31), reducing the upper bound multiplier from 4.4x to 2.3x. The logistic regression model extrapolates beyond calibrated task difficulty when frontier models exceed the hardest available tasks.
- Time horizon is easily misinterpreted: “4-hour time horizon” does not mean the model replaces 4 hours of human work. It means it achieves 50% success on tasks that take humans 4 hours in the benchmark environment.
- Benchmark extension faces scaling limits: Creating tasks requiring 40+ hours of human effort is expensive ($2,000+ per human calibration attempt) and difficult to staff, imposing structural limits on how far the task suite can extend.
- Not publicly reproducible in full: Only a subset of tasks is publicly available. The full suite requires partnership with METR.
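The wide confidence intervals noted in the caveats above are typically estimated by resampling. A minimal bootstrap sketch over invented attempt data (an assumption for illustration: METR’s actual procedure resamples over task families and may differ in detail):

```python
import math
import random

random.seed(0)

# Invented per-attempt records: (human completion minutes, agent success 0/1).
attempts = [(1, 1), (4, 1), (15, 1), (30, 1), (60, 0), (120, 1), (240, 0), (480, 0)]

def log2_horizon(sample, lr=0.1, steps=1000):
    """Fit P(success) = sigmoid(a - b * log2(t)); return log2 of the 50% horizon."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for t, y in sample:
            z = a - b * math.log2(t)
            # Numerically safe sigmoid (resamples can be nearly separable).
            p = 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))
            grad_a += y - p
            grad_b += (y - p) * -math.log2(t)
        a += lr * grad_a / len(sample)
        b += lr * grad_b / len(sample)
    return a / b

# Resample attempts with replacement and refit to get a horizon distribution.
boot = sorted(log2_horizon([random.choice(attempts) for _ in attempts])
              for _ in range(200))
ci_low, ci_high = 2 ** boot[4], 2 ** boot[195]  # ~95% percentile interval, in minutes
```

With few long tasks in the suite, resamples that drop or duplicate them shift the fitted crossover substantially, which is one mechanism behind intervals as wide as the [319, 3949]-minute range reported for TH1.0.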