What It Does
HCAST (Human-Calibrated Autonomy Software Tasks) is METR’s primary benchmark for measuring frontier AI models’ capacity to complete software tasks autonomously. It consists of 189 tasks grouped into 78 families across four domains: machine learning engineering, cybersecurity, software engineering, and general reasoning. Tasks range from 1 minute to 8+ hours of human completion time.
The “human-calibrated” aspect is the key differentiator: 140 skilled domain experts made 563 attempts to complete the tasks, providing grounded human baselines. This allows METR to report a model’s “time horizon” — the task duration (measured by human completion time) at which an AI agent achieves 50% success probability — rather than simply reporting pass/fail rates.
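The time-horizon computation amounts to a logistic regression of agent success against the log of human completion time, then solving for the task length where predicted success is 50%. A minimal sketch with invented attempt data (the records, names, and numbers below are illustrative assumptions, not METR’s actual data or code):

```python
import math

# Hypothetical per-attempt records: (human completion time in minutes,
# agent success as 0/1). Invented for illustration.
attempts = [(1, 1), (4, 1), (15, 1), (30, 1), (60, 0), (120, 1), (240, 0), (480, 0)]

def fit_logistic(data, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a - b * log2(t)) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for t, y in data:
            x = math.log2(t)
            p = 1 / (1 + math.exp(-(a - b * x)))
            grad_a += y - p
            grad_b += (y - p) * -x
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b

a, b = fit_logistic(attempts)
# The 50% horizon solves sigmoid(a - b * log2(t)) = 0.5, i.e. log2(t) = a / b.
horizon_minutes = 2 ** (a / b)
```

With success probability modeled as a declining function of log task length, the fitted crossover point is the model’s time horizon in minutes.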
Key Features
- 189 tasks across 78 families spanning ML, cybersecurity, software engineering, and reasoning
- Human-calibrated baselines from 140 skilled domain experts (563 total attempts)
- Task difficulty measured in human completion time (1 minute to 8+ hours)
- Logistic regression model for computing 50% time horizon per AI model
- Portable task definitions using the METR Task Standard
- Used in official pre-deployment evaluations of models from OpenAI (o3, GPT-5, GPT-5.1), Anthropic (Claude 3.7), and DeepSeek (R1, V3)
- Time-horizon tracking showing a ~7-month doubling time over six years
- Publicly available task subset via GitHub
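The doubling-time figure comes from fitting an exponential trend to time-horizon estimates over time: equivalently, a least-squares line through log2(horizon) versus date, whose slope is doublings per year. A minimal sketch with invented data points (not METR’s actual measurements):

```python
import math

# Invented (year, 50% time horizon in minutes) points, for illustration only.
trend = [(2019.0, 0.5), (2021.0, 4.0), (2023.0, 30.0), (2025.0, 240.0)]

years = [yr for yr, _ in trend]
log_horizons = [math.log2(h) for _, h in trend]  # doublings are linear in log2
n = len(trend)
mean_x = sum(years) / n
mean_y = sum(log_horizons) / n
# Least-squares slope: doublings of the time horizon per year.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, log_horizons)) \
    / sum((x - mean_x) ** 2 for x in years)
doubling_time_months = 12 / slope
```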
Use Cases
- Pre-deployment safety assessment: Measuring whether a new model has dangerous autonomous capabilities before release
- AI progress tracking: Using time horizon as a consistent metric across model generations
- Safety threshold monitoring: Detecting when AI agents approach capability levels requiring additional mitigations
- Research on evaluation methodology: Studying the relationship between benchmark performance and real-world capability
Adoption Level Analysis
Small teams (<20 engineers): Limited direct utility. HCAST is designed for evaluating frontier models, which small teams typically do not develop. The public task subset could be used for educational purposes.
Medium orgs (20-200 engineers): Relevant if you are building AI agents and want to benchmark against a credible standard. The METR Task Standard format is reusable for custom evaluations.
Enterprise (200+ engineers): Primary audience. Frontier AI labs use HCAST for pre-deployment evaluations. Governments and regulators reference HCAST time horizon data in policy discussions.
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| SWE-bench | Focused on real GitHub issues from popular repos | You need evaluation on actual open-source codebases |
| RE-Bench (METR) | ML research engineering specifically, with expert baselines | You need AI R&D capability assessment specifically |
| GPQA | Graduate-level Q&A, not agentic tasks | You need knowledge/reasoning evaluation, not autonomous task completion |
| FrontierMath (Epoch AI) | Extremely hard math problems | You need mathematical reasoning benchmarks |
Evidence & Sources
- arXiv: HCAST (2503.17354)
- METR: Measuring AI Ability to Complete Long Tasks (blog)
- Epoch AI: METR Time Horizons tracking
- MIT Technology Review: This is the most misunderstood graph in AI
- Are We There Yet? Evaluating METR’s Eval (Empiricrafting)
Notes & Caveats
- Coding-centric: Despite four stated domains, the benchmark is overwhelmingly software-engineering tasks. The July 2025 cross-domain analysis was a first attempt at diversification.
- Human baseline inflation: Critics note that repo maintainers are 5-18x faster than METR’s baseline testers, meaning reported time horizons may overstate practical AI capability.
- Algorithmic vs. holistic gap: METR’s own August 2025 research found that 38% algorithmic success on tests yields 0% mergeable PRs, suggesting the benchmark overstates real-world utility.
- Wide confidence intervals: The TH1.0 interval for Claude Opus 4.6 was [319, 3949] minutes (~5-66 hours), a 12x range. TH1.1 (January 2026) improved this by expanding the suite to 228 tasks and doubling 8+ hour tasks (14 to 31), reducing the upper bound multiplier from 4.4x to 2.3x. The logistic regression model extrapolates beyond calibrated task difficulty when frontier models exceed the hardest available tasks.
- Time horizon is easily misinterpreted: “4-hour time horizon” does not mean the model replaces 4 hours of human work. It means it achieves 50% success on tasks that take humans 4 hours in the benchmark environment.
- Benchmark extension faces scaling limits: Creating tasks requiring 40+ hours of human effort is expensive ($2,000+ per human calibration attempt) and difficult to staff, imposing structural limits on how far the task suite can extend.
- Not publicly reproducible in full: Only a subset of tasks is publicly available. The full suite requires partnership with METR.
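The wide confidence intervals noted in the caveats above are typically estimated by resampling. A minimal bootstrap sketch over invented attempt data (an assumption for illustration: METR’s actual procedure resamples over task families and may differ in detail):

```python
import math
import random

random.seed(0)

# Invented per-attempt records: (human completion minutes, agent success 0/1).
attempts = [(1, 1), (4, 1), (15, 1), (30, 1), (60, 0), (120, 1), (240, 0), (480, 0)]

def log2_horizon(sample, lr=0.1, steps=1000):
    """Fit P(success) = sigmoid(a - b * log2(t)); return log2 of the 50% horizon."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for t, y in sample:
            z = a - b * math.log2(t)
            # Numerically safe sigmoid (resamples can be nearly separable).
            p = 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))
            grad_a += y - p
            grad_b += (y - p) * -math.log2(t)
        a += lr * grad_a / len(sample)
        b += lr * grad_b / len(sample)
    return a / b

# Resample attempts with replacement and refit to get a horizon distribution.
boot = sorted(log2_horizon([random.choice(attempts) for _ in attempts])
              for _ in range(200))
ci_low, ci_high = 2 ** boot[4], 2 ** boot[195]  # ~95% percentile interval, in minutes
```

With few long tasks in the suite, resamples that drop or duplicate them shift the fitted crossover substantially, which is one mechanism behind intervals as wide as the [319, 3949]-minute range reported for TH1.0.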