METR (Model Evaluation & Threat Research)

★ New · assess
Category: AI / ML · Vendor: N/A (nonprofit research organization) · Pricing: free (public research outputs)

What It Does

METR (Model Evaluation & Threat Research) is a Berkeley-based 501(c)(3) nonprofit research organization that evaluates frontier AI models for dangerous autonomous capabilities before deployment. Founded in August 2022 by Beth Barnes (formerly of OpenAI and Google DeepMind) as ARC Evals, it spun out from the Alignment Research Center in September 2023 and rebranded to METR in December 2023.

METR’s core work spans pre-deployment safety evaluations for leading AI labs (OpenAI, Anthropic, Google DeepMind), benchmarks that measure autonomous AI capabilities (HCAST, RE-Bench, Time Horizons), and policy-relevant AI safety research. It also analyzes frontier AI safety policies across companies and conducts original research on topics such as developer productivity, reward hacking, and AI monitorability.

Key Features

  • Pre-deployment safety evaluations for frontier models (GPT-5.1, GPT-5, o3, Claude 3.7, DeepSeek R1/V3) measuring catastrophic risk potential
  • HCAST benchmark: 189 tasks across ML, cybersecurity, software engineering, and reasoning with human-calibrated baselines (140 people, 563 attempts)
  • RE-Bench: ML research engineering evaluation comparing AI to 71 human experts
  • Time Horizon metric: tracks exponential growth in AI task-completion ability (~7-month doubling time over 6 years; see the extrapolation sketch after this list)
  • METR Task Standard: portable format for defining AI evaluation tasks, adopted across the eval ecosystem
  • Vivaria evaluation platform (open-source, now transitioning to UK AISI’s Inspect)
  • MALT dataset: curated examples of behaviors threatening evaluation integrity
  • Common Elements of Frontier AI Safety Policies: analysis of 12 companies’ voluntary safety commitments
  • Developer productivity RCT: randomized controlled trial (n=16) finding a 19% slowdown when developers used AI tools
  • Agent scaffolding research: modular-public, flock-public, triframe agent architectures for evaluation
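
The Time Horizon bullet above reduces to doubling-time arithmetic. Below is a minimal Python sketch of how a ~7-month doubling time compounds; the 1-hour starting horizon is a hypothetical placeholder, not a METR-published figure.

```python
# Illustrative extrapolation of an exponential Time Horizon trend: a quantity
# that doubles every 7 months grows by a factor of 2**(months / 7).
# The starting horizon below is a placeholder, not a METR measurement.

DOUBLING_MONTHS = 7  # approximate doubling time METR reports


def horizon_after(months: float, start_minutes: float) -> float:
    """50%-success time horizon after `months`, assuming clean exponential growth."""
    return start_minutes * 2 ** (months / DOUBLING_MONTHS)


if __name__ == "__main__":
    start = 60.0  # hypothetical starting horizon of 1 hour
    for m in (0, 7, 14, 28, 42):
        h = horizon_after(m, start)
        print(f"+{m:2d} months: ~{h:6.0f} min (~{h / 60:.1f} h)")
```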

Use Cases

  • AI safety evaluation: Organizations needing independent third-party assessment of frontier model capabilities and risks before deployment
  • Policy development: Governments and regulators using METR’s benchmarks and policy analyses to inform AI safety regulation
  • Benchmarking: Researchers and companies using HCAST, RE-Bench, and Time Horizons as reference metrics for AI progress
  • Red-teaming methodology: Labs adopting METR’s evaluation protocols and task standards for internal safety testing

Adoption Level Analysis

Small teams (<20 engineers): Not directly applicable. METR is a research organization, not a product. Small teams can consume their public research, use HCAST/RE-Bench as benchmarks, or adopt the METR Task Standard.

Medium orgs (20-200 engineers): Relevant as consumers of METR’s public research outputs and open-source tools. The Task Standard and Inspect (recommended replacement for Vivaria) are usable for internal evaluation work.
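
For organizations in this bracket, adopting the Task Standard mostly means expressing internal evaluation tasks as a Python task family. The sketch below assumes the Task Standard's TaskFamily-style interface; the class attributes and method signatures are approximations to be verified against the METR/task-standard repository, and the task itself is invented for illustration. It runs standalone as plain Python.

```python
# Sketch of a Task Standard-style task family. The TaskFamily shape
# (standard_version, get_tasks, get_instructions, score) is assumed from the
# METR/task-standard docs; verify against the repo before relying on it.


class TaskFamily:
    standard_version = "0.3.0"  # assumed; pin to whatever version you target

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Each key names a task; the value holds task-specific data.
        return {
            "reverse_string": {"input": "evaluation", "expected": "noitaulave"},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Reverse the string {t['input']!r} and submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float:
        # 1.0 for a correct submission, 0.0 otherwise.
        return 1.0 if submission.strip() == t["expected"] else 0.0


if __name__ == "__main__":
    t = TaskFamily.get_tasks()["reverse_string"]
    print(TaskFamily.get_instructions(t))
    print("score:", TaskFamily.score(t, "noitaulave"))
```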

Enterprise (200+ engineers): Primary audience. Frontier AI labs engage METR for pre-deployment evaluations. Large organizations and governments use METR’s policy analysis and benchmark data for strategic planning and regulatory compliance.

Alternatives

  • Apollo Research: focuses on AI scheming/deception rather than autonomous capabilities. Prefer when you need evaluation of strategic deception and misalignment behaviors.
  • Epoch AI: focuses on AI progress forecasting and compute trends, not model evaluation. Prefer when you need macro-level AI progress tracking and forecasting.
  • UK AISI (AI Safety Institute): government body with a regulatory mandate; produces the Inspect framework. Prefer when you need a government-backed evaluation framework or compliance.
  • Redwood Research: focuses on interpretability and alignment research. Prefer when you need mechanistic understanding of model behaviors.
  • RAND Corporation: broader policy focus, less technical depth on evals. Prefer when you need policy-oriented AI risk assessment.

Notes & Caveats

  • Funding dependency risk: While METR does not accept direct cash from AI companies, it receives compute credits from OpenAI and Anthropic. Its entire pre-deployment evaluation program depends on labs voluntarily granting early access. If a lab decided not to cooperate, METR’s core value proposition would be diminished.
  • Methodology limitations acknowledged by METR: its August 2025 “Algorithmic vs. Holistic Evaluation” post reported that a 38% algorithmic success rate on benchmark tasks corresponded to 0% of PRs judged mergeable under holistic review, suggesting benchmarks overestimate real-world AI utility. The Time Horizon graph is also widely misinterpreted despite METR’s own caveats.
  • Measurement noise at the frontier: the TH1.0 confidence interval for Claude Opus 4.6 spanned [319, 3949] minutes (~5 to 66 hours), an order-of-magnitude range that undermines precision. TH1.1 (January 2026) improved this by expanding the task suite by 34% and more than doubling the number of 8+ hour tasks (from 14 to 31), cutting the upper-bound multiplier from 4.4x to 2.3x. However, the logistic regression model still extrapolates beyond its calibrated range when frontier models exceed the hardest tasks (a toy version of this 50% read-off appears after this list).
  • Benchmark extension faces economic limits: Creating tasks requiring 40+ hours of human effort for calibration is expensive ($2,000+ per human attempt at $50/hour minimum) and difficult to staff. This structural scaling constraint may limit how far METR can extend the Time Horizons approach without fundamentally rethinking the methodology.
  • Scope concentration: Nearly all METR benchmarks are software-engineering heavy. Their Time Horizon analysis across domains (July 2025) is the first attempt to diversify, but the flagship metric remains coding-centric.
  • Vivaria deprecation: METR is winding down Vivaria in favor of UK AISI’s Inspect framework. Existing Vivaria users should plan migration.
  • Team concentration risk: a small team (~30 people) produces evaluation reports relied on by major AI labs and governments, a bottleneck with no redundancy.
  • Publication process: Research is internally reviewed but not formally peer-reviewed (though papers appear on arXiv). Some work appears first on Substack/blog, not in academic venues.
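
The time-horizon caveats above are easier to see with a toy version of the general approach: fit a logistic curve of success probability against log2 of task duration, then read off the duration at which predicted success crosses 50%. The Python sketch below uses synthetic numbers and a hand-rolled fit; it is not METR's code, data, or exact model specification.

```python
# Toy 50% time-horizon read-off from a logistic fit. Synthetic data only.

import numpy as np

# Synthetic calibrated task lengths in minutes (longest: 8 hours).
minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480], dtype=float)
x = np.log2(minutes)

# Synthetic success rates for a hypothetical strong model at each length.
success = np.array([0.99, 0.98, 0.96, 0.93, 0.89, 0.83, 0.76, 0.68])

# Fit p(success) = sigmoid(a + b * x) by gradient descent on cross-entropy.
a, b = 0.0, 0.0
for _ in range(50_000):
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    grad = p - success          # d(loss)/d(logit) for cross-entropy
    a -= 0.1 * grad.mean()
    b -= 0.1 * (grad * x).mean()

# The 50% time horizon is where a + b * x = 0, i.e. x = -a / b.
horizon_minutes = 2.0 ** (-a / b)
print(f"fitted 50% horizon: ~{horizon_minutes:.0f} min")
print(f"longest calibrated task: {minutes.max():.0f} min")
print("horizon extrapolates past calibrated range:", horizon_minutes > minutes.max())
```

With synthetic success rates this high on the longest tasks, the fitted 50% point lands beyond the 8-hour ceiling of the toy suite, which is exactly the extrapolation concern raised above.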