METR (Model Evaluation & Threat Research)

★ New · assess
Category: AI / ML · Vendor: N/A (nonprofit research organization) · Pricing: free (public research outputs)

What It Does

METR (Model Evaluation & Threat Research) is a Berkeley-based 501(c)(3) nonprofit research organization that evaluates frontier AI models for dangerous autonomous capabilities before deployment. Founded in August 2022 by Beth Barnes (formerly of OpenAI and Google DeepMind) as ARC Evals, it spun out from the Alignment Research Center in September 2023 and rebranded to METR in December 2023.

METR’s core work spans pre-deployment safety evaluations for leading AI labs (OpenAI, Anthropic, Google DeepMind), benchmarks that measure autonomous AI capabilities (HCAST, RE-Bench, Time Horizons), and policy-relevant AI safety research. It also analyzes frontier AI safety policies across companies and conducts original research on topics such as developer productivity, reward hacking, and AI monitorability.

Key Features

  • Pre-deployment safety evaluations for frontier models (GPT-5.1, GPT-5, o3, Claude 3.7, DeepSeek R1/V3) measuring catastrophic risk potential
  • HCAST benchmark: 189 tasks across ML, cybersecurity, software engineering, and reasoning with human-calibrated baselines (140 people, 563 attempts)
  • RE-Bench: ML research engineering evaluation comparing AI to 71 human experts
  • Time Horizon metric: tracks exponential growth in AI task-completion ability (~7-month doubling time over 6 years; see the extrapolation sketch after this list)
  • METR Task Standard: portable format for defining AI evaluation tasks, adopted across the eval ecosystem
  • Vivaria evaluation platform (open-source, now transitioning to UK AISI’s Inspect)
  • MALT dataset: curated examples of behaviors threatening evaluation integrity
  • Common Elements of Frontier AI Safety Policies: analysis of 12 companies’ voluntary safety commitments
  • Developer productivity RCT: randomized controlled trial (n=16) finding a 19% slowdown when developers used AI tools
  • Agent scaffolding research: modular-public, flock-public, triframe agent architectures for evaluation
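
The Time Horizon bullet above reduces to doubling-time arithmetic. Below is a minimal Python sketch of how a ~7-month doubling time compounds; the 1-hour starting horizon is a hypothetical placeholder, not a METR-published figure.

```python
# Illustrative extrapolation of an exponential Time Horizon trend: a quantity
# that doubles every 7 months grows by a factor of 2**(months / 7).
# The starting horizon below is a placeholder, not a METR measurement.

DOUBLING_MONTHS = 7  # approximate doubling time METR reports


def horizon_after(months: float, start_minutes: float) -> float:
    """50%-success time horizon after `months`, assuming clean exponential growth."""
    return start_minutes * 2 ** (months / DOUBLING_MONTHS)


if __name__ == "__main__":
    start = 60.0  # hypothetical starting horizon of 1 hour
    for m in (0, 7, 14, 28, 42):
        h = horizon_after(m, start)
        print(f"+{m:2d} months: ~{h:6.0f} min (~{h / 60:.1f} h)")
```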

Use Cases

  • AI safety evaluation: Organizations needing independent third-party assessment of frontier model capabilities and risks before deployment
  • Policy development: Governments and regulators using METR’s benchmarks and policy analyses to inform AI safety regulation
  • Benchmarking: Researchers and companies using HCAST, RE-Bench, and Time Horizons as reference metrics for AI progress
  • Red-teaming methodology: Labs adopting METR’s evaluation protocols and task standards for internal safety testing

Adoption Level Analysis

Small teams (<20 engineers): Not directly applicable. METR is a research organization, not a product. Small teams can consume their public research, use HCAST/RE-Bench as benchmarks, or adopt the METR Task Standard.

Medium orgs (20-200 engineers): Relevant as consumers of METR’s public research outputs and open-source tools. The Task Standard and Inspect (recommended replacement for Vivaria) are usable for internal evaluation work.
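
For organizations in this bracket, adopting the Task Standard mostly means expressing internal evaluation tasks as a Python task family. The sketch below assumes the Task Standard's TaskFamily-style interface; the class attributes and method signatures are approximations to be verified against the METR/task-standard repository, and the task itself is invented for illustration. It runs standalone as plain Python.

```python
# Sketch of a Task Standard-style task family. The TaskFamily shape
# (standard_version, get_tasks, get_instructions, score) is assumed from the
# METR/task-standard docs; verify against the repo before relying on it.


class TaskFamily:
    standard_version = "0.3.0"  # assumed; pin to whatever version you target

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Each key names a task; the value holds task-specific data.
        return {
            "reverse_string": {"input": "evaluation", "expected": "noitaulave"},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Reverse the string {t['input']!r} and submit the result."

    @staticmethod
    def score(t: dict, submission: str) -> float:
        # 1.0 for a correct submission, 0.0 otherwise.
        return 1.0 if submission.strip() == t["expected"] else 0.0


if __name__ == "__main__":
    t = TaskFamily.get_tasks()["reverse_string"]
    print(TaskFamily.get_instructions(t))
    print("score:", TaskFamily.score(t, "noitaulave"))
```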

Enterprise (200+ engineers): Primary audience. Frontier AI labs engage METR for pre-deployment evaluations. Large organizations and governments use METR’s policy analysis and benchmark data for strategic planning and regulatory compliance.

Alternatives

  • Apollo Research: focuses on AI scheming/deception rather than autonomous capabilities. Prefer when you need evaluation of strategic deception and misalignment behaviors.
  • Epoch AI: focuses on AI progress forecasting and compute trends, not model evaluation. Prefer when you need macro-level AI progress tracking and forecasting.
  • UK AISI (AI Safety Institute): government body with a regulatory mandate; produces the Inspect framework. Prefer when you need a government-backed evaluation framework or compliance.
  • Redwood Research: focuses on interpretability and alignment research. Prefer when you need mechanistic understanding of model behaviors.
  • RAND Corporation: broader policy focus, less technical depth on evals. Prefer when you need policy-oriented AI risk assessment.

Notes & Caveats

  • Funding dependency risk: While METR does not accept direct cash from AI companies, it receives compute credits from OpenAI and Anthropic. Its entire pre-deployment evaluation program depends on labs voluntarily granting early access. If a lab decided not to cooperate, METR’s core value proposition would be diminished.
  • Methodology limitations acknowledged by METR: its August 2025 “Algorithmic vs. Holistic Evaluation” post reported that a 38% algorithmic success rate on benchmark tasks corresponded to 0% of PRs judged mergeable under holistic review, suggesting benchmarks overestimate real-world AI utility. The Time Horizon graph is also widely misinterpreted despite METR’s own caveats.
  • Measurement noise at the frontier: the TH1.0 confidence interval for Claude Opus 4.6 spanned [319, 3949] minutes (~5 to 66 hours), an order-of-magnitude range that undermines precision. TH1.1 (January 2026) improved this by expanding the task suite by 34% and more than doubling the number of 8+ hour tasks (from 14 to 31), cutting the upper-bound multiplier from 4.4x to 2.3x. However, the logistic regression model still extrapolates beyond its calibrated range when frontier models exceed the hardest tasks (a toy version of this 50% read-off appears after this list).
  • Benchmark extension faces economic limits: Creating tasks requiring 40+ hours of human effort for calibration is expensive ($2,000+ per human attempt at $50/hour minimum) and difficult to staff. This structural scaling constraint may limit how far METR can extend the Time Horizons approach without fundamentally rethinking the methodology.
  • Scope concentration: Nearly all METR benchmarks are software-engineering heavy. Their Time Horizon analysis across domains (July 2025) is the first attempt to diversify, but the flagship metric remains coding-centric.
  • Vivaria deprecation: METR is winding down Vivaria in favor of UK AISI’s Inspect framework. Existing Vivaria users should plan migration.
  • Team concentration risk: a small team (~30 people) produces evaluation reports relied on by major AI labs and governments, a bottleneck with no redundancy.
  • Publication process: Research is internally reviewed but not formally peer-reviewed (though papers appear on arXiv). Some work appears first on Substack/blog, not in academic venues.
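
The time-horizon caveats above are easier to see with a toy version of the general approach: fit a logistic curve of success probability against log2 of task duration, then read off the duration at which predicted success crosses 50%. The Python sketch below uses synthetic numbers and a hand-rolled fit; it is not METR's code, data, or exact model specification.

```python
# Toy 50% time-horizon read-off from a logistic fit. Synthetic data only.

import numpy as np

# Synthetic calibrated task lengths in minutes (longest: 8 hours).
minutes = np.array([2, 5, 15, 30, 60, 120, 240, 480], dtype=float)
x = np.log2(minutes)

# Synthetic success rates for a hypothetical strong model at each length.
success = np.array([0.99, 0.98, 0.96, 0.93, 0.89, 0.83, 0.76, 0.68])

# Fit p(success) = sigmoid(a + b * x) by gradient descent on cross-entropy.
a, b = 0.0, 0.0
for _ in range(50_000):
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    grad = p - success          # d(loss)/d(logit) for cross-entropy
    a -= 0.1 * grad.mean()
    b -= 0.1 * (grad * x).mean()

# The 50% time horizon is where a + b * x = 0, i.e. x = -a / b.
horizon_minutes = 2.0 ** (-a / b)
print(f"fitted 50% horizon: ~{horizon_minutes:.0f} min")
print(f"longest calibrated task: {minutes.max():.0f} min")
print("horizon extrapolates past calibrated range:", horizon_minutes > minutes.max())
```

With synthetic success rates this high on the longest tasks, the fitted 50% point lands beyond the 8-hour ceiling of the toy suite, which is exactly the extrapolation concern raised above.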