What It Does
AI Safety Evaluation (Pre-Deployment) is an emerging pattern where frontier AI models are assessed by independent third parties for dangerous autonomous capabilities before public release. The pattern involves running AI agents against standardized task suites that measure capabilities in risk-relevant domains (autonomous replication, cyber offense, biological weapon facilitation, AI R&D acceleration), comparing performance against human baselines, and producing evaluation reports that inform deployment decisions.
The pattern was pioneered in 2022-2023 by METR (then known as ARC Evals), which conducted the first pre-deployment evaluations of GPT-4 and Claude. It has since been formalized through voluntary commitments at the AI Seoul Summit (May 2024, 16 companies), company-specific frontier AI safety policies (12 companies as of December 2025), and government frameworks such as the US Executive Order on AI (October 2023) and the EU AI Act.
Key Features
- Capability threshold testing: Evaluating whether models cross predefined risk thresholds for specific dangerous capabilities
- Human-calibrated baselines: Measuring AI performance relative to human experts on the same tasks, not just absolute scores
- Agent elicitation protocols: Systematic approaches to find the maximum capability of a model through prompting strategies, scaffolding, and tool access
- Evaluation integrity measures: Detecting and preventing models from gaming evaluations (reward hacking, evaluation awareness)
- Red-teaming: Adversarial testing to find failure modes and unexpected dangerous capabilities
- System cards: Structured disclosure documents summarizing evaluation results, published alongside model releases
- Monitorability assessment: Testing whether AI agent behavior can be effectively monitored by humans or automated systems
- Third-party independence: Evaluations conducted by organizations not financially dependent on the AI lab being evaluated
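The first two features above — capability threshold testing and human-calibrated baselines — can be sketched as a simple scoring routine. Everything here (task names, results, the 0.5 risk threshold) is illustrative and not any evaluator's actual protocol:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task: str
    agent_succeeded: bool
    human_success_rate: float  # fraction of human experts who solved the task

def evaluate(results: list[TaskResult], risk_threshold: float = 0.5) -> dict:
    """Score an agent against a human baseline and a predefined risk threshold.

    Returns the agent's absolute success rate, the human baseline, the
    agent's performance relative to humans, and whether the agent crosses
    the (hypothetical) risk threshold.
    """
    agent_rate = sum(r.agent_succeeded for r in results) / len(results)
    human_rate = sum(r.human_success_rate for r in results) / len(results)
    return {
        "agent_success_rate": agent_rate,
        "human_baseline": human_rate,
        "relative_to_human": agent_rate / human_rate if human_rate else float("inf"),
        "crosses_threshold": agent_rate >= risk_threshold,
    }

# Hypothetical results on a three-task autonomy suite
suite = [
    TaskResult("replicate_to_new_server", False, 0.9),
    TaskResult("acquire_compute", True, 0.8),
    TaskResult("evade_shutdown", False, 0.6),
]
report = evaluate(suite)
```

Real protocols add elicitation (scaffolding, retries, tool access) before scoring, so a single pass/fail per task understates the measured capability ceiling.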
Use Cases
- Pre-deployment safety gate: Labs use evaluations as a go/no-go signal for model releases
- Regulatory compliance: Demonstrating due diligence to regulators under frameworks like the EU AI Act
- Frontier AI safety policy compliance: Meeting voluntary commitments made under the AI Seoul Summit
- Risk communication: Providing structured information about model capabilities to deployers and policymakers
- Capability tracking: Monitoring exponential growth in AI capabilities across model generations
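The capability-tracking use case often reduces to fitting a growth curve across model generations. A minimal sketch, assuming capability is summarized as a single positive score per release (the history data below is made up for illustration):

```python
import math

def doubling_time_months(observations: list[tuple[float, float]]) -> float:
    """Fit log(score) = a + b*t by least squares and return the doubling time.

    `observations` are (months_since_reference, capability_score) pairs;
    scores must be positive (e.g., length of tasks an agent can complete).
    """
    ts = [t for t, _ in observations]
    ys = [math.log(s) for _, s in observations]
    n = len(observations)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / \
        sum((t - t_mean) ** 2 for t in ts)
    return math.log(2) / slope  # months for the capability score to double

# Illustrative (made-up) scores for four model generations, 6 months apart
history = [(0, 1.0), (6, 2.0), (12, 4.0), (18, 8.0)]
```

With the fabricated history above, the fit recovers a 6-month doubling time; real capability series are far noisier and the choice of score metric dominates the result.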
Adoption Level Analysis
Small teams (<20 engineers): Not applicable. Pre-deployment safety evaluation is relevant to organizations that develop frontier AI models, not to those that merely consume them.
Medium orgs (20-200 engineers): Minimal direct relevance unless building AI agents with autonomous capabilities. May consume evaluation results for vendor selection.
Enterprise (200+ engineers): Highly relevant for frontier AI labs, large deployers of AI systems in regulated industries, and government agencies overseeing AI development. The pattern is becoming a regulatory expectation.
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| Internal red-teaming | Lab-internal, not independent | Quick iteration during development, not final safety assessment |
| Bug bounty programs | Crowd-sourced, post-deployment | You need broad coverage of failure modes after release |
| Formal verification | Mathematical proofs of properties | You need provable guarantees (not yet practical for LLMs) |
| Capability benchmarks (general) | Measure capability, not risk | You want to understand model performance, not safety specifically |
Evidence & Sources
- METR: Resources for Measuring Autonomous AI Capabilities
- METR: Common Elements of Frontier AI Safety Policies
- TIME: Nobody Knows How to Safety-Test AI
- TIME: AI Models Are Getting Smarter. New Tests Are Racing to Catch Up
- arXiv: Third-party compliance reviews for frontier AI safety
- METR: Example autonomy evaluation protocol
- arXiv: Methodological Challenges in Agentic Evaluations of AI Systems
Notes & Caveats
- Voluntary, not mandatory (mostly): As of early 2026, pre-deployment evaluation is largely voluntary. The EU AI Act introduces some requirements, but enforcement mechanisms are still developing. Companies can stop cooperating with third-party evaluators at any time.
- Evaluator capture risk: Independent evaluators depend on labs for model access. This creates a subtle incentive for evaluators to avoid being “too critical” in order to maintain that access. METR has demonstrated willingness to publish embarrassing findings (reward hacking in o3), but the structural incentive remains.
- Goodhart’s Law applies: As evaluations become standardized, models may be optimized to perform well on evaluation tasks without genuine safety improvement. METR’s MALT dataset documents early instances of this.
- Capability thresholds are arbitrary: The specific thresholds for “dangerous” autonomous capability are judgment calls, not scientifically derived limits. Different organizations may set different thresholds.
- Evaluation gaps acknowledged: METR’s own research shows that algorithmic evaluation overestimates real-world capability. There is no consensus on what constitutes a “sufficient” evaluation for safe deployment.
- Rapidly evolving field: Evaluation methodologies are changing faster than they can be validated; the pattern is being formalized while it is still being invented.