METR Task Standard

New · Assess · AI / ML · Open source (MIT)

What It Does

The METR Task Standard is a portable specification for defining AI agent evaluation tasks. It provides a standardized format for packaging task definitions — including environment setup, instructions, scoring criteria, and resource requirements — so they can be shared, reproduced, and run across different evaluation platforms (originally Vivaria, now also Inspect and others).

As of early 2024, METR had used the standard to define roughly 200 task families comprising about 2,000 individual tasks, across categories including AI R&D, cybersecurity, and general autonomy. The standard makes evaluations portable: a task defined once can be run by any compatible evaluation harness without modification.

Key Features

  • Standardized task definition format with environment specs, instructions, scoring, and resource requirements
  • Docker-based task environments for reproducible execution
  • Support for multi-step, agentic task definitions (not just single-turn Q&A)
  • Scoring function specification for automated evaluation
  • Task family grouping for organizing related tasks at different difficulty levels
  • Used across HCAST (189 tasks), RE-Bench, and METR’s internal evaluation suites
  • Compatible with both Vivaria and Inspect evaluation platforms
  • YAML/JSON-based configuration for machine readability

Use Cases

  • Evaluation portability: Defining tasks once and running them across multiple evaluation platforms
  • Benchmark creation: Packaging custom evaluation suites in a shareable, reproducible format
  • Community evaluation sharing: Contributing tasks to shared repositories (e.g., Inspect Evals)
  • Regulatory evaluation: Standardizing the format for compliance-oriented AI assessments

Adoption Level Analysis

Small teams (<20 engineers): Accessible. The standard is straightforward to adopt for anyone creating AI evaluations, and overhead is low: it is just a specification to follow.

Medium orgs (20-200 engineers): Good fit for teams building evaluation infrastructure. Adopting the standard ensures compatibility with the broader eval ecosystem.

Enterprise (200+ engineers): Relevant for organizations with formal AI evaluation programs. The standard’s use by METR in official pre-deployment evaluations gives it institutional credibility.

Alternatives

Alternative           | Key difference                        | Prefer when…
Inspect eval format   | UK AISI’s native format, Python-based | You are using Inspect exclusively and prefer Python definitions
SWE-bench task format | GitHub issue-based task definitions   | You are evaluating on real open-source repo issues
Custom formats        | Proprietary internal formats          | You have specific requirements not met by existing standards

Notes & Caveats

  • Narrow adoption beyond METR: While the standard is well-designed, it is primarily used by METR and organizations working directly with METR. Broader ecosystem adoption is not well documented.
  • Inspect compatibility unclear: As METR transitions to Inspect, the relationship between the Task Standard and Inspect’s native eval format needs clarification. They may converge or the Task Standard may become a legacy format.
  • Docker dependency: Task environments require Docker, which adds infrastructure requirements for running evaluations locally.