METR Task Standard

New · Assess · AI / ML · Open source (MIT)

What It Does

The METR Task Standard is a portable specification for defining AI agent evaluation tasks. It provides a standardized format for packaging task definitions — including environment setup, instructions, scoring criteria, and resource requirements — so they can be shared, reproduced, and run across different evaluation platforms (originally Vivaria, now also Inspect and others).

As of early 2024, METR had used the standard to define roughly 200 task families comprising about 2,000 individual tasks, across categories including AI R&D, cybersecurity, and general autonomy. The standard makes evaluations portable: a task defined once can be run by any compatible evaluation harness without modification.

Key Features

  • Standardized task definition format with environment specs, instructions, scoring, and resource requirements
  • Docker-based task environments for reproducible execution
  • Support for multi-step, agentic task definitions (not just single-turn Q&A)
  • Scoring function specification for automated evaluation
  • Task family grouping for organizing related tasks at different difficulty levels
  • Used across HCAST (189 tasks), RE-Bench, and METR’s internal evaluation suites
  • Compatible with both Vivaria and Inspect evaluation platforms
  • YAML/JSON-based configuration for machine readability

Use Cases

  • Evaluation portability: Defining tasks once and running them across multiple evaluation platforms
  • Benchmark creation: Packaging custom evaluation suites in a shareable, reproducible format
  • Community evaluation sharing: Contributing tasks to shared repositories (e.g., Inspect Evals)
  • Regulatory evaluation: Standardizing the format for compliance-oriented AI assessments

Adoption Level Analysis

Small teams (<20 engineers): Accessible. The standard is straightforward to adopt for anyone creating AI evaluations, and overhead is low: it is just a specification to follow.

Medium orgs (20-200 engineers): Good fit for teams building evaluation infrastructure. Adopting the standard ensures compatibility with the broader eval ecosystem.

Enterprise (200+ engineers): Relevant for organizations with formal AI evaluation programs. The standard’s use by METR in official pre-deployment evaluations gives it institutional credibility.

Alternatives

Alternative           | Key difference                        | Prefer when…
Inspect eval format   | UK AISI’s native format, Python-based | You are using Inspect exclusively and prefer Python definitions
SWE-bench task format | GitHub issue-based task definitions   | You are evaluating on real open-source repo issues
Custom formats        | Proprietary internal formats          | You have specific requirements not met by existing standards

Notes & Caveats

  • Narrow adoption beyond METR: While the standard is well-designed, it is primarily used by METR and organizations working directly with METR. Broader ecosystem adoption is not well documented.
  • Inspect compatibility unclear: As METR transitions to Inspect, the relationship between the Task Standard and Inspect’s native eval format needs clarification. They may converge or the Task Standard may become a legacy format.
  • Docker dependency: Task environments require Docker, which adds infrastructure requirements for running evaluations locally.