
Vivaria

Status: New · Hold
Tags: AI/ML · open-source · MIT license

What It Does

Vivaria is METR’s open-source platform for running AI agent evaluations and conducting elicitation research. It provides infrastructure for starting task environments based on the METR Task Standard, running AI agents inside those environments, and analyzing the results through dashboards and detailed trace logs. It lets you view LLM API requests and responses alongside agent actions and observations, and edit runs mid-flight to test counterfactual outcomes.

METR used Vivaria internally for all major pre-deployment evaluations (GPT-4o, o1-preview, o3, Claude 3.5/3.7, DeepSeek R1/V3). However, as of early 2026, METR is ramping down new feature development on Vivaria and recommending that new evaluation projects use the UK AISI’s Inspect framework instead.

Key Features

  • Task environment management based on METR Task Standard definitions
  • AI agent execution within sandboxed task environments
  • LLM API request/response logging and visualization
  • Agent action and observation trace analysis
  • Large-scale dashboards for evaluation campaigns
  • Mid-run editing to test alternative trajectories
  • Database-backed result storage for cross-evaluation analysis
  • Support for multiple agent scaffolding architectures (modular-public, flock-public, triframe)
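To make the trace-analysis features above concrete, here is a minimal sketch of how a run trace of LLM requests, agent actions, and observations might be tallied. The entry schema below is an illustrative assumption, not Vivaria's actual trace format:

```python
from collections import Counter

# Hypothetical trace entries; Vivaria's real schema differs. This only
# illustrates the request/action/observation structure described above.
trace = [
    {"type": "llm_request", "model": "gpt-4o", "prompt_tokens": 512},
    {"type": "llm_response", "completion_tokens": 128},
    {"type": "action", "tool": "bash", "command": "ls /home/agent"},
    {"type": "observation", "output": "task.txt"},
    {"type": "llm_request", "model": "gpt-4o", "prompt_tokens": 640},
]

def summarize(entries):
    """Count trace entries by type, as a dashboard might aggregate them."""
    return Counter(e["type"] for e in entries)

print(summarize(trace))
# Counter({'llm_request': 2, 'llm_response': 1, 'action': 1, 'observation': 1})
```

A real dashboard would aggregate far more (token counts, tool usage, timing), but the per-entry-type breakdown is the core of trace analysis.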

Use Cases

  • Pre-deployment safety evaluation: Running AI agents against standardized task suites to assess autonomous capabilities
  • Agent elicitation research: Testing how different prompting strategies, scaffolding, and tool access affect agent performance
  • Evaluation reproducibility: Re-running evaluations with controlled parameters for consistency
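The reproducibility use case amounts to pinning every parameter that can vary between runs. A minimal sketch of fingerprinting a run configuration so repeated evaluations can be matched up (the field names are assumptions for illustration, not Vivaria's run schema):

```python
import hashlib
import json

def run_fingerprint(config: dict) -> str:
    """Derive a stable fingerprint for a run configuration.
    Canonical JSON (sorted keys, fixed separators) keeps it deterministic."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Hypothetical controlled parameters for a repeated evaluation run.
base_run = {
    "task": "example_family/example_task",  # Task Standard-style ID (illustrative)
    "agent": "modular-public",
    "model": "gpt-4o",
    "temperature": 0.0,
    "seed": 42,
}
rerun = dict(base_run)  # identical parameters -> identical fingerprint

assert run_fingerprint(base_run) == run_fingerprint(rerun)
```

Storing such a fingerprint alongside results in the database makes it cheap to detect when two runs are genuinely comparable.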

Adoption Level Analysis

Small teams (<20 engineers): Likely overkill. Vivaria was built for METR’s specific evaluation workflows. Setting up and maintaining it requires non-trivial infrastructure. Use Inspect instead.

Medium orgs (20-200 engineers): Possible if you have existing investment in Vivaria, but migration to Inspect is recommended given the deprecation trajectory.

Enterprise (200+ engineers): METR itself used Vivaria at this scale. Frontier AI labs that adopted it for internal evals should plan migration to Inspect.

Alternatives

  • Inspect AI (UK AISI): actively developed, 100+ pre-built evals, broader community. Prefer when starting new evaluation projects (METR’s own recommendation).
  • EleutherAI lm-evaluation-harness: focused on static model evals, not agentic tasks. Prefer when you need traditional LLM benchmarking rather than agent evaluation.
  • OpenAI Evals: OpenAI’s eval framework, vendor-specific. Prefer when you are evaluating OpenAI models specifically.

Notes & Caveats

  • Deprecated in favor of Inspect: METR is ramping down new Vivaria feature development. The organization recommends Inspect for new projects. This is a clear signal to avoid new adoption.
  • Migration path exists: METR has published a comparison document between Vivaria and Inspect to help users transition.
  • Niche use case: Vivaria was purpose-built for METR’s specific agentic evaluation workflow. It is not a general-purpose LLM evaluation tool.
  • Infrastructure requirements: Requires Docker, database setup, and non-trivial configuration. Not a “run and go” tool.