
Vivaria

Status: New · Hold
Tags: AI/ML · open-source · MIT license

What It Does

Vivaria is METR’s open-source platform for running AI agent evaluations and conducting elicitation research. It provides infrastructure for starting task environments based on the METR Task Standard, running AI agents inside those environments, and analyzing the results through dashboards and detailed trace logs. It lets you view LLM API requests and responses alongside agent actions and observations, and edit runs mid-flight to test counterfactual outcomes.

METR used Vivaria internally for all major pre-deployment evaluations (GPT-4o, o1-preview, o3, Claude 3.5/3.7, DeepSeek R1/V3). However, as of early 2026, METR is ramping down new feature development on Vivaria and recommending that new evaluation projects use the UK AISI’s Inspect framework instead.

Key Features

  • Task environment management based on METR Task Standard definitions
  • AI agent execution within sandboxed task environments
  • LLM API request/response logging and visualization
  • Agent action and observation trace analysis
  • Large-scale dashboards for evaluation campaigns
  • Mid-run editing to test alternative trajectories
  • Database-backed result storage for cross-evaluation analysis
  • Support for multiple agent scaffolding architectures (modular-public, flock-public, triframe)
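To make the trace-analysis features above concrete, here is a minimal sketch of how a run trace of LLM requests, agent actions, and observations might be tallied. The entry schema below is an illustrative assumption, not Vivaria's actual trace format:

```python
from collections import Counter

# Hypothetical trace entries; Vivaria's real schema differs. This only
# illustrates the request/action/observation structure described above.
trace = [
    {"type": "llm_request", "model": "gpt-4o", "prompt_tokens": 512},
    {"type": "llm_response", "completion_tokens": 128},
    {"type": "action", "tool": "bash", "command": "ls /home/agent"},
    {"type": "observation", "output": "task.txt"},
    {"type": "llm_request", "model": "gpt-4o", "prompt_tokens": 640},
]

def summarize(entries):
    """Count trace entries by type, as a dashboard might aggregate them."""
    return Counter(e["type"] for e in entries)

print(summarize(trace))
# Counter({'llm_request': 2, 'llm_response': 1, 'action': 1, 'observation': 1})
```

A real dashboard would aggregate far more (token counts, tool usage, timing), but the per-entry-type breakdown is the core of trace analysis.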

Use Cases

  • Pre-deployment safety evaluation: Running AI agents against standardized task suites to assess autonomous capabilities
  • Agent elicitation research: Testing how different prompting strategies, scaffolding, and tool access affect agent performance
  • Evaluation reproducibility: Re-running evaluations with controlled parameters for consistency
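The reproducibility use case amounts to pinning every parameter that can vary between runs. A minimal sketch of fingerprinting a run configuration so repeated evaluations can be matched up (the field names are assumptions for illustration, not Vivaria's run schema):

```python
import hashlib
import json

def run_fingerprint(config: dict) -> str:
    """Derive a stable fingerprint for a run configuration.
    Canonical JSON (sorted keys, fixed separators) keeps it deterministic."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Hypothetical controlled parameters for a repeated evaluation run.
base_run = {
    "task": "example_family/example_task",  # Task Standard-style ID (illustrative)
    "agent": "modular-public",
    "model": "gpt-4o",
    "temperature": 0.0,
    "seed": 42,
}
rerun = dict(base_run)  # identical parameters -> identical fingerprint

assert run_fingerprint(base_run) == run_fingerprint(rerun)
```

Storing such a fingerprint alongside results in the database makes it cheap to detect when two runs are genuinely comparable.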

Adoption Level Analysis

Small teams (<20 engineers): Likely overkill. Vivaria was built for METR’s specific evaluation workflows. Setting up and maintaining it requires non-trivial infrastructure. Use Inspect instead.

Medium orgs (20-200 engineers): Possible if you have existing investment in Vivaria, but migration to Inspect is recommended given the deprecation trajectory.

Enterprise (200+ engineers): METR itself used Vivaria at this scale. Frontier AI labs that adopted it for internal evals should plan migration to Inspect.

Alternatives

  • Inspect AI (UK AISI): actively developed, 100+ pre-built evals, broader community. Prefer when starting new evaluation projects (METR’s own recommendation).
  • EleutherAI lm-evaluation-harness: focused on static model evals, not agentic tasks. Prefer when you need traditional LLM benchmarking rather than agent evaluation.
  • OpenAI Evals: OpenAI’s eval framework, vendor-specific. Prefer when you are evaluating OpenAI models specifically.

Notes & Caveats

  • Deprecated in favor of Inspect: METR is ramping down new Vivaria feature development. The organization recommends Inspect for new projects. This is a clear signal to avoid new adoption.
  • Migration path exists: METR has published a comparison document between Vivaria and Inspect to help users transition.
  • Niche use case: Vivaria was purpose-built for METR’s specific agentic evaluation workflow. It is not a general-purpose LLM evaluation tool.
  • Infrastructure requirements: Requires Docker, database setup, and non-trivial configuration. Not a “run and go” tool.