Inspect AI

★ New
trial
AI / ML · open-source · MIT

What It Does

Inspect AI is an open-source framework for large language model evaluations created by the UK AI Safety Institute (AISI). Open-sourced in May 2024, it provides a standardized way to build, run, and analyze evaluations measuring coding ability, agentic task completion, reasoning, knowledge, behavioral safety, and multi-modal understanding.

Inspect is now the recommended replacement for METR’s Vivaria platform. It has broader community adoption (50+ contributors from safety institutes, frontier labs, and research organizations) and ships with 100+ pre-built evaluations ready to run against any supported model provider. The framework includes a web-based viewer for monitoring evaluations, a VS Code extension for authoring and debugging, and flexible tool-calling support, including MCP tools.

Key Features

  • 100+ pre-built evaluations (Inspect Evals) covering safety, capability, reasoning, and coding domains
  • Straightforward Python interfaces for implementing custom evaluations
  • Flexible tool-calling: built-in bash, Python, text editing, web search, web browsing, computer tools, plus MCP tool support
  • Web-based Inspect View for monitoring and visualizing evaluation runs
  • VS Code Extension for authoring and debugging evaluations
  • Model-agnostic: works with any LLM provider
  • Composable evaluation components for reuse across projects
  • Agent evaluation support for multi-step, tool-using AI systems
  • Inspect Evals repository: community-contributed evaluations maintained collaboratively by UK AISI, Arcadia Impact, and Vector Institute
  • Active development with regular feature releases

Use Cases

  • AI safety evaluation: Assessing model capabilities for dangerous autonomous behaviors before deployment
  • Benchmark development: Creating reproducible evaluations with standardized scoring
  • Red-teaming: Running adversarial evaluations to find model failure modes
  • Regulatory compliance: Government and enterprise teams evaluating AI systems against safety standards
  • Research: Academic and industry researchers needing a flexible evaluation harness

Adoption Level Analysis

Small teams (<20 engineers): Fits well. Python-based, pip-installable, with pre-built evals that work out of the box. Low barrier to entry for basic model evaluation. The VS Code extension helps with development workflow.

Medium orgs (20-200 engineers): Good fit. Enough flexibility for custom evaluation development, and the community-maintained eval repository means less wheel reinvention. Reasonable learning curve.

Enterprise (200+ engineers): Strong fit. Backed by the UK government, which provides institutional stability. Multiple frontier labs and safety organizations already contribute. Suitable for compliance-oriented evaluation programs.

Alternatives

| Alternative | Key Difference | Prefer when… |
| --- | --- | --- |
| Vivaria (METR) | METR’s previous platform, now deprecated | Legacy investment only; migrate to Inspect |
| EleutherAI lm-evaluation-harness | Static model evals, no agent support | You need traditional benchmarking without agentic tasks |
| OpenAI Evals | OpenAI-ecosystem-specific | You are exclusively evaluating OpenAI models |
| Promptfoo | Developer-focused prompt testing | You need fast prompt iteration, not safety evaluation |

Notes & Caveats

  • Government-backed but open: UK AISI maintains Inspect but it is MIT-licensed with genuine community governance. The government backing provides stability but could create perception issues in some jurisdictions.
  • METR endorsement is a strong signal: METR transitioning from its own platform (Vivaria) to Inspect suggests the framework meets the requirements of demanding autonomy-evaluation workloads.
  • Community still growing: 50+ contributors is healthy for a specialized tool but small compared to general-purpose ML frameworks.
  • Eval quality varies: The 100+ pre-built evals in Inspect Evals range from well-validated to experimental. Users should verify evaluation quality for their specific use case.
  • Python-only: If your evaluation infrastructure is not Python-based, integration may require adapters.