What It Does
Inspect AI is an open-source framework for large language model evaluations created by the UK AI Safety Institute (AISI). Open-sourced in May 2024, it provides a standardized way to build, run, and analyze evaluations measuring coding ability, agentic task completion, reasoning, knowledge, behavioral safety, and multi-modal understanding.
Inspect is now the recommended replacement for METR’s Vivaria platform. It has broader community adoption (50+ contributors from safety institutes, frontier labs, and research organizations) and ships 100+ pre-built evaluations ready to run against any supported model. The framework includes a web-based viewer for monitoring evaluations, a VS Code extension for authoring and debugging, and flexible tool-calling support, including MCP tools.
Key Features
- 100+ pre-built evaluations (Inspect Evals) covering safety, capability, reasoning, and coding domains
- Straightforward Python interfaces for implementing custom evaluations
- Flexible tool-calling: built-in bash, Python, text editing, web search, web browsing, computer tools, plus MCP tool support
- Web-based Inspect View for monitoring and visualizing evaluation runs
- VS Code Extension for authoring and debugging evaluations
- Model-agnostic: works with any LLM provider
- Composable evaluation components for reuse across projects
- Agent evaluation support for multi-step, tool-using AI systems
- Inspect Evals repository: community-contributed evaluations maintained collaboratively by UK AISI, Arcadia Impact, and Vector Institute
- Active development with regular feature releases
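The composable dataset → solver → scorer pattern behind several of the features above can be illustrated with a minimal pure-Python sketch. This is a simplified stand-in for illustration only, not the `inspect_ai` API; real Inspect tasks use the framework's own `Task`, solver, and scorer types:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative stand-ins for the Sample/Task/solver/scorer concepts.
# NOT the inspect_ai API -- a sketch of the composition pattern only.

@dataclass
class Sample:
    input: str
    target: str

@dataclass
class Task:
    dataset: list[Sample]
    solver: Callable[[str], str]        # maps a prompt to a model answer
    scorer: Callable[[str, str], bool]  # compares an answer to the target

def run_task(task: Task) -> float:
    """Run every sample through the solver and return mean accuracy."""
    scores = [task.scorer(task.solver(s.input), s.target) for s in task.dataset]
    return sum(scores) / len(scores)

# A trivial "solver" standing in for a model call, plus an exact-match scorer.
echo_solver = lambda prompt: "4" if "2+2" in prompt else ""
exact = lambda answer, target: answer.strip() == target.strip()

demo = Task(
    dataset=[Sample(input="What is 2+2?", target="4")],
    solver=echo_solver,
    scorer=exact,
)
accuracy = run_task(demo)  # 1.0 on this toy dataset
```

Because each piece is a plain value, datasets, solvers, and scorers can be swapped independently and reused across tasks, which is the reuse property the feature list refers to.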
Use Cases
- AI safety evaluation: Assessing models for dangerous autonomous capabilities before deployment
- Benchmark development: Creating reproducible evaluations with standardized scoring
- Red-teaming: Running adversarial evaluations to find model failure modes
- Regulatory compliance: Government and enterprise teams evaluating AI systems against safety standards
- Research: Academic and industry researchers needing a flexible evaluation harness
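For the benchmark-development use case, reproducibility hinges on standardized scoring: two runs must grade the same answer the same way. Below is a hedged sketch of answer normalization before exact-match comparison, for illustration only; Inspect ships its own scorers, whose exact normalization behavior may differ:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    superficial formatting differences don't change a benchmark score."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def normalized_match(answer: str, target: str) -> bool:
    """Exact match after normalization -- a deterministic, model-free scorer."""
    return normalize(answer) == normalize(target)

# "Paris." and "paris" count as the same answer; "London" does not.
normalized_match("Paris.", "paris")   # True
normalized_match("Paris", "London")   # False
```

A deterministic scorer like this is what makes results comparable across models and across time; model-graded scoring, by contrast, trades that determinism for flexibility.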
Adoption Level Analysis
Small teams (<20 engineers): Fits well. Python-based, pip-installable, with pre-built evals that work out of the box. Low barrier to entry for basic model evaluation. The VS Code extension helps with development workflow.
Medium orgs (20-200 engineers): Good fit. Enough flexibility for custom evaluation development, and the community-maintained eval repository reduces duplicated effort. Reasonable learning curve.
Enterprise (200+ engineers): Strong fit. Backed by the UK government, which provides institutional stability. Multiple frontier labs and safety organizations already contribute. Suitable for compliance-oriented evaluation programs.
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| Vivaria (METR) | METR’s previous platform, now deprecated | Legacy investment only; migrate to Inspect |
| EleutherAI lm-evaluation-harness | Static model evals, no agent support | You need traditional benchmarking without agentic tasks |
| OpenAI Evals | OpenAI-ecosystem-specific | You are exclusively evaluating OpenAI models |
| Promptfoo | Developer-focused prompt testing | You need fast prompt iteration, not safety evaluation |
Evidence & Sources
- Inspect AI official documentation
- GitHub: UKGovernmentBEIS/inspect_ai
- UK AISI: Announcing Inspect Evals
- METR Vivaria: Comparison with Inspect
- Medium: Evaluating LLMs using UK AISI’s Inspect framework
Notes & Caveats
- Government-backed but open: UK AISI maintains Inspect but it is MIT-licensed with genuine community governance. The government backing provides stability but may raise neutrality concerns for organizations in other jurisdictions.
- METR endorsement is a strong signal: METR’s transition from its own platform (Vivaria) to Inspect suggests Inspect meets the requirements of demanding agentic and autonomy evaluations.
- Community still growing: 50+ contributors is healthy for a specialized tool but small compared to general-purpose ML frameworks.
- Eval quality varies: The 100+ pre-built evals in Inspect Evals range from well-validated to experimental. Users should verify evaluation quality for their specific use case.
- Python-only: If your evaluation infrastructure is not Python-based, integration may require adapters.
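One common adapter pattern for the Python-only caveat is to run evaluations in a Python sidecar and exchange results as JSON, so non-Python services consume only a serialized summary. The log schema below is hypothetical and exists purely for illustration; consult Inspect's documentation for its actual log format:

```python
import json

def summarize_eval_log(log_json: str) -> dict:
    """Reduce a (hypothetical) eval log to a flat, language-agnostic summary
    that any service can consume over a file, pipe, or HTTP boundary."""
    log = json.loads(log_json)
    samples = log.get("samples", [])
    passed = sum(1 for s in samples if s.get("score") == 1)
    return {
        "task": log.get("task", "unknown"),
        "total": len(samples),
        "passed": passed,
        "accuracy": passed / len(samples) if samples else 0.0,
    }

# Synthetic log, standing in for output written by a Python eval runner.
raw = json.dumps({
    "task": "demo_task",
    "samples": [{"score": 1}, {"score": 0}, {"score": 1}],
})
summary = summarize_eval_log(raw)  # passed=2 of total=3, accuracy=2/3
```

Keeping the boundary at serialized data means the non-Python side never imports the framework; only the summary schema must stay stable.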