What It Does
RE-Bench (Research Engineering Benchmark) is an evaluation suite developed by METR (Model Evaluation and Threat Research) for measuring AI agents’ capabilities on real-world research engineering tasks. Its initial release comprises seven open-ended machine learning research engineering environments, such as optimizing a GPU kernel, recovering a model with corrupted embeddings, and running scaling-law experiments, and it assesses how well AI systems can complete such work autonomously in realistic environments.
RE-Bench provides standardized, reproducible evaluations that go beyond typical coding benchmarks by testing end-to-end research workflows: forming hypotheses, implementing solutions, running experiments, and analyzing results. Agent performance is compared against baselines from human experts attempting the same tasks under the same conditions, and the suite is used by AI labs and safety organizations to track autonomous agent capabilities.
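To make results comparable across environments, the RE-Bench paper reports normalized scores: the provided starting solution maps to 0 and a human-written reference solution maps to 1. A minimal sketch of that normalization, with invented raw scores:

```python
# Sketch of RE-Bench-style score normalization: the starting solution maps
# to 0 and a human reference solution maps to 1. All numbers are invented.

def normalize(raw: float, starting: float, reference: float) -> float:
    return (raw - starting) / (reference - starting)

# Hypothetical kernel-optimization environment:
starting_raw, reference_raw = 10.0, 25.0
agent_raw = 19.0
print(normalize(agent_raw, starting_raw, reference_raw))  # 0.6
```

A normalized score above 1 is possible and simply means the solution beat the human reference.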
Key Features
- Real-world research tasks: Evaluations are based on actual research engineering workflows rather than synthetic toy problems
- Autonomous agent testing: Measures multi-step, autonomous task completion without human guidance
- Standardized task format: Uses METR’s Task Standard for reproducible evaluation (see the task-family sketch after this list)
- Time-limited evaluation: Tasks have defined time budgets, so runs measure efficiency alongside raw capability (see the budget sketch below)
- Open-source infrastructure: Task definitions and evaluation framework are publicly available
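Under METR’s Task Standard, a task family is a Python class that the evaluation harness drives through a small set of hooks. The sketch below follows the TaskFamily interface as METR publishes it, but the task content, the variants, and the `measure_speedup` helper are invented for illustration:

```python
# Minimal sketch of a task family in the style of METR's Task Standard.
# The class name and method signatures follow the published interface;
# the toy "speed up this script" task itself is hypothetical.

def measure_speedup(submission: str) -> float:
    # Stand-in for re-running the task's benchmark; a real task would
    # execute the submitted code and time it against the starting solution.
    return 1.0

class TaskFamily:
    standard_version = "0.3.0"  # version of the Task Standard targeted

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Each named variant carries its own parameters.
        return {
            "small": {"budget_hours": 2, "target_speedup": 1.5},
            "large": {"budget_hours": 8, "target_speedup": 3.0},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        return (
            f"Optimize train.py to run at least {t['target_speedup']}x faster "
            f"without changing its outputs. Your time budget is {t['budget_hours']} hours."
        )

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Re-run the task's benchmark and map the measured speedup to [0, 1].
        return min(measure_speedup(submission) / t["target_speedup"], 1.0)
```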
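Time budgets also shape how results are read: METR’s RE-Bench paper compares, for example, a single long run against the best of several shorter runs inside the same total budget. A rough sketch of that best-of-k bookkeeping, with made-up run data:

```python
# Hedged sketch: best score achievable within a fixed time budget,
# comparing one long run against best-of-k shorter runs. Data is invented.

def best_within_budget(runs: list[tuple[float, float]], budget_hours: float) -> float:
    """runs is a list of (duration_hours, score); take runs in order until
    the budget is exhausted and report the best score seen."""
    spent, best = 0.0, 0.0
    for duration, score in runs:
        if spent + duration > budget_hours:
            break
        spent += duration
        best = max(best, score)
    return best

one_long_run = [(8.0, 0.72)]
eight_short_runs = [(1.0, s) for s in (0.30, 0.55, 0.41, 0.62, 0.58, 0.49, 0.66, 0.52)]

print(best_within_budget(one_long_run, 8.0))      # 0.72
print(best_within_budget(eight_short_runs, 8.0))  # 0.66: best of eight 1-hour attempts
```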
Use Cases
- AI labs evaluating autonomous agent capabilities for safety assessments
- Researchers benchmarking new agent architectures against a standardized suite
- Policymakers assessing the pace of AI autonomy advancement
- Organizations evaluating AI coding agents for research engineering workflows
Adoption Level Analysis
Small teams (<20 engineers): Limited direct applicability unless building or evaluating AI agents.
Medium orgs (20–200 engineers): Useful for teams building AI agent products who need standardized capability benchmarks.
Enterprise (200+ engineers): Valuable for AI labs and large organizations conducting responsible scaling assessments.
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| SWE-bench | Focuses on GitHub issue resolution | You want to measure code-level bug fixing ability specifically |
| HumanEval | Code generation from docstrings | You need simple function-level code generation benchmarks |
| GAIA | General AI assistant benchmark | You want broader assistant capabilities beyond research engineering |
Evidence & Sources
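- METR, “RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts” (arXiv:2411.15114, 2024)
- RE-Bench task code: https://github.com/METR/ai-rd-tasks
- METR Task Standard: https://github.com/METR/task-standard
- METR: https://metr.org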
Notes & Caveats
- RE-Bench results depend on the agent scaffold and tool access, not just the underlying model
- Benchmark scores may not reflect real-world productivity improvements
- METR is a nonprofit AI safety organization; the benchmark is designed with safety evaluation in mind
- The benchmark suite evolves; results across different versions may not be directly comparable