
RE-Bench


At a Glance

AI benchmark suite from METR for evaluating autonomous AI agent capabilities on real-world research engineering tasks.

  • Type: vendor
  • Pricing: open-source
  • License: MIT
  • Adoption fit: medium, enterprise
  • Top alternatives: SWE-bench, HumanEval, GAIA

What It Does

RE-Bench (Research Engineering Benchmark) is an evaluation suite developed by METR (Model Evaluation and Threat Research) for measuring AI agents’ capabilities on real-world research engineering tasks. It assesses how well AI systems can autonomously perform tasks like machine learning experimentation, data analysis, and software engineering in realistic environments.

RE-Bench provides standardized, reproducible evaluations that go beyond typical coding benchmarks by testing end-to-end research workflows, including hypothesis formation, implementation, experimentation, and analysis. AI labs and safety organizations use it to track progress in autonomous agent capabilities.
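
To make the workflow concrete, here is a minimal conceptual sketch in Python of a time-budgeted autonomous evaluation loop. The names (`evaluate`, `agent.act`, `task.apply`, `task.score`, `TIME_BUDGET_SECONDS`) are hypothetical stand-ins, not RE-Bench's actual harness API; they only illustrate the autonomous, budget-limited loop described above.

```python
# Illustrative sketch only: RE-Bench's real harness differs, and the names
# below (evaluate, agent.act, task.apply, task.score) are hypothetical.
import time

TIME_BUDGET_SECONDS = 8 * 60 * 60  # e.g. an eight-hour task budget

def evaluate(agent, task):
    """Run an agent on a task until it stops or the time budget runs out,
    then score whatever artifact it produced."""
    deadline = time.monotonic() + TIME_BUDGET_SECONDS
    observation = task.initial_state()
    while time.monotonic() < deadline:
        action = agent.act(observation)    # agent decides its next step autonomously
        if action is None:                 # agent declares it is finished
            break
        observation = task.apply(action)   # run code, inspect results, etc.
    return task.score()                    # task-defined score; higher is better
```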

Key Features

  • Real-world research tasks: Evaluations based on actual research engineering workflows, not synthetic benchmarks
  • Autonomous agent testing: Measures multi-step, autonomous task completion without human guidance
  • Standardized task format: Uses METR’s Task Standard for reproducible evaluation (a simplified sketch follows this list)
  • Time-limited evaluation: Tasks have defined time budgets, measuring efficiency alongside capability
  • Open-source infrastructure: Task definitions and evaluation framework are publicly available
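
The "Standardized task format" bullet refers to METR's Task Standard, which defines tasks as Python task families. The simplified sketch below follows that general shape (a `TaskFamily` class exposing `get_tasks`, `get_instructions`, and `score`), but the variant name, parameters, and version string are illustrative, and the example omits required setup hooks, so it is not a complete, conformant task definition.

```python
# Simplified, non-conformant sketch of a Task Standard-style task family.
class TaskFamily:
    standard_version = "0.3.0"  # Task Standard version targeted (illustrative)

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Each key names a task variant; the value holds its parameters.
        return {"optimize_kernel": {"time_budget_hours": 8}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        # Natural-language instructions shown to the agent at the start.
        return "Optimize the provided GPU kernel. Your score is the speedup achieved."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Return a numeric score for the agent's submission, or None if unscorable.
        try:
            return float(submission)
        except ValueError:
            return None
```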

Use Cases

  • AI labs evaluating autonomous agent capabilities for safety assessments
  • Researchers benchmarking new agent architectures against a standardized suite
  • Policymakers assessing the pace of AI autonomy advancement
  • Organizations evaluating AI coding agents for research engineering workflows

Adoption Level Analysis

Small teams (<20 engineers): Limited direct applicability unless the team is building or evaluating AI agents.

Medium orgs (20–200 engineers): Useful for teams building AI agent products who need standardized capability benchmarks.

Enterprise (200+ engineers): Valuable for AI labs and large organizations conducting responsible scaling assessments.

Alternatives

Alternative | Key difference | Prefer when…
SWE-bench | Focuses on GitHub issue resolution | You want to measure code-level bug-fixing ability specifically
HumanEval | Code generation from docstrings | You need simple function-level code generation benchmarks
GAIA | General AI assistant benchmark | You want broader assistant capabilities beyond research engineering

Notes & Caveats

  • RE-Bench results depend on the agent scaffold and tool access, not just the underlying model
  • Benchmark scores may not reflect real-world productivity improvements
  • METR is a nonprofit AI safety organization; the benchmark is designed with safety evaluation in mind
  • The benchmark suite evolves; results across different versions may not be directly comparable
