Alternatives to SWE-bench
SWE-bench and three alternative tools evaluated on the Tekai technology radar.
SWE-bench
A benchmark evaluating whether AI agents can resolve real-world GitHub issues by generating code patches that pass repository test suites.
open-source MIT
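SWE-bench task instances are distributed as a Hugging Face dataset, so they can be inspected directly before running anything. A minimal sketch, assuming the `princeton-nlp/SWE-bench_Lite` dataset identifier and the field names listed on its public dataset card:

```python
# Minimal sketch: browse SWE-bench task instances.
# Assumes the `princeton-nlp/SWE-bench_Lite` Hugging Face dataset and the
# field names documented on its dataset card.
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

instance = dataset[0]
print(instance["instance_id"])        # unique identifier for the issue task
print(instance["repo"])               # source GitHub repository
print(instance["problem_statement"])  # the issue text the agent must resolve
print(instance["FAIL_TO_PASS"])       # tests a correct patch must make pass
```

Generated patches are then scored by the official evaluation harness, which applies each patch inside a containerized checkout of the repository and re-runs the designated tests.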
Alternatives
HCAST (Human-Calibrated Autonomy Software Tasks)
METR's primary benchmark for measuring how well frontier AI systems complete software tasks autonomously, calibrated against baselines from 140 human experts across 189 tasks.
open-source MIT
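Per METR's write-ups, HCAST tasks are defined in the METR Task Standard format, where each task family is a Python class exposing its task variants, instructions, and a scoring function. A rough sketch under that assumption; the class and method names follow the published Task Standard, while the concrete task shown here is hypothetical:

```python
# Rough sketch of a task family in the style of the METR Task Standard,
# which HCAST tasks reportedly follow. The "reverse_file" task is hypothetical.

class TaskFamily:
    standard_version = "0.3.0"  # assumed Task Standard version string

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Each key names a task variant; the value holds its parameters.
        return {"reverse_file": {"path": "/home/agent/input.txt"}}

    @staticmethod
    def get_instructions(t: dict) -> str:
        return f"Reverse the lines of {t['path']} and write the result to output.txt."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # 1.0 for success, 0.0 for failure; None defers to manual scoring.
        expected = open(t["path"]).read().splitlines()[::-1]
        try:
            actual = open("output.txt").read().splitlines()
        except FileNotFoundError:
            return 0.0
        return 1.0 if actual == expected else 0.0
```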
Humanity's Last Exam (HLE)
A 2,500-question expert-level benchmark curated by roughly 1,000 subject-matter specialists, designed to probe capabilities where frontier models still score only 40-50%.
open-source CC-BY-4.0
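The questions are also published as a Hugging Face dataset. A minimal sketch of browsing them, assuming the `cais/hle` dataset identifier and `question`/`answer` fields (both taken from the public dataset card; access may be gated behind the dataset's terms):

```python
# Minimal sketch: browse Humanity's Last Exam questions.
# Assumes the dataset is available as `cais/hle` on Hugging Face and that
# records expose `question` and `answer` fields; accepting the dataset's
# terms of use may be required before download.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")
sample = hle[0]
print(sample["question"])
print(sample["answer"])
```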
Inspect AI
An open-source LLM evaluation framework by the UK AI Safety Institute with 100+ pre-built evals for safety, coding, reasoning, and agent assessment.
open-source MIT
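Evals in Inspect are plain Python tasks assembled from a dataset, a solver, and a scorer. A minimal sketch using the framework's documented building blocks; the toy sample and model name are placeholders, not part of any shipped eval:

```python
# Minimal Inspect AI task: one toy sample, default generation, exact-match scoring.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def arithmetic_check():
    return Task(
        dataset=[Sample(input="What is 17 + 25? Answer with the number only.", target="42")],
        solver=generate(),
        scorer=match(),
    )

if __name__ == "__main__":
    # Run the eval against a chosen provider/model (placeholder shown).
    eval(arithmetic_check(), model="openai/gpt-4o-mini")
```

Running the script produces a log viewable in Inspect's built-in log viewer, which is how the 100+ pre-built evals are typically inspected as well.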
Comparison Summary
| Tool | Radar ring | Type | License |
|---|---|---|---|
| SWE-bench | assess | open-source | MIT |
| HCAST (Human-Calibrated Autonomy Software Tasks) | assess | open-source | MIT |
| Humanity's Last Exam (HLE) | assess | open-source | CC-BY-4.0 |
| Inspect AI | trial | open-source | MIT |