What It Does

SWE-bench is a benchmark for evaluating whether AI systems can resolve real-world GitHub issues. It presents an AI agent with a codebase and a natural-language issue description, then checks whether the agent’s proposed code changes pass the repository’s test suite. The benchmark was created at Princeton NLP (Carlos E. Jimenez, John Yang, and collaborators) and has become the de facto standard for evaluating AI coding agents.

The benchmark exists in several variants: the original SWE-bench (2,294 instances from 12 Python repositories), SWE-bench Lite (300 instances subset), SWE-bench Verified (500 human-validated instances, curated with OpenAI), SWE-bench Live (contamination-free rolling benchmark from post-training-cutoff issues), and SWE-bench Pro (Scale AI’s enhanced variant). The Verified variant has been the most widely reported in leaderboards through early 2026.

Key Features

Real-world task instances derived from actual GitHub pull requests across 12 popular Python repositories (Django, Flask, scikit-learn, sympy, etc.)
Automated evaluation via repository test suites — patches are applied and tests are executed to determine pass/fail
Human-validated Verified subset (500 instances) filtering out ambiguous or poorly-specified issues
SWE-bench Live variant providing contamination-free evaluation by sourcing issues after model training cutoffs
Publicly hosted leaderboard at swebench.com tracking agent performance over time
Hugging Face dataset hosting for easy programmatic access
Integration with multiple agent frameworks (OpenHands, SWE-Agent, Aider, etc.)

Use Cases

Evaluating and comparing AI coding agent capabilities on realistic software engineering tasks
Tracking frontier model progress on autonomous code generation over time
Research evaluation for new agent architectures and prompting strategies
Vendor selection — comparing agent platforms on a standardized task set

Adoption Level Analysis

Small teams (<20 engineers): Useful as a reference point when evaluating which AI coding tool to adopt. Running the benchmark yourself requires significant compute and setup but the leaderboard results are freely accessible.

Medium orgs (20-200 engineers): Relevant for teams building or evaluating AI coding infrastructure. The benchmark framework can be extended for internal evaluation of custom agents.

Enterprise (200+ engineers): Important as a standardized evaluation framework, but enterprises should complement it with domain-specific evaluations. SWE-bench is Python-only and covers only issue resolution, which is a fraction of real software engineering work.

Alternatives

Alternative	Key Difference	Prefer when…
HCAST (METR)	Measures autonomous task completion breadth, not just code patches	You want to evaluate general autonomous agent capabilities, not just code fixing
SWE-bench Pro (Scale AI)	Enhanced version with better quality control	You want fewer noisy/ambiguous instances
OpenHands Index	Multi-domain (issue resolution + greenfield + frontend + info gathering)	You want broader coverage of software engineering skills beyond bug fixing
Terminal-Bench	Evaluates CLI-based agent workflows	You specifically evaluate terminal-based coding agents

Evidence & Sources

SWE-bench ICLR 2024 Paper — original benchmark paper and methodology
OpenAI: Why We No Longer Evaluate SWE-bench Verified — major criticism from OpenAI arguing the benchmark is no longer useful for frontier evaluation
METR: Many SWE-bench-Passing PRs Would Not Be Merged (Mar 2026) — independent finding that ~50% of passing patches are not production-quality
SWE-bench Goes Live (arXiv) — contamination-free variant addressing data leakage
SWE-BENCH+ Enhanced Coding Benchmark (OpenReview) — analysis of answer leakage and weak test cases
SWE-bench Verified Leaderboard (Epoch AI) — independent leaderboard tracking

Notes & Caveats

Approaching saturation: Top agents now score 77-81% on SWE-bench Verified, approaching a ceiling. OpenAI has publicly stopped reporting Verified scores and recommends moving to SWE-bench Pro.
Data contamination risk: SWE-bench tasks come from popular open-source repos (Django, scikit-learn) that are in most LLM training sets. SWE-bench Live was created specifically to address this, but scores drop dramatically on Live (~19% vs 60%+ on Verified).
Python-only, issue-resolution-only: The benchmark covers only one programming language (Python) and one task type (fixing issues in existing codebases). It does not test greenfield development, frontend work, documentation, testing, or multi-language projects.
Weak test validation: SWE-BENCH+ analysis found 22.6% of Verified instances have answer leakage problems and 15.2% have weak test cases that accept incorrect solutions. OpenAI audited a subset and found 59.4% had flawed test cases.
METR quality assessment: METR independently found roughly half of test-passing SWE-bench PRs from 2024-2025 agents would not be accepted by repository maintainers, indicating the benchmark overestimates real-world utility.
Git history shortcut: CMU researchers (using Hodoscope) found agents could access git history to copy original code patches, inflating scores. This was mitigated by switching to shallow clones.
Score interpretation: SWE-bench scores reflect the combined agent+model system, not the model alone. The same model can score very differently depending on the agent harness, prompting strategy, and tool availability.
Despite limitations, still the standard: SWE-bench remains the most widely cited benchmark for AI coding agents. No single alternative has achieved comparable adoption, making it a necessary (if imperfect) evaluation tool.

SWE-bench

At a Glance

What It Does

Key Features

Use Cases

Adoption Level Analysis

Alternatives

Evidence & Sources

Notes & Caveats

Related

HCAST (Human-Calibrated Autonomy Software Tasks)

Humanity's Last Exam (HLE)

LiveCodeBench

MMLU (Massive Multitask Language Understanding)