What It Does
SWE-bench is a benchmark for evaluating whether AI systems can resolve real-world GitHub issues. It presents an AI agent with a codebase and a natural-language issue description, then checks whether the agent’s proposed code changes pass the repository’s test suite. The benchmark was created at Princeton NLP (Carlos E. Jimenez, John Yang, and collaborators) and has become the de facto standard for evaluating AI coding agents.
The benchmark exists in several variants: the original SWE-bench (2,294 instances from 12 Python repositories), SWE-bench Lite (300 instances subset), SWE-bench Verified (500 human-validated instances, curated with OpenAI), SWE-bench Live (contamination-free rolling benchmark from post-training-cutoff issues), and SWE-bench Pro (Scale AI’s enhanced variant). The Verified variant has been the most widely reported in leaderboards through early 2026.
Key Features
- Real-world task instances derived from actual GitHub pull requests across 12 popular Python repositories (Django, Flask, scikit-learn, sympy, etc.)
- Automated evaluation via repository test suites — patches are applied and tests are executed to determine pass/fail
- Human-validated Verified subset (500 instances) filtering out ambiguous or poorly-specified issues
- SWE-bench Live variant providing contamination-free evaluation by sourcing issues after model training cutoffs
- Publicly hosted leaderboard at swebench.com tracking agent performance over time
- Hugging Face dataset hosting for easy programmatic access
- Integration with multiple agent frameworks (OpenHands, SWE-Agent, Aider, etc.)
Use Cases
- Evaluating and comparing AI coding agent capabilities on realistic software engineering tasks
- Tracking frontier model progress on autonomous code generation over time
- Research evaluation for new agent architectures and prompting strategies
- Vendor selection — comparing agent platforms on a standardized task set
Adoption Level Analysis
Small teams (<20 engineers): Useful as a reference point when evaluating which AI coding tool to adopt. Running the benchmark yourself requires significant compute and setup but the leaderboard results are freely accessible.
Medium orgs (20-200 engineers): Relevant for teams building or evaluating AI coding infrastructure. The benchmark framework can be extended for internal evaluation of custom agents.
Enterprise (200+ engineers): Important as a standardized evaluation framework, but enterprises should complement it with domain-specific evaluations. SWE-bench is Python-only and covers only issue resolution, which is a fraction of real software engineering work.
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| HCAST (METR) | Measures autonomous task completion breadth, not just code patches | You want to evaluate general autonomous agent capabilities, not just code fixing |
| SWE-bench Pro (Scale AI) | Enhanced version with better quality control | You want fewer noisy/ambiguous instances |
| OpenHands Index | Multi-domain (issue resolution + greenfield + frontend + info gathering) | You want broader coverage of software engineering skills beyond bug fixing |
| Terminal-Bench | Evaluates CLI-based agent workflows | You specifically evaluate terminal-based coding agents |
Evidence & Sources
- SWE-bench ICLR 2024 Paper — original benchmark paper and methodology
- OpenAI: Why We No Longer Evaluate SWE-bench Verified — major criticism from OpenAI arguing the benchmark is no longer useful for frontier evaluation
- METR: Many SWE-bench-Passing PRs Would Not Be Merged (Mar 2026) — independent finding that ~50% of passing patches are not production-quality
- SWE-bench Goes Live (arXiv) — contamination-free variant addressing data leakage
- SWE-BENCH+ Enhanced Coding Benchmark (OpenReview) — analysis of answer leakage and weak test cases
- SWE-bench Verified Leaderboard (Epoch AI) — independent leaderboard tracking
Notes & Caveats
- Approaching saturation: Top agents now score 77-81% on SWE-bench Verified, approaching a ceiling. OpenAI has publicly stopped reporting Verified scores and recommends moving to SWE-bench Pro.
- Data contamination risk: SWE-bench tasks come from popular open-source repos (Django, scikit-learn) that are in most LLM training sets. SWE-bench Live was created specifically to address this, but scores drop dramatically on Live (~19% vs 60%+ on Verified).
- Python-only, issue-resolution-only: The benchmark covers only one programming language (Python) and one task type (fixing issues in existing codebases). It does not test greenfield development, frontend work, documentation, testing, or multi-language projects.
- Weak test validation: SWE-BENCH+ analysis found 22.6% of Verified instances have answer leakage problems and 15.2% have weak test cases that accept incorrect solutions. OpenAI audited a subset and found 59.4% had flawed test cases.
- METR quality assessment: METR independently found roughly half of test-passing SWE-bench PRs from 2024-2025 agents would not be accepted by repository maintainers, indicating the benchmark overestimates real-world utility.
- Git history shortcut: CMU researchers (using Hodoscope) found agents could access git history to copy original code patches, inflating scores. This was mitigated by switching to shallow clones.
- Score interpretation: SWE-bench scores reflect the combined agent+model system, not the model alone. The same model can score very differently depending on the agent harness, prompting strategy, and tool availability.
- Despite limitations, still the standard: SWE-bench remains the most widely cited benchmark for AI coding agents. No single alternative has achieved comparable adoption, making it a necessary (if imperfect) evaluation tool.