LiveCodeBench
What It Does
LiveCodeBench is an LLM evaluation benchmark specifically designed to resist data contamination. Unlike static benchmarks (HumanEval, MBPP) that become embedded in training data over time, LiveCodeBench continuously harvests new competitive programming problems from LeetCode, AtCoder, and Codeforces — platforms that publish new problems on a rolling basis. Problems released after a model’s training cutoff cannot be in its training data, providing a cleaner signal of genuine capability.
The benchmark tracks multiple versioned snapshots (v1 through v6 as of early 2026), allowing longitudinal comparison across model generations. Each version includes a new cohort of problems collected over a defined time window. LiveCodeBench v6 (Feb–May 2025) contains 142 problems; v5 (Aug 2024–Feb 2025) contains 374 problems. Problems are stratified into easy, medium, and hard tiers based on the source platforms' community difficulty ratings.
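To make the contamination-resistance mechanism concrete, here is a minimal sketch of filtering a versioned problem set against a model's training cutoff. The `Problem` fields (`release_date`, `difficulty`, and the example IDs) are illustrative placeholders, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    platform: str       # "leetcode" | "atcoder" | "codeforces"
    difficulty: str     # "easy" | "medium" | "hard"
    release_date: date

def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]

problems = [
    Problem("lc-3123", "leetcode", "medium", date(2025, 3, 14)),
    Problem("cf-1920F", "codeforces", "hard", date(2024, 1, 20)),
]

# A model with a 2024-12-01 training cutoff is scored only on later problems.
eligible = contamination_free(problems, training_cutoff=date(2024, 12, 1))
print([p.problem_id for p in eligible])  # ['lc-3123']
```

Each versioned snapshot is effectively one such date window applied to the rolling problem stream.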
Key Features
- Contamination resistance: problems are harvested on a rolling basis, so each model can be evaluated only on problems released after its own training cutoff
- Versioned snapshots: dated versions enable year-over-year model comparison, provided results are reported against the same snapshot
- Four evaluation scenarios: code generation (primary), self-repair, code execution, and test output prediction
- Difficulty stratification: easy / medium / hard tiers matching competitive programming conventions
- pass@1 and pass@k metrics: both reported; pass@5 rewards diverse candidate solutions (see the estimator sketch after this list)
- Publicly available: full problem sets and evaluation harness available on GitHub
- Industry adoption: used as a primary benchmark in frontier model papers (Apple SSD, DeepSeek, Qwen3 evaluations)
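The pass@k numbers follow the standard unbiased estimator from the HumanEval/Codex paper: generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k randomly chosen samples passes. A minimal sketch of that estimator (the official harness on GitHub has its own implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem.

    n: samples generated, c: samples that passed all tests,
    k: attempt budget being scored.
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of which pass.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```

Benchmark-level pass@k is the mean of this quantity over all problems in the chosen version.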
Use Cases
- Evaluating code generation models on problems likely not in training data, when contamination is a primary concern
- Longitudinal tracking of a model family’s progress across training runs (v5 vs v6 comparison)
- Research papers needing a community-standard coding benchmark with difficulty stratification
Adoption Level Analysis
Small teams (<20 engineers): Fits — the evaluation harness is open-source, problems are public, and running evaluations requires only the model endpoint and Python tooling (a minimal evaluation loop is sketched after this section). Useful for any team evaluating which open-weight coding model to deploy.
Medium orgs (20–200 engineers): Fits — appropriate for model selection decisions and validating fine-tuning improvements. The versioned structure makes it easy to maintain consistent evaluation protocols across team members.
Enterprise (200+ engineers): Fits for ML platform teams as part of an evaluation suite, though enterprises typically need additional domain-specific benchmarks that reflect their actual codebase and task types rather than competitive programming alone.
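As a rough illustration of what "only the model endpoint and Python tooling" means in practice, the loop below scores pass@1 by running each candidate program against stdin/stdout test cases. This is a sketch, not the official harness: `generate_solution` stands in for whatever model-endpoint call a team uses, and the `statement`/`tests` keys are assumed field names.

```python
import subprocess
import sys

def passes_all_tests(code: str, tests: list[tuple[str, str]], timeout_s: float = 5.0) -> bool:
    """Run candidate code as a standalone script against stdin/stdout test cases."""
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def pass_at_1(problems: list[dict], generate_solution) -> float:
    """Fraction of problems solved on the first attempt.

    `generate_solution` is whatever call reaches the model endpoint; each
    problem dict is assumed to carry a 'statement' plus ('stdin', 'expected
    stdout') pairs under 'tests'.
    """
    solved = sum(
        passes_all_tests(generate_solution(p["statement"]), p["tests"])
        for p in problems
    )
    return solved / len(problems)
```

A production setup should additionally sandbox execution and enforce memory limits before running untrusted model output.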
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| SWE-bench | Evaluates on real GitHub issues requiring codebase navigation | Testing agent-level capability (find, edit, run) rather than isolated code generation |
| HumanEval / MBPP | Simpler problems, likely contaminated in training data | Quick sanity check; not for frontier model discrimination |
| LiveCodeBench Pro | Harder (Codeforces/ICPC/IOI problems); frontier models score near 0% on hard | Testing absolute capability ceiling; not for comparing instruct models |
| Humanity’s Last Exam (HLE) | Multi-domain, not code-specific | Broader capability assessment across STEM domains |
Evidence & Sources
- LiveCodeBench GitHub
- LiveCodeBench leaderboard (Artificial Analysis)
- LiveCodeBench Pro paper (arXiv:2506.11928)
- BenchLM SWE-bench & LiveCodeBench leaderboard (March 2026)
Notes & Caveats
- Narrow domain: LiveCodeBench exclusively tests competitive programming style problems (algorithmic, data structures, combinatorics). Scores do not predict performance on real engineering tasks, API usage, codebase navigation, or multi-file edits. Treat scores as one signal among several.
- Top scores approaching ceiling: As of April 2026, top models (Gemini 3 Pro Preview) score 91.7% on LiveCodeBench overall. Easy problems are approaching saturation for frontier models. The benchmark’s discriminating power is migrating toward hard problems and the LiveCodeBench Pro variant.
- Version incompatibility: v5 and v6 scores are not directly comparable due to different problem sets; papers frequently report different versions, making leaderboard comparisons across papers error-prone.
- Gap to expert human performance: Even the highest-scoring models (53% pass@1 on medium problems in LiveCodeBench Pro without tools, 0% on hard) perform well below human competitive programmers. Results reflect benchmark-specific capability, not general software engineering competence.