LiveCodeBench
What It Does
LiveCodeBench is an LLM evaluation benchmark specifically designed to resist data contamination. Unlike static benchmarks (HumanEval, MBPP) that become embedded in training data over time, LiveCodeBench continuously harvests new competitive programming problems from LeetCode, AtCoder, and Codeforces — platforms that publish new problems on a rolling basis. Problems released after a model’s training cutoff cannot be in its training data, providing a cleaner signal of genuine capability.
The benchmark tracks multiple versioned snapshots (v1 through v6 as of early 2026), allowing longitudinal comparison across model generations. Each version includes a new cohort of problems collected over a defined time window. LiveCodeBench v6 (Feb–May 2025) contains 142 problems; v5 (Aug 2024–Feb 2025) contains 374 problems. Problems are stratified into easy, medium, and hard tiers based on the source platforms' community difficulty ratings.
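To make the contamination-resistance mechanism concrete, here is a minimal sketch of filtering a versioned problem set against a model's training cutoff. The `Problem` fields (`release_date`, `difficulty`, and the example IDs) are illustrative placeholders, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    platform: str       # "leetcode" | "atcoder" | "codeforces"
    difficulty: str     # "easy" | "medium" | "hard"
    release_date: date

def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]

problems = [
    Problem("lc-3123", "leetcode", "medium", date(2025, 3, 14)),
    Problem("cf-1920F", "codeforces", "hard", date(2024, 1, 20)),
]

# A model with a 2024-12-01 training cutoff is scored only on later problems.
eligible = contamination_free(problems, training_cutoff=date(2024, 12, 1))
print([p.problem_id for p in eligible])  # ['lc-3123']
```

Each versioned snapshot is effectively one such date window applied to the rolling problem stream.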
Key Features
- Contamination resistance: problems are harvested on a rolling basis, so each model can be evaluated only on problems released after its own training cutoff
- Versioned snapshots: dated versions enable year-over-year model comparison, provided results are reported against the same snapshot
- Four evaluation scenarios: code generation (primary), self-repair, code execution, and test output prediction
- Difficulty stratification: easy / medium / hard tiers matching competitive programming conventions
- pass@1 and pass@k metrics: both reported; pass@5 rewards diverse candidate solutions (see the estimator sketch after this list)
- Publicly available: full problem sets and evaluation harness available on GitHub
- Industry adoption: used as a primary benchmark in frontier model papers (Apple SSD, DeepSeek, Qwen3 evaluations)
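The pass@k numbers follow the standard unbiased estimator from the HumanEval/Codex paper: generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k randomly chosen samples passes. A minimal sketch of that estimator (the official harness on GitHub has its own implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem.

    n: samples generated, c: samples that passed all tests,
    k: attempt budget being scored.
    """
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of which pass.
print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```

Benchmark-level pass@k is the mean of this quantity over all problems in the chosen version.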
Use Cases
- Evaluating code generation models on problems likely not in training data, when contamination is a primary concern
- Longitudinal tracking of a model family’s progress across training runs (v5 vs v6 comparison)
- Research papers needing a community-standard coding benchmark with difficulty stratification
Adoption Level Analysis
Small teams (<20 engineers): Fits — the evaluation harness is open-source, problems are public, and running evaluations requires only the model endpoint and Python tooling (a minimal evaluation loop is sketched after this section). Useful for any team evaluating which open-weight coding model to deploy.
Medium orgs (20–200 engineers): Fits — appropriate for model selection decisions and validating fine-tuning improvements. The versioned structure makes it easy to maintain consistent evaluation protocols across team members.
Enterprise (200+ engineers): Fits for ML platform teams as part of an evaluation suite, though enterprises typically need additional domain-specific benchmarks that reflect their actual codebase and task types rather than competitive programming alone.
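As a rough illustration of what "only the model endpoint and Python tooling" means in practice, the loop below scores pass@1 by running each candidate program against stdin/stdout test cases. This is a sketch, not the official harness: `generate_solution` stands in for whatever model-endpoint call a team uses, and the `statement`/`tests` keys are assumed field names.

```python
import subprocess
import sys

def passes_all_tests(code: str, tests: list[tuple[str, str]], timeout_s: float = 5.0) -> bool:
    """Run candidate code as a standalone script against stdin/stdout test cases."""
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def pass_at_1(problems: list[dict], generate_solution) -> float:
    """Fraction of problems solved on the first attempt.

    `generate_solution` is whatever call reaches the model endpoint; each
    problem dict is assumed to carry a 'statement' plus ('stdin', 'expected
    stdout') pairs under 'tests'.
    """
    solved = sum(
        passes_all_tests(generate_solution(p["statement"]), p["tests"])
        for p in problems
    )
    return solved / len(problems)
```

A production setup should additionally sandbox execution and enforce memory limits before running untrusted model output.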
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| SWE-bench | Evaluates on real GitHub issues requiring codebase navigation | Testing agent-level capability (find, edit, run) rather than isolated code generation |
| HumanEval / MBPP | Simpler problems, likely contaminated in training data | Quick sanity check; not for frontier model discrimination |
| LiveCodeBench Pro | Harder (Codeforces/ICPC/IOI problems); frontier models score near 0% on hard | Testing absolute capability ceiling; not for comparing instruct models |
| Humanity’s Last Exam (HLE) | Multi-domain, not code-specific | Broader capability assessment across STEM domains |
Evidence & Sources
- LiveCodeBench GitHub
- LiveCodeBench leaderboard (Artificial Analysis)
- LiveCodeBench Pro paper (arXiv:2506.11928)
- BenchLM SWE-bench & LiveCodeBench leaderboard (March 2026)
Notes & Caveats
- Narrow domain: LiveCodeBench exclusively tests competitive programming style problems (algorithmic, data structures, combinatorics). Scores do not predict performance on real engineering tasks, API usage, codebase navigation, or multi-file edits. Treat scores as one signal among several.
- Top scores approaching ceiling: As of April 2026, top models (Gemini 3 Pro Preview) score 91.7% on LiveCodeBench overall. Easy problems are approaching saturation for frontier models. The benchmark’s discriminating power is migrating toward hard problems and the LiveCodeBench Pro variant.
- Version incompatibility: v5 and v6 scores are not directly comparable due to different problem sets; papers frequently report different versions, making leaderboard comparisons across papers error-prone.
- Gap to expert human performance: Even the highest-scoring models (53% pass@1 on medium problems in LiveCodeBench Pro without tools, 0% on hard) perform well below human competitive programmers. Results reflect benchmark-specific capability, not general software engineering competence.