
AIs can now often do massive easy-to-verify SWE tasks and I've updated towards shorter timelines

Ryan Greenblatt | April 7, 2026 | opinion | medium credibility


Source: LessWrong / Redwood Research Blog | Author: Ryan Greenblatt | Published: ~March 2026 | Category: opinion | Credibility: medium

Executive Summary

  • Ryan Greenblatt (Chief Scientist, Redwood Research) has nearly doubled his estimated probability of “full AI R&D automation by EOY 2028” from ~15% to just under 30%, citing unexpectedly strong frontier model performance in early 2026.
  • The update is driven by observed capabilities on what he calls “ES tasks” (easy-and-cheap-to-verify SWE tasks requiring minimal novel ideation) — typified by Anthropic’s demonstration of 16 Claude Opus 4.6 agents building a 100,000-line C compiler for $20,000 with no human code contribution.
  • He cites METR’s measured ~3.5-month doubling time on 50%-reliability time horizons in 2025, and a “big jump at the start of 2026,” predicting 2026 progress will outpace 2025 due to training compute scaling and “scaffolding overhang.”

Critical Analysis

Claim: “AI systems will have a 50%-reliability time horizon of years to decades on reasonably difficult ES tasks by EOY 2026”

  • Evidence quality: anecdotal + author extrapolation from METR benchmarks
  • Assessment: Greenblatt is extrapolating from the METR time-horizon trend line (3.5-month doubling time) combined with his own model-performance observations. The trend is independently corroborated — METR’s TH1.1 (January 2026) measured a post-2023 doubling time of 131 days (~4.4 months), and a SWE-Bench Verified derived dataset shows under 3 months. However, “years to decades” is a dramatic projection that lies well beyond the current range of METR’s calibrated task suite (max ~8-10 hours per task), requiring extrapolation from a logistic regression model past all calibrated data points. METR themselves have cautioned about this extrapolation limit.
  • Counter-argument: METR’s own August 2025 research found that 38% algorithmic success rates on benchmarks corresponded to 0% mergeable PRs in real codebases — a fundamental gap between benchmark performance and actual utility. Extrapolating a “years of work” time horizon from a benchmark calibrated only up to 8-hour tasks involves at minimum two untested assumptions: that the log-linear trend continues into much longer tasks, and that the algorithmic-vs-holistic gap does not widen as task complexity grows. Repo maintainers have also been shown to be 5–18x faster than METR’s baseline testers, which inflates apparent AI capability relative to expert human workers.
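
To make the extrapolation gap concrete, here is a back-of-envelope calculation under stated assumptions. Only the 3.5-month doubling time and the ~8-hour ceiling come from the source; the 2,000-hour person-year and the choice of starting point are illustrative.

```python
from math import log2

# Illustrative assumptions: start from an 8-hour 50%-reliability horizon
# (roughly the top of METR's calibrated suite) and count a person-year
# as ~2,000 working hours. The 3.5-month doubling time is from the source.
start_hours = 8.0
doubling_months = 3.5
work_year_hours = 2000.0

doublings = log2(work_year_hours / start_hours)  # doublings needed for a one-year horizon
months = doublings * doubling_months             # calendar time at the measured trend

print(f"{doublings:.1f} doublings, ~{months:.0f} months to a one-year horizon")
```

On these numbers the trend alone needs roughly eight doublings (~28 months) to reach even a one-year horizon, i.e. well past EOY 2026. The projection therefore embeds a further assumption: either ES tasks already sit far above the calibrated suite, or progress accelerates sharply.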

Claim: “16 Claude Opus 4.6 agents autonomously wrote a C compiler — a major demonstration of large-scale ES task completion”

  • Evidence quality: case-study (Anthropic internal, publicly described)
  • Assessment: The Anthropic engineering blog post by Nicholas Carlini confirms the broad facts: 16 Claude Code agents working in parallel on a shared Git repo produced a 100,000-line Rust-based C compiler capable of building Linux 6.9 across x86, ARM, and RISC-V, at a cost of ~$20,000 over ~2,000 sessions. This is a genuine capability demonstration. Greenblatt’s framing as “minimal human guidance” is broadly supported.
  • Counter-argument: Multiple independent reviews (The Register, InfoQ) note that the “autonomy” was materially bounded: a human researcher (Carlini) defined the architecture, set the project scope, built the testing harness and CI pipeline, and intervened at dead ends. InfoQ’s analysis observed that “much of the real work that made the project function involved designing the environment around the AI model agents rather than writing compiler code directly.” The $20,000 cost is also a non-trivial factor: this is not a task an average developer would run routinely, and economic viability matters for claims about AI supplanting human software work. The compiler was also built in Rust by agents trained on large corpora of compiler implementations — this is an ES task in Greenblatt’s own framing (well-specified, fully verifiable), and does not straightforwardly generalize to tasks requiring architectural novelty.
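
The economic-viability point can be made concrete with the unit costs implied by the reported figures (~$20,000 over ~2,000 sessions for ~100,000 lines); a trivial sketch:

```python
# Unit costs implied by the publicly reported figures for the compiler demo.
total_cost_usd = 20_000
sessions = 2_000
lines_of_code = 100_000

cost_per_session = total_cost_usd / sessions    # $10 per agent session
cost_per_line = total_cost_usd / lines_of_code  # $0.20 per line of Rust

print(f"${cost_per_session:.2f} per session, ${cost_per_line:.2f} per line")
```

Twenty cents per line is cheap next to human development, but the aggregate spend is what bears on whether an average developer would run such a task routinely.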

Claim: “Probability of full AI R&D automation by EOY 2028 is now just below 30% (up from ~15%)”

  • Evidence quality: anecdotal (personal probability estimate, not a formal model)
  • Assessment: This is an expert opinion from a credible AI safety researcher at Redwood Research, which has published serious work on AI control and model evaluation. The update is internally coherent — if time-horizon doubling continues at 3.5–4.5 months, reaching “AI R&D capable” levels by 2028 is a plausible extrapolation. However, “full AI R&D automation” is a loosely defined threshold. Greenblatt himself distinguishes ES tasks (well-specified, verifiable) from general R&D, which requires substantial novel ideation. A system that can write a compiler is meaningfully different from one that can generate and evaluate novel research hypotheses.
  • Counter-argument: Surveyed AI researchers as a population give 50% probability of “full labor automation” in the 2100s, and automating AI research specifically in the 2060s (EA Forum analysis of expert surveys). While insider observers at frontier labs plausibly have informational advantages, there are also well-documented biases: proximity to capability gains encourages short-timeline updates; selection effects mean LessWrong/EA writers skew toward timeline-shortening views; and frontier lab employees who regularly see impressive demos may systematically underweight deployment friction, economic constraints, and skill-transfer limits. The ~3x gap between this estimate and median expert surveys is meaningful and unexplained in the article.

Claim: “Progress in 2026 will be a decent amount faster than 2025, driven by training compute scaling and scaffolding overhang”

  • Evidence quality: anecdotal (prediction, not yet observed)
  • Assessment: The claim that pretraining compute will scale substantially in 2026 is corroborated by publicly available data center investment announcements from Microsoft, Google, and Amazon. The “scaffolding overhang” argument — that current models are underutilized because scaffolding has not caught up — is a legitimate structural observation that many practitioners have noted independently. However, this is a forward-looking claim made in March 2026, meaning it cannot yet be verified.
  • Counter-argument: MIT’s NANDA report, which found that 95% of generative AI pilots stall before driving revenue, suggests that scaffolding improvements have not reliably translated into real-world productivity gains at scale. The gap between frontier model capability and enterprise deployment is consistent with a scaffolding overhang, but it is equally consistent with harder-to-solve problems: data integration, organizational change management, regulatory compliance, and the irreducible cost of context-specific fine-tuning. “Scaffolding overhang” may be a real phenomenon that nevertheless yields only marginal capability gains if the bottleneck lies elsewhere.

Credibility Assessment

  • Author background: Ryan Greenblatt is Chief Scientist at Redwood Research, a 501(c)(3) AI safety nonprofit with ~$21M in funding from Open Philanthropy and SFF. He has substantial LessWrong karma (23,506) and has been cited extensively in AI safety literature. He is a serious technical researcher, not a hype commentator.
  • Publication bias: LessWrong and the Redwood Research blog are community-moderated but reflect the views of the AI safety/EA community, which has a well-documented selection bias toward shorter timelines and near-term AI risk scenarios. The audience and writing norms on LessWrong make doom-adjacent updates socially comfortable in a way that may subtly influence framing.
  • Verdict: medium — The author is credible and the factual claims about model performance are broadly corroborated by independent sources. The forward-looking probability estimates are coherent but represent one informed expert’s view, not a consensus or formal model. The evidence for “years to decades time horizon by EOY 2026” relies on extrapolation beyond METR’s calibrated range. The 30% R&D automation estimate is 3x the median expert survey and the gap deserves more explanation than the article provides.