Caveman: A Claude Code Skill That Reduces Output Tokens by Making Claude Talk Like a Caveman
Source: GitHub — JuliusBrussee/caveman | Author: Julius Brussee | Published: 2026-03-01 | Category: product-announcement | Credibility: low
Executive Summary
- Caveman is an open-source Claude Code skill (MIT license) that instructs Claude to respond in terse, article-dropped, caveman-style prose while preserving code blocks, technical terms, and error messages verbatim (a hypothetical skill-file sketch follows this summary).
- The project self-reports a 65% average output token reduction (range 22–87%) across 10 real-world software engineering tasks, with no reproducible methodology or independent replication disclosed.
- The author acknowledged on Hacker News that the project was initially “intended as a joke” and that the headline 75% figure “needs proper benchmarking before credibility” — a significant caveat for any team considering it for cost reduction.
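
The repository's actual skill file is not reproduced in this summary, but Claude Code skills are plain Markdown files with YAML frontmatter, so the mechanism is roughly the following. This is a hypothetical sketch of what such a skill could look like, not the project's wording:

```markdown
---
name: caveman
description: Respond in terse caveman-speak to cut output tokens.
---

Talk like caveman. Short sentences. Drop articles, filler, hedging.
Never compress code blocks, technical terms, file paths, or error
messages. Reproduce those verbatim, always.
```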
Critical Analysis
Claim: “~65% average output token savings across standard software engineering tasks”
- Evidence quality: vendor-sponsored (self-reported by the project author with no reproducible methodology)
- Assessment: The benchmark table on the companion site (juliusbrussee.github.io/caveman) lists 5 tasks with per-task figures ranging from 41% to 87%. These are point measurements from a single run with no variance statistics, no baseline prompt controls, and no description of how “normal” response length was established. The author himself wrote on Hacker News that “the preliminary ~75% token reduction claim needs proper benchmarking before credibility.” The claim is directionally plausible — stripping filler language will reduce tokens — but the specific figure is not independently validated.
- Counter-argument: Token reduction percentage is highly sensitive to the verbosity of the baseline response, which varies by model version, system prompt, task type, and conversation history. A 65% reduction on hand-picked tasks may not hold across a representative sample. There is no A/B testing methodology or statistical confidence interval; a sketch of what such a measurement could look like appears below.
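
A defensible version of this benchmark would use paired runs with and without the skill, repeated trials, and a reported spread rather than a single-run point estimate. In the minimal sketch below, run_task is a placeholder for whatever harness invokes Claude Code and counts output tokens; nothing here reflects the project's actual scripts.

```python
import statistics

def run_task(task: str, skill_enabled: bool) -> int:
    """Placeholder: invoke Claude Code on `task` and return the output token count."""
    raise NotImplementedError

def measure_reduction(tasks: list[str], trials: int = 5) -> None:
    # One reduction figure per task, each from averaged repeated runs,
    # rather than a single-run point measurement.
    reductions = []
    for task in tasks:
        base = statistics.mean(run_task(task, skill_enabled=False) for _ in range(trials))
        cave = statistics.mean(run_task(task, skill_enabled=True) for _ in range(trials))
        reductions.append(1 - cave / base)
    mean = statistics.mean(reductions)
    spread = statistics.stdev(reductions) if len(reductions) > 1 else 0.0
    print(f"mean output-token reduction: {mean:.0%} (±{spread:.0%} across {len(tasks)} tasks)")
```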
Claim: “Maintains 100% technical accuracy in compressed responses”
- Evidence quality: anecdotal
- Assessment: The project shows qualitative examples (e.g., a React re-render explanation compressed from 69 to 19 tokens) where the technical content appears equivalent, but no systematic accuracy evaluation exists (a sketch of a minimal check appears below). The Hacker News community raised concerns that constraining model output style may degrade reasoning quality. One commenter noted: “constraining an LLM to speak in any way other than the default way it wants to speak reduces its intelligence / reasoning capacity.” This is consistent with research on constrained decoding and instruction following.
- Counter-argument: A March 2026 paper (“Brevity Constraints Reverse Performance Hierarchies in Language Models,” arXiv:2604.00025) actually found that forcing brevity constraints on large models improved accuracy by 26 percentage points on certain benchmarks — the opposite of the degradation concern. However, that paper studied conciseness constraints at the decoding level, not style-mimicry instructions, so the mechanism differs.
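
One low-cost way to test the accuracy claim systematically would be a key-fact checklist per task, scored identically for the normal and compressed answers. The sketch below is hypothetical: the answer strings are invented illustrations in the spirit of the project's React demo, not its actual outputs, and substring matching is a crude stand-in for a rubric-based grader.

```python
def covers_key_facts(answer: str, key_facts: list[str]) -> float:
    """Fraction of required facts mentioned in the answer (crude substring check)."""
    hits = sum(1 for fact in key_facts if fact.lower() in answer.lower())
    return hits / len(key_facts)

# Illustrative inputs only; a real evaluation would use actual model outputs.
normal_answer = (
    "React re-renders a component when its state changes or when its "
    "parent passes new props; React.memo can skip re-renders."
)
caveman_answer = "Component re-render when state change or new props. memo skip."
key_facts = ["state change", "props", "re-render", "memo"]

for name, answer in [("normal", normal_answer), ("caveman", caveman_answer)]:
    print(name, covers_key_facts(answer, key_facts))
```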
Claim: “Output token reduction translates to meaningful cost savings”
- Evidence quality: anecdotal
- Assessment: This is the most contested claim in the Hacker News discussion. Multiple commenters argued that output tokens are not the primary cost driver in agentic workloads — input tokens (which accumulate across the context window in multi-turn agents) are the actual bottleneck. The Caveman Compress companion tool (which compresses CLAUDE.md files by ~45%) addresses input tokens, but the core Caveman skill only affects output. In typical Claude Code usage, output tokens represent a smaller fraction of total cost than the concatenated input context.
- Counter-argument: For developers using Claude Code interactively in short sessions (not long-running agent loops), output token reduction does translate to real latency and cost reduction. The claim is valid for this narrower use case, but not for agentic workloads where input context dominates; the sketch below illustrates the arithmetic.
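
A back-of-envelope model makes the input-domination argument concrete. The prices and session shape below are illustrative assumptions, not actual Anthropic rates or measured Claude Code behavior; the point is that when every turn re-sends a growing context, even a 65% output cut yields a much smaller end-to-end saving.

```python
# Back-of-envelope session cost where every turn re-sends the growing context.
# Prices are illustrative placeholders, NOT actual Anthropic rates.
PRICE_IN, PRICE_OUT = 3.00, 15.00  # dollars per million tokens (assumed)

def session_cost(turns: int, context0: int, out_per_turn: int) -> float:
    cost, context = 0.0, float(context0)
    for _ in range(turns):
        cost += context / 1e6 * PRICE_IN + out_per_turn / 1e6 * PRICE_OUT
        context += out_per_turn  # each output is fed back as input next turn
    return cost

baseline = session_cost(turns=30, context0=20_000, out_per_turn=1_000)
caveman = session_cost(turns=30, context0=20_000, out_per_turn=350)  # ~65% cut
print(f"baseline ${baseline:.2f} vs caveman ${caveman:.2f}")
```

Under these assumptions the end-to-end saving works out to roughly a third, well short of the headline 65%, because the fixed project context dominates the bill.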
Claim: “Caveman Compress reduces input tokens by ~45% via CLAUDE.md compression”
- Evidence quality: anecdotal
- Assessment: The companion caveman-compress tool rewrites project memory files (CLAUDE.md) into compressed caveman-speak. This directly addresses input token costs since CLAUDE.md is loaded at the start of every session. The 45% figure is self-reported with no methodology. The tool creates a CLAUDE.original.md backup, which is a responsible design choice (the backup-and-measure pattern is sketched below). The primary risk is that compressed instructions may be harder for humans to read and maintain — trading developer experience for token savings.
- Counter-argument: Project memory files vary enormously in size and information density. A 45% average compression on a CLAUDE.md may hide cases where critical nuance is lost in compression, leading to subtler agent errors that are harder to diagnose than cost overruns.
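
For concreteness, the backup-and-measure pattern the README describes might look like the following. The compress function is a placeholder (the tool's internals are not documented in this summary), and word count is only a rough proxy for tokens.

```python
from pathlib import Path
import shutil

def compress(text: str) -> str:
    """Placeholder: the real tool presumably rewrites prose via an LLM pass."""
    raise NotImplementedError

def compress_claude_md(project: Path) -> None:
    src = project / "CLAUDE.md"
    backup = project / "CLAUDE.original.md"
    if not backup.exists():  # never clobber the pristine original
        shutil.copy2(src, backup)
    original = src.read_text()
    compressed = compress(original)
    src.write_text(compressed)
    # Word count only roughly tracks tokenizer output.
    ratio = 1 - len(compressed.split()) / len(original.split())
    print(f"approximate reduction: {ratio:.0%}")
```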
Credibility Assessment
- Author background: Julius Brussee is an independent developer who also built Blueprint (spec-driven Claude development) and Revu (macOS study app). No affiliation with Anthropic or any AI research institution. No prior peer-reviewed publications.
- Publication bias: This is a self-published GitHub repository. No editorial review, peer review, or independent replication. The Hacker News thread (which generated significant attention) surfaced substantive criticisms that the README does not address.
- Verdict: low — Directionally interesting idea with a plausible mechanism, but all quantitative claims are self-reported by the author with no reproducible methodology. The author has acknowledged on Hacker News that the headline figures require proper benchmarking. Treat it as a fun experiment, not a validated engineering tool.