Simple Self-Distillation (SSD)
What It Does
Simple Self-Distillation (SSD) is a post-training technique for large language models where the model is fine-tuned using only samples it generates itself — no external labels, verifiers, or teacher models required. The process samples N solutions per problem using elevated temperature and optional top-p truncation, then applies standard supervised fine-tuning (cross-entropy loss) on those samples. At inference time, the fine-tuned model is deployed with the standard evaluation decoding settings.
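As a concrete illustration, here is a minimal sketch of the two-step loop, assuming a Hugging Face causal LM. The model name, sample count, learning rate, and sequence lengths are placeholders rather than the paper's configuration; the temperature and top-p values echo the settings discussed in the caveats below.

```python
# Minimal SSD sketch: (1) self-sample N solutions per prompt at elevated
# temperature, (2) run plain SFT (cross-entropy) on those samples.
# Model name, N, lr, and lengths are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in; the paper uses 4B-30B models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompts = ["Write a Python function that checks whether a string is a palindrome."]
N = 4  # solutions sampled per problem

# Step 1: generate the training set from the model itself.
model.eval()
samples = []
for prompt in prompts:
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **enc,
            do_sample=True,
            temperature=1.6,            # elevated T_train
            top_p=0.8,                  # nucleus truncation
            num_return_sequences=N,
            max_new_tokens=256,
            pad_token_id=tok.eos_token_id,
        )
    plen = enc["input_ids"].shape[1]
    samples += [prompt + tok.decode(seq[plen:], skip_special_tokens=True) for seq in out]

# Step 2: standard next-token cross-entropy on the self-generated samples.
# (Real SFT would mask prompt tokens out of the loss; omitted for brevity.)
model.train()
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
for text in samples:
    batch = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# Step 3: deploy with standard evaluation decoding settings.
```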
The technique was introduced by Apple researchers (Zhang et al., 2026) and targets the code generation domain. The authors attribute the gains to a “precision-exploration conflict” in LLM decoding: fixed decoding temperatures are a global compromise between positions requiring high precision (syntax-constrained “locks”) and positions requiring genuine exploration (“forks”). SSD is claimed to reshape token distributions asymmetrically — suppressing distractors at precision-critical positions while preserving diversity at ambiguous positions — though this causal mechanism is not directly measured and remains contested.
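The locks/forks framing can be made tangible by measuring per-position next-token entropy over a code snippet: syntax-constrained positions show near-zero entropy, while genuinely ambiguous positions show high entropy. A sketch under that assumption follows; the 1.0-nat threshold is arbitrary, and this only visualizes the framing rather than testing the paper's causal claim.

```python
# Sketch: per-position next-token entropy as a rough proxy for "locks"
# (low entropy, precision-critical) vs "forks" (high entropy, exploratory).
# The 1.0-nat threshold is arbitrary and for display only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

ids = tok("def add(a, b):\n    return a + b", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0]                          # [seq_len, vocab]
probs = torch.softmax(logits.float(), dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # nats per position

for i in range(ids.shape[1] - 1):
    nxt = tok.decode(ids[0, i + 1])                 # token predicted at position i
    kind = "lock" if entropy[i] < 1.0 else "fork"
    print(f"next={nxt!r:12} H={entropy[i]:.2f} nats ({kind})")
```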
Key Features
- No verifier, execution environment, or teacher model required — only the model and a set of problem prompts
- Works with an elevated training temperature (T_train) and nucleus-sampling truncation (top-p) to encourage diverse samples
- Compatible with both instruct and thinking variants of models (Qwen3, Llama-3.1 tested)
- Single-round SSD; iterative application not studied
- Gains concentrate on harder problems (LiveCodeBench v6 hard quartile: +15.3pp for 30B-Instruct)
- Pass@5 gains exceed pass@1 gains, suggesting solution diversity is preserved
- Pathologically noisy training data (62% of samples containing no extractable code) still yields measurable gains; see the extraction sketch after this list
- Reported improvements: Qwen3-30B-Instruct +12.9pp, Qwen3-4B-Instruct +7.5pp, Llama-3.1-8B +3.5pp on LiveCodeBench v6
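To make the noisy-data point above concrete, a simple extraction check over self-generated samples can reproduce the "no extractable code" measurement. A sketch follows; the fenced-block regex is an assumption, since the paper's exact extraction rules are not given here.

```python
# Sketch: measure what fraction of self-generated samples contain extractable
# code, as in the noisy-data observation above. The regex is an assumption;
# the paper's extraction procedure may differ.
import re

FENCE = re.compile(r"`{3}(?:python)?\n(.*?)`{3}", re.DOTALL)

def extract_code(sample: str) -> str | None:
    """Return the first fenced code block, or None if the sample has none."""
    m = FENCE.search(sample)
    return m.group(1).strip() if m else None

TICKS = "`" * 3  # build fences programmatically to keep this example readable
samples = [
    f"Here is a solution:\n{TICKS}python\nprint('hi')\n{TICKS}",
    "I think the answer involves dynamic programming...",  # no code at all
]
no_code = sum(extract_code(s) is None for s in samples)
print(f"{no_code / len(samples):.0%} of samples contain no extractable code")
```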
Use Cases
- Post-training improvement on code generation when execution infrastructure is unavailable or undesirable
- Low-cost alternative to RLHF or execution-based reinforcement learning for domain-specific fine-tuning
- Improving instruct models before deployment, where instruction-following quality matters more than general coding
Adoption Level Analysis
Small teams (<20 engineers): May not fit. SSD requires GPU fine-tuning infrastructure (the paper uses 8×B200 GPUs with Megatron-LM) and assumes access to training-scale compute; smaller teams are unlikely to have fine-tuning pipelines for 30B-parameter models. At the 4B–8B scale it may be feasible with rented cloud GPUs, but the ops burden is non-trivial.
Medium orgs (20–200 engineers): Partial fit. Teams with an existing MLOps or model-serving function could trial SSD on a smaller model (4B–8B). The technique is simple in principle (sample + SFT), but a reproducible pipeline requires tuning temperature, truncation, sample count, and fine-tuning hyperparameters (a hypothetical config sketch follows). Apple's guidance for independent reproduction is limited to a code repository, and documented failure modes are minimal.
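As a rough illustration of the knobs such a pipeline has to pin down, here is a hypothetical trial configuration. Every field name and value is illustrative; only the temperature and top-p echo settings cited in the caveats below.

```python
# Hypothetical SSD trial config for a 4B-8B model. All names/values are
# illustrative; only temperature and top-p echo settings cited in the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class SSDTrialConfig:
    base_model: str = "Qwen/Qwen3-4B-Instruct"  # assumed checkpoint id
    samples_per_problem: int = 8     # N self-generated solutions per prompt
    train_temperature: float = 1.6   # elevated T_train for self-sampling
    top_p: float = 0.8               # nucleus truncation during sampling
    max_new_tokens: int = 2048
    sft_learning_rate: float = 1e-5
    sft_epochs: int = 1
    seed: int = 0                    # pin for reproducibility

print(SSDTrialConfig())
```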
Enterprise (200+ engineers): Potential fit for ML platform teams that already run post-training pipelines. SSD requires significantly less infrastructure than RLHF or execution-based RL, which is an operational advantage. However, the technique has not been tested beyond 30B parameters and has not been peer-reviewed as of April 2026.
Alternatives
| Alternative | Key Difference | Prefer when… |
|---|---|---|
| Execution-based RL (e.g., GRPO) | Uses execution feedback as reward signal | Ground-truth correctness verification is possible and infrastructure is available |
| Rejection Sampling Fine-Tuning (RFT) | Filters self-generated samples by correctness | Execution environment available; want verified training data |
| Knowledge Distillation from teacher | Uses a stronger teacher model’s outputs | A stronger teacher model exists and API access is available |
| Temperature-only decoding tuning | No training required, inference-time only | Compute budget for fine-tuning is unavailable; gains are smaller (~1.5–3pp) |
Evidence & Sources
- arXiv preprint: Embarrassingly Simple Self-Distillation Improves Code Generation (Zhang et al., 2026)
- GitHub: apple/ml-ssd
- Hacker News discussion with community criticism
- Why Does Self-Distillation Sometimes Degrade Reasoning (2025)
- Self-Distilled Reasoner: On-Policy Self-Distillation (2025)
Notes & Caveats
- Missing baseline ablation: The paper does not directly compare against sampling the base model with the same temperature and truncation settings used for SSD training (e.g., T=1.6, top-p=0.8 at inference time, without fine-tuning). This is the most important missing experiment: without it, the training contribution to the gain cannot be cleanly isolated from the inference-time decoding contribution. A sketch of this comparison follows these notes.
- Benchmark focus: All primary results are on LiveCodeBench (competitive programming problems from LeetCode, AtCoder, Codeforces). Out-of-domain generalization is tested on math and code understanding tasks but gains are smaller and less consistent.
- Single-round only: The paper does not study iterative SSD rounds. Prior work (Shumailov et al., 2024) documents model collapse when iteratively training on own outputs; whether SSD avoids this for >1 round is unknown.
- Known regression risk for reasoning: Independent 2025 research (arXiv:2603.24472) found self-distillation can degrade mathematical reasoning by suppressing epistemic verbalization, with performance drops of up to 40% observed across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct. Whether SSD avoids this specific failure mode in the code domain is not addressed.
- Non-commercial license: The Apple ml-ssd code is released under Apple's Sample Code License, which restricts commercial use. Teams considering production deployment must reimplement the technique independently rather than building on the Apple reference implementation.
- No peer review as of April 2026: This is an arXiv preprint. The mechanism claims (locks/forks framing) should be treated as a hypothesis pending peer review.
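The missing comparison in the first caveat above is a small 2×2 grid: {base, SSD-tuned} checkpoints crossed with {evaluation-default, SSD-training} decoding settings. A sketch of that grid follows; evaluate_pass_at_1 is a placeholder for a real benchmark harness, and the eval-default decoding values are assumptions.

```python
# Sketch of the missing ablation: 2x2 grid over checkpoints x decoding settings.
# `evaluate_pass_at_1` stands in for a real LiveCodeBench-style harness.
from itertools import product

checkpoints = {"base": "path/to/base", "ssd": "path/to/ssd-tuned"}  # placeholders
decoding = {
    "eval_default": {"temperature": 0.7, "top_p": 0.95},  # assumed eval settings
    "ssd_training": {"temperature": 1.6, "top_p": 0.8},   # SSD sampling settings
}

def evaluate_pass_at_1(checkpoint: str, temperature: float, top_p: float) -> float:
    """Placeholder: wire up a real benchmark harness here; returns a dummy score."""
    return 0.0

for (ck, path), (dec, kw) in product(checkpoints.items(), decoding.items()):
    score = evaluate_pass_at_1(path, **kw)
    print(f"{ck:4} + {dec:12}: pass@1 = {score:.3f}")

# If base + ssd_training recovers much of the reported gain, the decoding
# settings, not the fine-tuning, would explain the improvement.
```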