
Train separately, merge together: Modular post-training with mixture-of-experts


Source: Allen Institute for AI (Ai2) | Authors: Jacob Morrison, Sanjay Adhikesaven, Akshita Bhagia, Matei Zaharia, Noah A. Smith, Sewon Min | Published: 2026-04-20 | Category: research | Credibility: high

Executive Summary

  • BAR (Branch-Adapt-Route) is a three-stage modular post-training approach from Ai2 that branches from a pretrained base model, adapts each branch into a domain-specific expert through the full post-training pipeline, and routes between them via a mixture-of-experts architecture at inference time (sketched after this list).
  • The method achieves an overall score of 49.1 (vs. 50.5 for full retraining and 46.7 for the BTX dense-expert baseline), demonstrating that modular training closes roughly 80% of the gap to non-modular post-training while enabling independent expert replacement with no degradation in other domains.
  • A critical finding is that shared parameters — embeddings, language-modeling head, and attention layers — must be progressively unfrozen during post-training stages; freezing them entirely (as in the pretraining-era FlexOlmo approach) produces broken models because behavioral shifts from instruction-following and RLHF require shared-layer plasticity.
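
The Route stage is described only at a high level in the post. The following is a minimal PyTorch-style sketch of the general pattern: per-domain expert FFNs inside a sparse MoE layer, with embeddings and attention shared. It is an illustration under assumptions, not Ai2's implementation; the class name BARMoELayer, the learned linear router, and the top_k parameter are all inventions for this sketch.

```python
# Minimal sketch of a Branch-Adapt-Route-style MoE layer (illustrative only).
# Each expert FFN stands in for a separately post-trained branch of the same
# base model; attention and embeddings (not shown) would be shared.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BARMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # One FFN per domain expert (e.g. math, code, general).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); send each token to its top-k experts.
        logits = self.router(x)                        # (B, S, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1) # per-token expert choice
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                # tokens assigned to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```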

Critical Analysis

Claim: “BAR achieves 49.1 overall, competitive with full retraining (50.5), while enabling independent domain updates”

  • Evidence quality: benchmark
  • Assessment: The numbers come from Ai2's own evaluation suite of 19 benchmarks across 7 categories, a reasonable breadth, but they are self-reported and lack independent replication. The 1.4-point overall gap versus full retraining is small in aggregate yet non-trivial, since it can translate into a meaningful capability deficit on individual domains. The math gain (+7.8) and code gain (+4.7) over post-training-only figures suggest genuine domain specialization.
  • Counter-argument: All benchmarks are selected and run by the same team that built BAR. The “19 benchmarks across 7 categories” set is not independently audited. Full retraining remains the performance ceiling, and the 1.4-point gap may represent a larger delta on tasks not covered by the evaluation set. Independent reproduction on third-party eval harnesses (e.g., lm-evaluation-harness from EleutherAI) is absent from this release.

Claim: “Replacing a code expert improves code performance by +16.5 points while leaving all other domains essentially unchanged”

  • Evidence quality: benchmark
  • Assessment: This is the most practically compelling claim in the paper: if true, it validates the core value proposition of modularity, namely independent upgradability without catastrophic interference. The result is directionally consistent with how sparse MoE routing theoretically works, and the routing architecture ensures that non-code inputs never activate the replaced expert (see the sketch after this list).
  • Counter-argument: The evaluation is performed on the same 19-benchmark suite used throughout the paper. It is unclear whether the “essentially unchanged” claim holds for tasks with meaningful code-math or code-tool-use overlap. The paper does not report confidence intervals or variance across multiple training seeds, which is a standard requirement for publication-quality claims about small deltas.
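
To make the isolation mechanism concrete, here is a toy check built on the BARMoELayer sketch above (equally illustrative; the expert index and seed are arbitrary). With hard top-1 routing and an untouched router, swapping one expert's weights cannot change outputs for tokens routed elsewhere, which is the mechanism behind the "essentially unchanged" claim.

```python
# Toy demonstration: replacing one expert leaves tokens routed to other
# experts bitwise identical (hard top-1 routing, router unchanged).
import torch

torch.manual_seed(0)
layer = BARMoELayer(d_model=16, d_ff=32, n_experts=4, top_k=1)
x = torch.randn(2, 8, 16)

before = layer(x)
code_expert = 2                             # hypothetical index of the code expert
assignments = layer.router(x).argmax(-1)    # top-1 expert chosen for each token

# "Swap in" a different code expert by re-randomizing its weights.
with torch.no_grad():
    for p in layer.experts[code_expert].parameters():
        p.normal_()
after = layer(x)

unchanged = assignments != code_expert      # tokens never sent to the code expert
assert torch.allclose(before[unchanged], after[unchanged])
```

Whether the same isolation holds under top-k > 1 or soft routing, which the post does not fully specify, is exactly the kind of detail independent replication on overlap-heavy tasks would pin down.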

Claim: “The prior FlexOlmo pretraining approach fails during post-training because behavioral shifts require changes to shared parameters”

  • Evidence quality: case-study
  • Assessment: This is presented as an empirical discovery: freezing all shared layers (adequate for pretraining, where each expert's feed-forward layers suffice) causes near-total failure during post-training, with naive dense model merging scoring 6.5 overall. The progressive unfreezing schedule (embeddings and LM head during SFT, all attention layers during RL) is a concrete engineering contribution grounded in that failure mode; a minimal sketch of such a schedule follows this list.
  • Counter-argument: The “6.5 overall score” baseline for naive dense merging is suspiciously catastrophic, suggesting the comparison may be against a strawman (literal weight averaging of full checkpoints rather than any principled merging). The BAR paper does not include other merging baselines such as TIES-Merging, DARE, or linear mode connectivity approaches that have shown better results than naive averaging.
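
The schedule itself is simple to express. Below is a minimal sketch, assuming parameter names contain the indicated substrings (placeholders; OLMo's actual parameter names will differ) and assuming the expert FFNs and router stay trainable throughout.

```python
# Hedged sketch of the progressive-unfreezing schedule the post describes:
# shared parameters start frozen, embeddings + LM head unfreeze for SFT,
# and attention layers additionally unfreeze for the RL stage.
import torch.nn as nn

def set_trainable(model: nn.Module, stage: str) -> None:
    for name, param in model.named_parameters():
        if "experts" in name or "router" in name:
            param.requires_grad = True   # expert branches train in every stage
        elif stage == "sft" and ("embed" in name or "lm_head" in name):
            param.requires_grad = True   # SFT: embeddings and LM head become plastic
        elif stage == "rl" and any(s in name for s in ("attn", "embed", "lm_head")):
            param.requires_grad = True   # RL: attention is additionally unfrozen
        else:
            param.requires_grad = False  # all other shared parameters stay frozen

# set_trainable(model, "sft")  # stage 1
# set_trainable(model, "rl")   # stage 2
```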

Claim: “Domain-only training data produces strong in-domain results but severely degrades general capabilities; mixing with general SFT data is essential”

  • Evidence quality: benchmark
  • Assessment: This is a well-established finding in the broader LLM post-training literature, not unique to BAR. The paper confirms a known phenomenon: domain-specific fine-tuning without general-purpose data causes capability regression on out-of-domain tasks (sometimes called “alignment tax” or “catastrophic forgetting”). The contribution here is demonstrating it holds in the modular MoE context.
  • Counter-argument: The appropriate mixing ratio between domain-specific and general SFT data is not derivable from first principles and likely varies by domain, model scale, and base-model quality. BAR does not provide a principled method for determining this ratio; practitioners would need to tune it empirically (see the sketch after this list), adding operational cost that is not discussed.
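
To make the tuning burden concrete, here is a minimal sketch of ratio-controlled mixing; domain_fraction is exactly the knob left to empirical sweeps, and every name here is hypothetical.

```python
# Illustrative data mixing: draw a fixed-size SFT set with a tunable share
# of domain-specific examples, the rest from a general SFT pool.
import random

def mix_sft_data(domain_data, general_data, domain_fraction, total, seed=0):
    rng = random.Random(seed)
    n_domain = round(total * domain_fraction)
    mixed = rng.sample(domain_data, k=min(n_domain, len(domain_data)))
    mixed += rng.sample(general_data, k=min(total - n_domain, len(general_data)))
    rng.shuffle(mixed)
    return mixed

# A practitioner would sweep domain_fraction (say 0.25, 0.5, 0.75) and keep
# the value that preserves general benchmarks while maximizing in-domain scores.
```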

Claim: “BAR enables each expert to be developed, upgraded, or replaced without touching the others”

  • Evidence quality: case-study
  • Assessment: Demonstrated for a single expert replacement (code) in one model family (OLMo 2). The claim is presented as a general capability, but the evidence is a single swap in a controlled academic setting (a sketch of the mechanical part of such a swap follows this list). Real-world independent expert development across organizations, the promise implied by the FlexOlmo lineage, introduces synchronization, compatibility, and routing-calibration challenges not addressed in this post.
  • Counter-argument: The Berkeley Function Calling Leaderboard (BFCL) score improvement from 20.3 to 46.4 (attributed to parameter unfreezing, not to expert replacement) suggests some of the headline numbers conflate different ablations. Practitioners considering BAR for multi-organization expert development will find the paper lacks guidance on versioning, router retraining triggers, and quality gating for expert updates.
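
The mechanical part of a swap is small; the hard parts are the governance questions the counter-argument raises. A hypothetical sketch (the checkpoint path, slot index, and state-dict layout are all assumptions):

```python
# Hypothetical in-place expert swap: load newly trained weights into one MoE
# slot, leaving the router, shared layers, and other experts untouched.
import torch

def swap_expert(moe_layer, slot: int, checkpoint_path: str) -> None:
    new_state = torch.load(checkpoint_path, map_location="cpu")
    moe_layer.experts[slot].load_state_dict(new_state)

# swap_expert(layer, slot=2, checkpoint_path="code_expert_v2.pt")
# Left unspecified by the post: when a swap should trigger router
# recalibration, how expert versions are tracked, and how updates are
# quality-gated before deployment.
```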

Credibility Assessment

  • Author background: Jacob Morrison (Ai2 research scientist, OLMo team), Sewon Min (Ai2 senior researcher, NLP), Noah A. Smith (University of Washington professor, long-standing Ai2 collaborator), and Matei Zaharia (UC Berkeley EECS professor, creator of Apache Spark; an unusual inclusion that suggests infrastructure-aware design intent). The team has a strong publication track record in open LLM development.
  • Publication bias: Ai2 official blog — this is vendor/institution content. The work has not yet appeared in a peer-reviewed venue at time of review (April 2026), though the OLMo 2 base work was published at COLM 2025. Code and model checkpoints are open-sourced via Hugging Face, which supports independent reproducibility in principle.
  • Verdict: high. Ai2 is a credible, non-profit, fully open AI research institution with a track record of releasing genuine artifacts (code, weights, data). The claims are benchmark-backed, and the failure modes (dense merging, frozen shared layers) are reported honestly. Independent reproduction is feasible given the open checkpoints. Key gaps: no third-party evaluation, no confidence intervals, and limited scope (one model family, one expert swap).

Entities Extracted

Entity                          Type          Catalog Entry
OLMo 2                          open-source   data/catalog/frameworks/olmo2.md
FlexOlmo                        open-source   data/catalog/frameworks/flexolmo.md
Mixture-of-Experts (MoE)        pattern       data/catalog/patterns/mixture-of-experts.md
Allen Institute for AI (Ai2)    vendor        data/catalog/vendors/allenai.md