
Fish Speech — SOTA Open Source TTS

Fish Audio (fishaudio) | April 7, 2026 | product-announcement | medium credibility

Source: GitHub — fishaudio/fish-speech | Author: Fish Audio | Published: 2024-11-01 (ongoing) | Category: product-announcement | Credibility: medium

Executive Summary

  • Fish Speech is an open-source multilingual TTS system by Fish Audio (39 AI, Inc.) combining a Dual-Autoregressive (Dual-AR) architecture with GRPO reinforcement learning alignment, claiming top-of-leaderboard scores on the Seed-TTS and EmergentTTS benchmarks as of early 2026.
  • The model supports 80+ languages, zero-shot voice cloning from 10–30 seconds of reference audio, fine-grained emotion control via natural-language tags, and a real-time factor of 0.195 on an NVIDIA H200.
  • The license is a custom “Fish Audio Research License” that explicitly prohibits commercial use without a separate paid agreement — meaning it is not truly open-source despite being hosted publicly on GitHub.

Critical Analysis

Claim: “Best overall WER on Seed-TTS Eval (0.54% Chinese, 0.99% English)”

  • Evidence quality: vendor-sponsored
  • Assessment: The benchmark figures appear in the repo README and the accompanying arXiv paper (arXiv:2411.01156) and are also echoed in a separate technical report (arXiv:2603.08823v1). The Seed-TTS evaluation framework is published and reproducible, which gives it more credibility than a purely self-reported number. Independent commentary on Hugging Face confirms strong performance. However, the benchmarks were run and reported by Fish Audio itself with no independent third-party reproduction of these specific results found at time of review.
  • Counter-argument: Seed-TTS Eval is primarily a word-error-rate metric and does not capture perceptual naturalness, emotional nuance, or prosodic consistency across long-form generation. A 0.54% WER is impressive but can coexist with audible artifacts, unnatural rhythm, or failure modes on low-resource languages in the “80+” set. The TTS-Arena leaderboard (community preference votes) is a more practical signal, and Fish Speech S1 topped it in October 2025, which is modestly corroborating. WER also depends heavily on the ASR model used for evaluation, creating methodological variability.
  • References:
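The headline numbers here are word error rates. As a reminder of what that metric does and does not capture: WER is the word-level edit distance between an ASR transcript of the synthesized audio and the reference text, divided by the reference word count. A minimal sketch in plain Python (this is the standard definition, not the Seed-TTS Eval harness itself):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word in four -> 25% WER. By the same arithmetic,
# the claimed 0.54% WER is roughly one word error per ~185 words.
print(wer("the cat sat down", "the cat sat town"))  # 0.25
```

Because the hypothesis side is an ASR transcript rather than ground truth, the ASR model's own errors fold directly into the score, which is the methodological variability the counter-argument below notes.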

Claim: “Real-time factor of 0.195 on NVIDIA H200 — approximately 100ms time-to-first-audio”

  • Evidence quality: vendor-sponsored
  • Assessment: The H200 is a $25,000+ top-tier data center GPU. The 0.195 RTF is plausible for a well-optimised model on high-end hardware but is not representative of typical deployment scenarios. Community reports indicate VRAM requirements of ~17 GB for S2 Pro inference, making consumer GPU deployment (RTX 3090 or lower) significantly slower. Newer GPU architectures (RTX 5000 series, sm_120) require non-standard PyTorch CUDA builds and are unsupported by default.
  • Counter-argument: The 100ms latency figure is likely achievable only with the COMPILE=1 flag (not supported on Windows/macOS) and top-tier data center hardware. Real-world latency for teams without data center GPU access will be materially higher. No independent latency benchmark on representative hardware (A100, RTX 4090) was found.
  • References:
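An RTF figure compresses two distinct properties: throughput and time-to-first-audio. A back-of-envelope sketch of what 0.195 implies (plain arithmetic, not Fish Audio's benchmark code; the H200 number is the README's claim, the slowdown factor is a hypothetical illustration):

```python
def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: wall-clock synthesis time / audio duration.
    Values below 1.0 mean faster than real time."""
    return synthesis_seconds / audio_seconds

# Claimed H200 figure: 10 s of audio synthesized in ~1.95 s.
claimed = rtf(1.95, 10.0)        # ~0.195, matching the README

# Hypothetical: a GPU 4x slower on this workload. Still sub-real-time
# for batch synthesis, but the ~100 ms time-to-first-audio claim would
# not carry over to interactive/streaming use.
slower = rtf(4 * 1.95, 10.0)     # ~0.78
print(claimed, slower)
```

The point of the illustration: even a substantial slowdown keeps batch synthesis viable, but the marketing-friendly latency number is specific to the benchmarked hardware and compile path.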

Claim: “Over 10 million hours of training data covering 80+ languages”

  • Evidence quality: vendor-sponsored
  • Assessment: 10 million hours is an extraordinary figure (for context, Whisper large was trained on 680k hours). Fish Audio operates a commercial voice marketplace that could plausibly aggregate large volumes of user-submitted audio. However, no independent audit of this claim exists, and the training data provenance is undisclosed. The quality and coverage across the “70+ additional languages” beyond the Tier 1/Tier 2 set is unverified.
  • Counter-argument: Quantity of training data does not guarantee quality, speaker diversity, or absence of copyrighted material. The lack of a data card or independent dataset audit is a notable gap for enterprise adoption, especially in regulated industries. Models trained on marketplace-sourced audio carry potential rights and consent ambiguity.
  • References:

Claim: “SOTA across both open-source and closed-source TTS systems”

  • Evidence quality: vendor-sponsored
  • Assessment: The EmergentTTS win rate of 81.88% vs unnamed competitors and the Seed-TTS leaderboard position are cited. The TTS-Arena community leaderboard — a more transparent peer comparison — did place Fish Speech S1 at the top in October 2025. Competitors such as Qwen3-TTS, MiniMax Speech-02, and ElevenLabs are not independently benchmarked by third parties in a unified controlled comparison.
  • Counter-argument: “SOTA” in TTS is a rapidly moving target. ElevenLabs, PlayHT, and Cartesia operate closed commercial products with dedicated research teams. Comparison with XTTS v2 (Coqui, now abandoned) is not an apples-to-apples quality benchmark. The project also lacks a standardised perceptual quality benchmark (e.g. MUSHRA) with human evaluators that is independently reproducible.
  • References:

Claim: “Apache 2.0 licensed / open source”

  • Evidence quality: vendor-sponsored
  • Assessment: This claim is misleading. The repository README markets the project as open-source, but the actual LICENSE file in the repository (Fish Audio Research License, last updated March 7, 2026) explicitly prohibits commercial use without a separate written agreement from 39 AI, Inc. Commercial use includes “hosting services or APIs”, “internal business operations”, and “generating revenue (directly or indirectly)”. Outputs cannot be used to train other generative AI models. This is source-available software, not open source under the OSI definition.
  • Counter-argument: Hugging Face community discussions (fishaudio/fish-speech-1.5 #11) flag this explicitly: “That is not free software, as you forbid commercial use.” Commercial licensing is available via business@fish.audio but requires a separate paid agreement. Teams evaluating this as a drop-in Apache 2.0 alternative to ElevenLabs are at legal risk.
  • References:

Credibility Assessment

  • Author background: Fish Audio is the commercial product arm of 39 AI, Inc., operating a voice marketplace at fish.audio. The organisation is a for-profit company using the open-source repository as a research showcase and lead-generation channel for its commercial API. The repo has 29k+ stars, suggesting broad community interest, but the article/README is vendor-written self-promotion.
  • Publication bias: Vendor-authored GitHub README and associated commercial documentation. The accompanying arXiv paper (arXiv:2411.01156) adds some academic framing but is also authored by the Fish Audio team without external peer review at the time of this article review.
  • Verdict: medium — the benchmarks appear largely legitimate and the model has genuine independent endorsement through TTS-Arena rankings, but the “open source” claim is materially false and all performance claims originate from the vendor. Real-world deployment complexity and license restrictions are significant factors that the README minimises.