Fish Speech

Source: GitHub | Docs: speech.fish.audio | License: Fish Audio Research License (non-commercial)

What It Does

Fish Speech is a multilingual text-to-speech (TTS) inference system developed by Fish Audio (39 AI, Inc.). It generates natural-sounding speech in 80+ languages using a Dual-Autoregressive (Dual-AR) architecture: a slow 4B-parameter transformer handles semantic prediction, and a fast 400M-parameter transformer generates acoustic detail. The model is post-trained with Group Relative Policy Optimization (GRPO), a reinforcement learning alignment method, to give speech fine-grained prosody and emotion control.
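
The division of labour between the two transformers can be pictured as a nested loop. What follows is a minimal sketch of a generic dual-autoregressive decoding loop, not Fish Speech's actual code. The functions slow_step and fast_step are hypothetical stand-ins that return random tokens, VOCAB_SIZE is an invented value, and the idea that the fast model emits one token per codebook per frame (10 codebooks, as quoted under Key Features) is an illustrative assumption that this entry does not confirm.

```python
import random

NUM_CODEBOOKS = 10   # codebook count quoted in this entry's feature list
VOCAB_SIZE = 1024    # hypothetical token vocabulary size


def slow_step(semantic_history, rng):
    """Stand-in for the large semantic model: one token per audio frame."""
    return rng.randrange(VOCAB_SIZE)


def fast_step(semantic_token, codes_so_far, rng):
    """Stand-in for the small acoustic model: one token per codebook."""
    return rng.randrange(VOCAB_SIZE)


def generate(num_frames, seed=0):
    """Run the nested loop and count how often each stand-in model is called."""
    rng = random.Random(seed)
    semantic_history, frames = [], []
    slow_calls = fast_calls = 0
    for _ in range(num_frames):
        # Outer loop: the expensive model runs once per frame.
        semantic = slow_step(semantic_history, rng)
        semantic_history.append(semantic)
        slow_calls += 1
        # Inner loop: the cheap model fills in the frame's acoustic codes.
        codes = []
        for _ in range(NUM_CODEBOOKS):
            codes.append(fast_step(semantic, codes, rng))
            fast_calls += 1
        frames.append(codes)
    return {"frames": frames, "slow_calls": slow_calls, "fast_calls": fast_calls}


result = generate(num_frames=5)
print(result["slow_calls"], result["fast_calls"])
```

The point of the pattern is the call-count asymmetry: the large model is invoked once per frame, while the many per-codebook steps are delegated to the much smaller model.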

The system supports zero-shot voice cloning from 10–30 seconds of reference audio, multi-speaker generation, and multi-turn conversation synthesis. Fine-grained control of emotion and speaking style uses 15,000+ natural-language tags (e.g. [whisper], [excited], [angry]) embedded directly in the input text.
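
To make the inline tag format concrete, here is a small sketch that separates bracketed tags from the spoken text. It is not Fish Speech's parser: the regular expression, the lower-casing, and the function name split_tags are all assumptions made for illustration, and the exact tag grammar the model accepts is not specified in this entry.

```python
import re

# Matches bracketed tags such as [whisper]. The real tag grammar is not
# given in this entry, so this pattern is an assumption.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text):
    """Return (tags, plain_text) for a string with inline [tag] markers."""
    tags = [t.strip().lower() for t in TAG_PATTERN.findall(text)]
    # Replace each tag with a space, then collapse repeated whitespace.
    plain = " ".join(TAG_PATTERN.sub(" ", text).split())
    return tags, plain


tags, plain = split_tags("[whisper] Don't wake them. [excited] We made it!")
print(tags)
print(plain)
```

A helper like this can be useful for logging which styles a script requests or for validating input before sending it to the model; the tagged string itself is what gets passed as input text.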

Key Features

Dual-Autoregressive architecture: slow semantic AR (4B params) + fast acoustic AR (400M params) working in series
Zero-shot voice cloning from a 10–30 second reference audio sample
80+ language support with Tier 1 quality for English, Chinese, Japanese; Tier 2 for Korean, Spanish, Portuguese, Arabic, Russian, French, German
15,000+ natural-language emotion/style tags for fine-grained prosody control
Real-time factor of 0.195 on NVIDIA H200 (~100 ms time-to-first-audio with the COMPILE=1 flag)
RVQ audio codec with 10 codebooks and GFSQ implementation
SGLang integration for accelerated inference serving
Docker deployment support; WebUI and CLI interfaces available
GRPO post-training alignment against multi-dimensional reward signals
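
The RVQ (residual vector quantization) codec mentioned above can be illustrated with a toy encoder and decoder. This is a generic sketch of the technique under stated assumptions, not Fish Speech's codec: the codebooks are random rather than learned, the vector size and entry count are invented, plain nearest-neighbour lookup is used instead of GFSQ, and a zero vector is added to every codebook so that each stage can never make the reconstruction worse.

```python
import random


def sq_error(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def make_codebooks(num_stages=10, entries=16, dim=4, seed=0):
    """Random toy codebooks whose scale shrinks at each stage."""
    rng = random.Random(seed)
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        book = [[rng.uniform(-scale, scale) for _ in range(dim)]
                for _ in range(entries - 1)]
        book.append([0.0] * dim)  # zero entry: a stage may choose to add nothing
        books.append(book)
    return books


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: sq_error(book[i], residual))
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the selected entries; passing fewer indices gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for idx, book in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, book[idx])]
    return out


books = make_codebooks()
target = [0.3, -0.2, 0.1, 0.4]
codes = rvq_encode(target, books)
for k in (1, 5, 10):
    print(k, sq_error(target, rvq_decode(codes[:k], books)))
```

Each frame of audio is thus represented by one small integer per codebook, and decoding with more codebooks refines the reconstruction.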

Use Cases

Research and prototyping: Evaluating state-of-the-art TTS quality for academic or non-commercial projects where the license restriction is acceptable
Voice cloning R&D: Rapid voice adaptation from short reference audio, useful when building bespoke TTS pipelines for internal non-commercial use
Multilingual content generation: Generating voiceovers across many languages from a single model, avoiding the need to maintain separate per-language models
Emotion-rich narration: Podcasts, audiobooks, or interactive fiction systems where expressive speech with controllable emotion is required
Gateway to the Fish Audio commercial API: Evaluating the model locally before committing to the commercial API offering at fish.audio

Adoption Level Analysis

Small teams (<20 engineers): Fits for non-commercial experimentation. The GPU requirement (~17 GB VRAM for the S2 Pro model) limits deployment to teams with at least an RTX 3090/4090 or a cloud GPU. The COMPILE=1 mode needed for peak performance is Linux-only. Do not use commercially without a separate license from Fish Audio.

Medium orgs (20–200 engineers): Fits for internal research or non-commercial product prototyping. Production deployment requires dedicated GPU infrastructure and ops overhead. The non-commercial license creates legal risk if any revenue-generating product is involved — most medium orgs with commercial intent should route through the fish.audio API or negotiate a commercial license.

Enterprise (200+ engineers): Does not fit for self-hosted commercial deployment under the default license. The training data provenance is undisclosed, creating IP risk in regulated industries. Enterprises should use the commercial API or obtain a formal license agreement from 39 AI, Inc.

Alternatives

Alternative	Key Difference	Prefer when…
XTTS v2 (Coqui)	Coqui Public Model License; project abandoned after Coqui shut down in early 2024 following a funding collapse	You need a legacy model already in production; not for new projects
F5-TTS	MIT-licensed; flow matching architecture	You need a permissively licensed voice cloning model without commercial restrictions
ElevenLabs	Fully commercial, closed source, polished API	You need production-grade TTS with SLA, commercial license, and no self-hosting burden
Cartesia Sonic	Low-latency streaming TTS; commercial API	You need sub-100ms streaming latency at scale
Chatterbox (Resemble AI)	Apache 2.0 licensed voice cloning	You need an OSI-compliant commercial-use voice cloning model
Kokoro TTS	Apache 2.0; smaller model (82M params)	You need fast CPU-viable TTS with no license friction

Evidence & Sources

Notes & Caveats

License trap: The repository markets itself as “open source”, but the Fish Audio Research License explicitly prohibits commercial use. Commercial use includes hosting any API, internal business operations, and generating revenue. This is source-available, not open source by the OSI definition. Teams that deploy it commercially in production without a paid license from Fish Audio are at legal risk. Issue #531 on GitHub and a Hugging Face community discussion confirm that this is a known point of confusion.
Outputs cannot train competing models: The license forbids using model outputs to train other foundational generative AI models, which limits its use in data augmentation pipelines.
GPU requirements: Approximately 17 GB VRAM for S2 Pro inference. The COMPILE=1 fast mode requires Linux and a manual Triton installation. RTX 50-series GPUs (sm_120) have compatibility issues that require non-standard PyTorch CUDA builds.
Training data opacity: The claimed “10 million hours” of training data is unaudited. Data provenance, speaker consent, and copyright status are undisclosed, which is a risk for organisations with strict IP compliance requirements.
Coqui TTS comparison: Coqui AI (the main prior open-source TTS competitor) shut down in early 2024, which has driven interest toward Fish Speech as an alternative. However, F5-TTS and Kokoro TTS are more permissively licensed options for commercial use cases.
Benchmark self-reporting: All WER and win-rate benchmarks are reported by Fish Audio. No independent third-party laboratory reproduction of the Seed-TTS evaluation was found at time of review.
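
The hardware figures quoted in this entry translate into quick sizing arithmetic. The sketch below is a back-of-envelope aid, not a benchmark: it uses the standard definition of real-time factor (compute time divided by audio duration), takes the 0.195 and 17 GB figures from this entry as given, and adds a hypothetical 1 GB headroom margin. The RTF was reported for an H200, so other GPUs will differ.

```python
def synthesis_seconds(audio_seconds, rtf=0.195):
    """Compute time implied by a real-time factor (RTF = compute / audio)."""
    return audio_seconds * rtf


def fits_in_vram(gpu_vram_gb, required_gb=17.0, headroom_gb=1.0):
    """True if the GPU has the quoted VRAM plus an assumed safety margin."""
    return gpu_vram_gb >= required_gb + headroom_gb


# One minute of speech at the reported H200 real-time factor.
print(f"{synthesis_seconds(60):.1f} s of compute for 60 s of audio")

# A 24 GB card (e.g. RTX 3090/4090) versus a 16 GB card.
print(fits_in_vram(24), fits_in_vram(16))
```

An RTF below 1 means speech is generated faster than it plays back, which is the precondition for streaming use.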

Related

Fish Audio

HeyGen