Fish Speech

★ New
assess
AI / ML · open-source · Fish Audio Research License (non-commercial; commercial use requires separate agreement) · freemium

At a Glance

Open-source multilingual text-to-speech system by Fish Audio that uses a Dual-Autoregressive architecture and reinforcement learning alignment. It reports top-tier (self-published) benchmark scores across 80+ languages and supports voice cloning from short reference audio.

Type: open-source
Pricing: freemium
License: Fish Audio Research License (non-commercial)
Adoption fit: small, medium

Source: GitHub | Docs: speech.fish.audio | License: Fish Audio Research License (non-commercial)

What It Does

Fish Speech is a multilingual text-to-speech inference system developed by Fish Audio (39 AI, Inc.). It generates natural-sounding speech across 80+ languages using a Dual-Autoregressive (Dual-AR) architecture — a slow 4B-parameter transformer for semantic prediction coupled with a fast 400M-parameter transformer for acoustic detail generation. The architecture is post-trained with Group Relative Policy Optimization (GRPO) reinforcement learning alignment, producing speech with fine-grained prosody and emotion control.

The system supports zero-shot voice cloning from 10–30 seconds of reference audio, multi-speaker generation, and multi-turn conversation synthesis. Fine-grained control of emotion and speaking style uses 15,000+ natural-language tags (e.g. [whisper], [excited], [angry]) embedded directly in the input text.
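
Because the tags sit inline in the input text, it can help to check which tags a script contains before sending it for synthesis. The following is a minimal sketch, assuming tags are single bracketed tokens as in the examples above; the regular expression and the helper names `extract_tags` and `strip_tags` are illustrative and not part of Fish Speech.

```python
import re

# Matches bracketed inline tags such as [whisper] or [excited].
# The exact tag grammar accepted by Fish Speech is an assumption here.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list[str]:
    """Return the tag names found in the text, in order of appearance."""
    return [match.group(1).strip() for match in TAG_PATTERN.finditer(text)]


def strip_tags(text: str) -> str:
    """Remove the tags and collapse the leftover whitespace."""
    return " ".join(TAG_PATTERN.sub(" ", text).split())


line = "[whisper] Don't wake the baby. [excited] We won the grant!"
print(extract_tags(line))  # tag names only
print(strip_tags(line))    # plain text, e.g. for subtitles or word counts
```

A helper like this is mainly useful for linting scripts (for example, flagging tags a team has not agreed to use) rather than for anything the model itself requires.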

Key Features

  • Dual-Autoregressive architecture: slow semantic AR (4B params) + fast acoustic AR (400M params) working in series
  • Zero-shot voice cloning from 10–30 second reference audio sample
  • 80+ language support with Tier 1 quality for English, Chinese, Japanese; Tier 2 for Korean, Spanish, Portuguese, Arabic, Russian, French, German
  • 15,000+ natural-language emotion/style tags for fine-grained prosody control
  • Real-time factor of 0.195 on NVIDIA H200 (~100ms time-to-first-audio with COMPILE=1 flag)
  • RVQ audio codec with 10 codebooks and GFSQ implementation
  • SGLang integration for accelerated inference serving
  • Docker deployment support; WebUI and CLI interfaces available
  • GRPO post-training alignment for multi-dimensional reward signals
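
The real-time factor in the list above can be turned into rough wall-clock estimates. The sketch below assumes the conventional definition of RTF (synthesis time divided by duration of the generated audio), which the source does not spell out. The function names are illustrative, and real figures depend on hardware, batch size and text length.

```python
# Real-time factor reported for an NVIDIA H200 in the feature list above.
REPORTED_RTF = 0.195


def synthesis_seconds(audio_seconds: float, rtf: float = REPORTED_RTF) -> float:
    """Estimated compute time needed to generate `audio_seconds` of speech."""
    return audio_seconds * rtf


def speedup_over_real_time(rtf: float = REPORTED_RTF) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


one_minute = synthesis_seconds(60.0)
print(f"60 s of audio needs roughly {one_minute:.1f} s of compute")
print(f"about {speedup_over_real_time():.2f}x faster than real time")
```

Note that RTF describes throughput, not latency: the ~100 ms time-to-first-audio figure is a separate measurement.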

Use Cases

  • Research and prototyping: Evaluating state-of-the-art TTS quality for academic or non-commercial projects where the license restriction is acceptable
  • Voice cloning R&D: Rapid voice adaptation from short reference audio, useful when building bespoke TTS pipelines for internal non-commercial use
  • Multilingual content generation: Generating voiceovers across many languages from a single model, avoiding the need to maintain separate per-language models
  • Emotion-rich narration: Podcasts, audiobooks, or interactive fiction systems where expressive speech with controllable emotion is required
  • Gateway to Fish Audio commercial API: Evaluating locally before committing to the commercial API offering at fish.audio

Adoption Level Analysis

Small teams (<20 engineers): Fits for non-commercial experimentation. The GPU requirement (~17 GB VRAM for S2 Pro) limits deployment to teams with at least an RTX 3090/4090 or a cloud GPU. The COMPILE=1 fast mode, needed for peak performance, is Linux-only. Do not use commercially without a separate license from Fish Audio.

Medium orgs (20–200 engineers): Fits for internal research or non-commercial product prototyping. Production deployment requires dedicated GPU infrastructure and ops overhead. The non-commercial license creates legal risk if any revenue-generating product is involved — most medium orgs with commercial intent should route through the fish.audio API or negotiate a commercial license.

Enterprise (200+ engineers): Does not fit for self-hosted commercial deployment under the default license. The training data provenance is undisclosed, creating IP risk in regulated industries. Enterprises should use the commercial API or obtain a formal license agreement from 39 AI, Inc.
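
The hardware side of this analysis reduces to a simple capacity check. The sketch below compares nominal GPU memory against the approximate 17 GB figure quoted above. The GPU table, the `headroom_gb` parameter and the function name are illustrative assumptions, and actual usage will vary with batch size, sequence length and precision.

```python
# Approximate VRAM needed for S2 Pro inference, as quoted in this review.
REQUIRED_VRAM_GB = 17.0

# Nominal memory sizes of a few common GPUs (illustrative, not exhaustive).
GPU_VRAM_GB = {
    "RTX 4070": 12,
    "RTX 3090": 24,
    "RTX 4090": 24,
    "H200": 141,
}


def fits_in_vram(gpu: str, headroom_gb: float = 0.0) -> bool:
    """True if the GPU's nominal memory covers the requirement plus headroom."""
    return GPU_VRAM_GB[gpu] >= REQUIRED_VRAM_GB + headroom_gb


candidates = [gpu for gpu in GPU_VRAM_GB if fits_in_vram(gpu, headroom_gb=2.0)]
print(candidates)
```

Passing a non-zero `headroom_gb` is a cautious choice, since the quoted requirement is approximate and other processes may also hold GPU memory.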

Alternatives

| Alternative | Key difference | Prefer when… |
| --- | --- | --- |
| XTTS v2 (Coqui) | Coqui Public Model License; project abandoned after Coqui shut down in early 2024 following a funding collapse | You need a legacy model already in production; not for new projects |
| F5-TTS | MIT-licensed; flow matching architecture | You need a permissively licensed voice cloning model without commercial restrictions |
| ElevenLabs | Fully commercial, closed source, polished API | You need production-grade TTS with SLA, commercial license, and no self-hosting burden |
| Cartesia Sonic | Low-latency streaming TTS; commercial API | You need sub-100ms streaming latency at scale |
| Chatterbox (Resemble AI) | Apache 2.0 licensed voice cloning | You need an OSI-compliant commercial-use voice cloning model |
| Kokoro TTS | Apache 2.0; smaller model (82M params) | You need fast CPU-viable TTS with no license friction |

Notes & Caveats

  • License trap: The repository markets itself as “open source”, but the Fish Audio Research License explicitly prohibits commercial use, which it defines to include hosting any API, internal business operations, and generating revenue. The project is therefore source-available, not open source by the OSI definition. Teams that deploy it commercially in production without a paid license from Fish Audio are at legal risk. Issue #531 on GitHub and a Hugging Face community discussion confirm this is a known point of confusion.
  • Outputs cannot train competing models: The license forbids using model outputs to train other foundational generative AI models, which limits its use in data augmentation pipelines.
  • GPU requirements: Approximately 17 GB VRAM for S2 Pro inference. The COMPILE=1 fast mode requires Linux and a manual Triton installation. RTX 50-series GPUs (sm_120) have compatibility issues that require non-standard PyTorch CUDA builds.
  • Training data opacity: The “10 million hours” claim is unaudited. Data provenance, speaker consent, and copyright status are undisclosed — a risk for organisations with strict IP compliance requirements.
  • Coqui TTS comparison: Coqui AI (the main prior open-source TTS competitor) shut down in early 2024, which has driven interest toward Fish Speech as an alternative. However, F5-TTS and Kokoro TTS are more permissively licensed options for commercial use cases.
  • Benchmark self-reporting: All WER and win-rate benchmarks are reported by Fish Audio. No independent third-party laboratory reproduction of the Seed-TTS evaluation was found at time of review.