Fish Audio S2 Pro TTS
Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.
Model: fishaudio/s2-pro
Output: 44.1 kHz WAV/PCM mono
VRAM: ≥24GB for inference, A800/H200 recommended
Technical report: arXiv 2603.08823 | Architecture
Installation
See references/install.md. Quick summary:
conda create -n fish-speech python=3.12 && conda activate fish-speech
pip install -e .[cu129] # CUDA 12.9
# or: uv sync --python 3.12 --extra cu129
# minimal: pip install fish-speech
apt install portaudio19-dev libsox-dev ffmpeg # system dependencies
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro
Server Deployment
vLLM-Omni (recommended, OpenAI compatible):
pip install fish-speech
vllm serve fishaudio/s2-pro --omni --port 8091
# Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch
SGLang-Omni (high-performance streaming):
sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
# RTF 0.195, TTFA ~100ms, throughput 3000+ t/s
Docker:
docker compose --profile webui up # Port 7860
COMPILE=1 docker compose --profile webui up # ~10x speedup
Native API Server:
python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080
Raw CLI Inference (Three Steps)
# 1. Extract VQ tokens
python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
# 2. Generate semantic tokens
python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
# 3. Decode to audio
python fish_speech/models/dac/inference.py -i "codes_0.npy"
API Calls
cURL
# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello."}' --output out.wav
# Voice cloning (vLLM)
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav
# Streaming PCM
curl -N -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' --no-buffer | play -t raw -r 44100 -e signed -b 16 -c 1 -
# Batch
curl -X POST http://localhost:8091/v1/audio/speech/batch \
  -H "Content-Type: application/json" \
  -d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'
Python
import requests

resp = requests.post("http://localhost:8091/v1/audio/speech", json={
    "input": "Hello.",
    "voice": "default",
    "ref_audio": "https://...",
    "ref_text": "Reference text"
})
resp.raise_for_status()  # fail loudly instead of writing an error body to out.wav
with open("out.wav", "wb") as f:
    f.write(resp.content)

# OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
client.audio.speech.create(model="fishaudio/s2-pro", voice="default", input="Hello.").stream_to_file("out.wav")
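When streaming from Python instead of piping into play, the raw PCM bytes need a WAV header before most players will open them. The sketch below wraps streamed chunks with the standard-library wave module. It assumes the stream is 16-bit signed mono at 44.1 kHz, which matches the play flags in the cURL example; write_pcm_as_wav is a hypothetical helper name, not part of fish-speech.

```python
import wave
from typing import Iterable

SAMPLE_RATE = 44100  # 44.1 kHz output, per the model summary
SAMPLE_WIDTH = 2     # bytes per sample; assumes 16-bit signed PCM as in the `play` flags
CHANNELS = 1         # mono


def write_pcm_as_wav(chunks: Iterable[bytes], path: str) -> int:
    """Write raw PCM chunks to a WAV file and return the number of frames written."""
    total_bytes = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                wav.writeframes(chunk)
                total_bytes += len(chunk)
    return total_bytes // (SAMPLE_WIDTH * CHANNELS)


# Usage against a running server (not executed here):
# import requests
# body = {"input": "Streaming.", "stream": True, "response_format": "pcm"}
# with requests.post("http://localhost:8091/v1/audio/speech", json=body, stream=True) as resp:
#     resp.raise_for_status()
#     write_pcm_as_wav(resp.iter_content(chunk_size=4096), "streamed.wav")
```

This writes the file only as chunks arrive, so it does not give low-latency playback; for that, feed the chunks to an audio output library instead.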
SGLang format: "references": [{"audio_path": "...", "text": "..."}]
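Because the two servers expect the reference voice in different fields, a small helper can keep client code backend-agnostic. This is a minimal sketch: build_clone_payload is a hypothetical name, and it assumes the SGLang server takes the text under the same "input" key as vLLM, which this document does not confirm.

```python
from typing import Any, Dict


def build_clone_payload(text: str, ref_audio: str, ref_text: str,
                        backend: str = "vllm") -> Dict[str, Any]:
    """Build a voice-cloning request body for the chosen backend."""
    # Assumption: both backends read the text to synthesize from "input".
    payload: Dict[str, Any] = {"input": text}
    if backend == "vllm":
        # vLLM-Omni style: flat ref_audio / ref_text fields.
        payload["ref_audio"] = ref_audio
        payload["ref_text"] = ref_text
    elif backend == "sglang":
        # SGLang style: a list of reference objects.
        payload["references"] = [{"audio_path": ref_audio, "text": ref_text}]
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return payload
```

The result can be passed directly as the json argument of requests.post.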
Request Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string | Required | Text to synthesize |
| voice | string | "default" | Voice |
| response_format | string | "wav" | wav/mp3/flac/pcm/aac/opus |
| speed | float | 1.0 | Speech speed (0.25-4.0) |
| stream | bool | false | Streaming (requires response_format="pcm") |
| ref_audio | string | null | Reference audio URL/base64/file:// |
| ref_text | string | null | Reference audio transcription |
| max_new_tokens | int | 2048 | Max generation tokens |
| temperature | float | null | Sampling temperature |
| top_p | float | null | Nucleus sampling |
| top_k | int | null | Top-K |
| repetition_penalty | float | null | Repetition penalty |
| seed | int | null | Random seed |
Emotion Tags
Embed [tag] anywhere in the text; 15000+ free-form tags are supported:
[excited]Today is a great day![pause]
[whisper in small voice]But there's a secret…
[professional broadcast tone]Welcome.
Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]
Full reference: references/emotion-tags.md
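When scripts are generated programmatically, it can help to list the tags a text uses or to recover a plain transcript (for example, as ref_text or for logging). The sketch below assumes tags are non-nested square-bracket spans, as in the examples above; the regular expression and both helper names are illustrative and not part of fish-speech.

```python
import re
from typing import List

# Assumption: a tag is any non-empty run of characters between [ and ] with no nesting.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> List[str]:
    """Return the tag names in order of appearance, without brackets."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove all tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()
```

Note that any literal bracketed text in the script would also be treated as a tag by this pattern.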
Multi-Speaker
<|speaker:0|>Hello, welcome.
<|speaker:1|>Thank you, glad to be here.
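A dialogue held as (speaker index, text) pairs can be turned into this tagged form with a few lines. This is a sketch: format_dialogue is a hypothetical helper, and joining turns with a newline is an assumption, since the document does not specify a separator.

```python
from typing import Iterable, Tuple


def format_dialogue(turns: Iterable[Tuple[int, str]], separator: str = "\n") -> str:
    """Prefix each turn with its <|speaker:N|> tag and join the turns."""
    parts = []
    for speaker, text in turns:
        if speaker < 0:
            raise ValueError(f"speaker index must be non-negative, got {speaker}")
        parts.append(f"<|speaker:{speaker}|>{text}")
    # Assumption: turns are separated by a newline; pass separator=" " if needed.
    return separator.join(parts)


script = format_dialogue([
    (0, "Hello, welcome."),
    (1, "Thank you, glad to be here."),
])
```

The resulting string goes into the input field of the request like any other text.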
LoRA Fine-tuning
⚠️ Not recommended for models after RL. Only fine-tune Slow AR:
# Preparation: data/SPK1/.mp3 + .lab
python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
python tools/llama/build_dataset.py --input data --output data/protos
python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/
See references/finetune.md
Important Notes
Voice cloning: Use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription.
Without reference audio, voice tends to sound