Fish Audio S2 Pro TTS


by @openlark (OpenLark)·MIT-0

Productivity tools


License

MIT-0


Free to use, modify, and redistribute; no attribution required.


Runtime dependencies

No special dependencies

Install command


Official: npx clawhub@latest install fish-speech

Mirror (accelerated): npx clawhub@latest install fish-speech --registry https://cn.longxiaskill.com


Skill documentation

Fish Audio S2 Pro TTS

Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.

Model: fishaudio/s2-pro
Output: 44.1 kHz WAV/PCM, mono
VRAM: ≥24 GB for inference; A800/H200 recommended
Technical report: arXiv 2603.08823 | Architecture

Installation

See references/install.md. Quick summary:

    conda create -n fish-speech python=3.12 && conda activate fish-speech
    pip install -e .[cu129]   # CUDA 12.9
    # or: uv sync --python 3.12 --extra cu129
    # minimal: pip install fish-speech

    apt install portaudio19-dev libsox-dev ffmpeg   # system dependencies
    hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro

Server Deployment

vLLM-Omni (recommended, OpenAI compatible):

    pip install fish-speech
    vllm serve fishaudio/s2-pro --omni --port 8091
    # Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch

SGLang-Omni (high-performance streaming):

    sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
    # RTF 0.195, TTFA ~100ms, throughput 3000+ t/s
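The RTF quoted above is a real-time factor: synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. A rough sketch of what 0.195 implies follows. It assumes the factor stays constant across utterance lengths, which is a simplification, and it does not model the ~100 ms time to first audio. The function name is illustrative, not part of any Fish Audio or SGLang API.

```python
def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.195) -> float:
    """Estimate wall-clock synthesis time from a real-time factor (RTF)."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    # RTF = synthesis_time / audio_duration, so synthesis_time = RTF * audio_duration.
    return audio_seconds * rtf


# Print the estimate for a few clip lengths (values are estimates, not benchmarks).
for seconds in (5, 30, 60):
    print(f"{seconds:>3} s of audio -> about {estimated_synthesis_seconds(seconds):.2f} s to synthesize")
```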

Docker:

    docker compose --profile webui up             # Port 7860
    COMPILE=1 docker compose --profile webui up   # ~10x speedup

Native API Server:

    python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080

Raw CLI Inference (Three Steps)

    # 1. Extract VQ tokens
    python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
    # 2. Generate semantic tokens
    python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
    # 3. Decode to audio
    python fish_speech/models/dac/inference.py -i "codes_0.npy"
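The three steps above can be assembled programmatically, which helps when scripting batch runs. The sketch below only builds the argument lists and prints them; it does not run anything. It assumes the script paths and flags are exactly those shown in this section, and that the intermediate files keep the names used here (fake.npy, codes_0.npy). The helper name three_step_commands is hypothetical.

```python
import shlex
from typing import List

CODEC = "checkpoints/s2-pro/codec.pth"
DAC_SCRIPT = "fish_speech/models/dac/inference.py"
T2S_SCRIPT = "fish_speech/models/text2semantic/inference.py"


def three_step_commands(ref_wav: str, text: str, prompt_text: str,
                        prompt_tokens: str = "fake.npy",
                        codes: str = "codes_0.npy") -> List[List[str]]:
    """Return the argument lists for the three CLI steps, in order."""
    return [
        # 1. Extract VQ tokens from the reference audio.
        ["python", DAC_SCRIPT, "-i", ref_wav, "--checkpoint-path", CODEC],
        # 2. Generate semantic tokens for the target text.
        ["python", T2S_SCRIPT, "--text", text,
         "--prompt-text", prompt_text, "--prompt-tokens", prompt_tokens],
        # 3. Decode the generated codes to audio.
        ["python", DAC_SCRIPT, "-i", codes],
    ]


for cmd in three_step_commands("ref.wav", "Text", "Reference text"):
    print(shlex.join(cmd))
```

To execute the steps, each list could be passed to subprocess.run(cmd, check=True) from the repository root, so that a failing step stops the pipeline.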

API Calls

cURL

    # Basic TTS
    curl -X POST http://localhost:8091/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{"input": "Hello."}' --output out.wav

    # Voice cloning (vLLM)
    curl -X POST http://localhost:8091/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav

    # Streaming PCM
    curl -N -X POST http://localhost:8091/v1/audio/speech \
      -H "Content-Type: application/json" \
      -d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' --no-buffer | play -t raw -r 44100 -e signed -b 16 -c 1 -

    # Batch
    curl -X POST http://localhost:8091/v1/audio/speech/batch \
      -H "Content-Type: application/json" \
      -d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'
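The streaming example pipes headerless PCM straight to a player. To save such a stream as a playable file, the bytes need a WAV header. A minimal sketch using only the standard library follows. It assumes the stream is 16-bit signed little-endian mono at 44.1 kHz, matching the play flags above; the server's actual byte order is not confirmed here. The helper name pcm_to_wav_bytes is illustrative.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 44100,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the file bytes."""
    if len(pcm) % (channels * sample_width):
        raise ValueError("PCM length is not a whole number of frames")
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


# Half a second of silence stands in for the bytes a streaming response would yield.
silence = b"\x00\x00" * 22050
wav_bytes = pcm_to_wav_bytes(silence)

# Read the result back to confirm the header describes the audio correctly.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getframerate(), check.getnchannels(), check.getnframes())
```

In practice the chunks received from the streaming endpoint would be concatenated and passed in place of the silence buffer.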

Python

    import requests

    resp = requests.post("http://localhost:8091/v1/audio/speech", json={
        "input": "Hello.",
        "voice": "default",
        "ref_audio": "https://...",
        "ref_text": "Reference text"
    })
    with open("out.wav", "wb") as f:
        f.write(resp.content)

    # OpenAI SDK
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
    client.audio.speech.create(model="fishaudio/s2-pro", voice="default", input="Hello.").stream_to_file("out.wav")

SGLang format: "references": [{"audio_path": "...", "text": "..."}]
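Request bodies for the vLLM-style endpoint can be checked on the client before sending. The sketch below enforces the constraints this document gives in its Request Parameters section: speed between 0.25 and 4.0, the listed output formats, and streaming only with response_format="pcm". Whether the server rejects such requests itself is not stated in the source, so treat this as a client-side convenience. The helper name build_speech_payload is hypothetical.

```python
from typing import Any, Dict, Optional

FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}


def build_speech_payload(text: str, *, voice: str = "default",
                         response_format: str = "wav", speed: float = 1.0,
                         stream: bool = False,
                         ref_audio: Optional[str] = None,
                         ref_text: Optional[str] = None) -> Dict[str, Any]:
    """Build a JSON body for POST /v1/audio/speech, validating documented limits."""
    if not text:
        raise ValueError("input text is required")
    if response_format not in FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if stream and response_format != "pcm":
        raise ValueError('streaming requires response_format="pcm"')
    payload: Dict[str, Any] = {
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
        "stream": stream,
    }
    # Only send the cloning fields when they were actually provided.
    if ref_audio is not None:
        payload["ref_audio"] = ref_audio
    if ref_text is not None:
        payload["ref_text"] = ref_text
    return payload


print(build_speech_payload("Hello.", speed=1.2))
```

The returned dictionary can be passed as the json argument of requests.post, as in the Python example in this section.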

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string | Required | Text to synthesize |
| voice | string | "default" | Voice |
| response_format | string | "wav" | wav/mp3/flac/pcm/aac/opus |
| speed | float | 1.0 | Speech speed (0.25-4.0) |
| stream | bool | false | Streaming (requires response_format="pcm") |
| ref_audio | string | null | Reference audio URL/base64/file:// |
| ref_text | string | null | Reference audio transcription |
| max_new_tokens | int | 2048 | Max generation tokens |
| temperature | float | null | Sampling temperature |
| top_p | float | null | Nucleus sampling |
| top_k | int | null | Top-K |
| repetition_penalty | float | null | Repetition penalty |
| seed | int | null | Random seed |

Emotion Tags

Embed [tag] anywhere in the text; 15,000+ free-form tags are supported:

    [excited]Today is a great day![pause]
    [whisper in small voice]But there's a secret…
    [professional broadcast tone]Welcome.

Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]

Full reference: references/emotion-tags.md
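Because tags are free-form text in square brackets, it can be useful to list the tags a script uses, or to recover the plain text for subtitles. The sketch below does both with a regular expression. It assumes tags are never nested and that literal square brackets do not otherwise appear in the spoken text; the model's own tag parsing may differ. The function name split_tags is illustrative.

```python
import re
from typing import List, Tuple

# One tag: an opening bracket, any run of non-bracket characters, a closing bracket.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text: str) -> Tuple[List[str], str]:
    """Return (tags in order of appearance, text with the tags removed)."""
    tags = TAG_PATTERN.findall(text)
    plain = TAG_PATTERN.sub("", text)
    # Collapse whitespace left behind where a tag stood between words.
    plain = re.sub(r"\s+", " ", plain).strip()
    return tags, plain


tags, plain = split_tags(
    "[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…"
)
print(tags)
print(plain)
```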

Multi-Speaker

    <|speaker:0|>Hello, welcome. <|speaker:1|>Thank you, glad to be here.
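A dialogue held as structured data can be rendered into the speaker-token form shown above. The sketch below joins turns with a single space, mirroring the example; whether other separators such as newlines are also accepted is not stated in the source. The function name format_dialogue is hypothetical.

```python
from typing import Iterable, Tuple


def format_dialogue(turns: Iterable[Tuple[int, str]]) -> str:
    """Render (speaker_id, line) pairs as a single multi-speaker input string."""
    parts = []
    for speaker_id, line in turns:
        if speaker_id < 0:
            raise ValueError("speaker_id must be non-negative")
        # Each turn is prefixed with its speaker token, as in the example above.
        parts.append(f"<|speaker:{speaker_id}|>{line.strip()}")
    return " ".join(parts)


script = format_dialogue([
    (0, "Hello, welcome."),
    (1, "Thank you, glad to be here."),
])
print(script)
```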

LoRA Fine-tuning

⚠️ Not recommended for models after RL. Fine-tune only the Slow AR:

    # Preparation: data/SPK1/.mp3 + .lab
    python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
    python tools/llama/build_dataset.py --input data --output data/protos
    python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
    python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/

See references/finetune.md

Important Notes

Voice cloning: use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription.

Without reference audio, voice tends to sound
