📦 Voice Recognition — 语音识别
v1.1.0
Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages...
🎤 Voice Recognition: Smart Auto-Model Selection
Transcribe audio to text using local OpenAI Whisper. No API keys, no internet required, 100% private.
Smart auto-selection picks the best model based on your audio characteristics, so you never have to think about which model to use.
Quick Start
# Auto mode: analyzes audio, picks the best model automatically
scripts/transcribe.py voice.ogg

# Force a specific model
scripts/transcribe.py voice.ogg --model small

# Specify language (auto-detected if omitted)
scripts/transcribe.py voice.ogg --language zh   # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en   # English
scripts/transcribe.py voice.ogg --language yue  # Cantonese

# Show segment timestamps
scripts/transcribe.py voice.ogg --segments

# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt
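The flags shown in Quick Start could be wired up roughly as follows. This is only a sketch of a matching command-line interface, not the actual contents of scripts/transcribe.py: the `build_parser` name, the `output` destination for `-o`, and the list of accepted model names are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed in Quick Start."""
    parser = argparse.ArgumentParser(
        prog="transcribe.py",
        description="Transcribe an audio file with a local Whisper model.",
    )
    parser.add_argument("audio", help="path to the audio file")
    # Assumption: the accepted names are the five models in the Model Reference table.
    parser.add_argument(
        "--model",
        choices=["tiny", "base", "small", "medium", "large"],
        default=None,
        help="force a model; when omitted, auto-selection is used",
    )
    parser.add_argument(
        "--language",
        default=None,
        help="language hint such as zh, en or yue; auto-detected when omitted",
    )
    parser.add_argument(
        "--segments",
        action="store_true",
        help="print segment timestamps after the transcript",
    )
    # Only the short form -o appears in the document, so no long form is defined.
    parser.add_argument("-o", dest="output", default=None, help="save transcript to a file")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["voice.ogg", "--model", "small", "--segments"])
print(args.audio, args.model, args.segments)
```

Leaving `--model` and `--language` at `None` by default lets the rest of the script tell "not specified" apart from an explicit choice, which is what auto-selection and auto-detection need.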
Smart Auto-Selection
The script analyzes audio duration and complexity, then selects the model automatically:

| Audio Characteristic | Model Used | Why |
|---|---|---|
| Short (<10s), clean speech | base | Fast (2-3s). Accurate enough for simple content. |
| Short (<10s), mixed languages | small | Better multilingual handling for code-switching. |
| Medium (10-60s), clean | base | Balanced speed and accuracy. |
| Medium (10-60s), mixed | small | Handles accents and language transitions. |
| Long (1-2min) | small | Maintains context, still fast enough. |
| Very long (2min+) | medium | Maximum accuracy for extended recordings. |

You don't need to think about models. Just send audio.
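The selection rules above can be expressed as a small function. This is a minimal sketch, not the script's actual code: the `select_model` name, the `mixed` flag, and the exact handling of the 60s and 120s boundaries are assumptions, and how the script decides that audio is "mixed" is not described here.

```python
def select_model(duration_s: float, mixed: bool = False) -> str:
    """Pick a Whisper model name from duration and a mixed-language flag.

    Hypothetical helper; thresholds follow the auto-selection rules above.
    """
    if duration_s >= 120:  # very long (2min+)
        return "medium"
    if duration_s >= 60:   # long (1-2min)
        return "small"
    # Short (<10s) and medium (10-60s) audio follow the same rule:
    # clean speech uses base, mixed languages or accents use small.
    return "small" if mixed else "base"


print(select_model(7.5))               # short, clean
print(select_model(32.0, mixed=True))  # medium, mixed
print(select_model(300.0))             # very long
```

Under these rules the short and medium rows collapse into one branch, because both choose between base and small on the mixed-language flag alone.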
Installation
Prerequisites
- Python 3.10+
- pip (Python package manager)

Via bundled installer
python3 scripts/install.py

Manual
pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu

Using requirements.txt
pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu
Note: The first run downloads the Whisper model (~139MB for base, ~461MB for small). Subsequent runs use the cached model (~/.cache/whisper/) and skip the download.
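To check ahead of time whether a run will trigger a download, you could look for the model file in the cache directory. The sketch below assumes weights are stored as `<name>.pt` under ~/.cache/whisper/; that file naming and the `cached_model_path` helper are assumptions for illustration, so verify them against your own cache.

```python
from pathlib import Path
from typing import Optional


def cached_model_path(name: str, cache_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the cached weights file for `name`, or None if it is absent.

    Assumption: weights are saved as '<name>.pt' inside the cache directory.
    """
    root = cache_dir if cache_dir is not None else Path.home() / ".cache" / "whisper"
    candidate = root / f"{name}.pt"
    return candidate if candidate.is_file() else None


for model_name in ("base", "small"):
    status = "cached" if cached_model_path(model_name) else "will download on first use"
    print(f"{model_name}: {status}")
```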
Model Reference

| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 72MB | ⚡⚡⚡ | ⭐⭐ | Real-time preview, very short clips |
| base | 139MB | ⚡⚡ | ⭐⭐⭐ | General use (auto-select default for short audio) |
| small | 461MB | ⚡ | ⭐⭐⭐⭐ | Mixed languages, accents (auto-select for long/complex) |
| medium | 1.5GB | 🐢 | ⭐⭐⭐⭐⭐ | Maximum accuracy, long recordings |
| large | 2.9GB | 🐢 | ⭐⭐⭐⭐⭐ | Research-grade transcription |

Language Support
Whisper supports 99 languages including:
🇨🇳 Chinese (Mandarin, Cantonese) 🇺🇸 English 🇪🇸 Spanish 🇯🇵 Japanese 🇰🇷 Korean 🇫🇷 French 🇩🇪 German
The language is auto-detected by default. Use --language to provide a hint for better accuracy.
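For readers calling Whisper from Python rather than through the script, the same hint can be passed to the library directly. This sketch uses the `openai-whisper` package's `load_model` and `transcribe` calls; the wrapper name `transcribe_with_hint` is hypothetical, and nothing here claims to be how scripts/transcribe.py is written.

```python
from typing import Optional


def transcribe_with_hint(
    path: str, model_name: str = "base", language: Optional[str] = None
) -> dict:
    """Transcribe `path`, optionally hinting the spoken language.

    Hypothetical wrapper around the openai-whisper package.
    """
    # Imported lazily so this module can be loaded without whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    # language=None leaves detection to Whisper; a code such as "zh" or "en"
    # skips detection and decodes in that language.
    result = model.transcribe(path, language=language)
    return {"text": result["text"].strip(), "language": result["language"]}
```

A call such as `transcribe_with_hint("voice.ogg", language="zh")` corresponds to running the script with `--language zh`.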
Features

| Feature | Description |
|---|---|
| 🔒 100% Private | Everything runs locally. No data leaves your machine. |
| 🆓 No API Costs | Free unlimited transcription. No quotas, no keys. |
| 🌐 99 Languages | Supports virtually all major world languages. |
| 🧠 Smart Auto-Model | Analyzes audio → picks optimal model automatically. |
| ⚡ Fast by Default | Short clips → base model (2-3s). Long clips → small/medium. |
| 🎯 Accurate When Needed | Complex/mixed audio automatically upgrades the model. |
| 📊 Segment Timestamps | Sentence-level timing for long recordings. |
| 📁 Multiple Formats | OGG, WAV, MP3, M4A, FLAC, OPUS and more. |

Supported Audio Formats

| Format | Extension | Notes |
|---|---|---|
| OGG Opus | .ogg | Common voice message format ✅ |
| WAV | .wav | Uncompressed, high quality |
| MP3 | .mp3 | Compressed audio |
| M4A | .m4a | Apple/MPEG-4 audio |
| FLAC | .flac | Lossless compressed |
| OPUS | .opus | Pure Opus stream |

Usage Examples
Quick transcription (auto model)
$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱ Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...
Transcription in context
# Chinese
scripts/transcribe.py voice.ogg --language zh

# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments

# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg

# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt

# Force high accuracy
scripts/transcribe.py important.wav --model medium
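When a whole folder needs transcribing, the single-file commands above can be generated in a loop. The sketch below only builds the command lines; the `build_commands` helper and the one-transcript-per-file naming are assumptions for illustration, and the extension set is taken from the supported formats listed in this document.

```python
from pathlib import Path

# Extensions named in the Supported Audio Formats section.
SUPPORTED = {".ogg", ".wav", ".mp3", ".m4a", ".flac", ".opus"}


def build_commands(folder: Path, out_dir: Path) -> list[list[str]]:
    """Build one transcribe.py command per supported audio file in `folder`."""
    commands = []
    for audio in sorted(folder.iterdir()):
        if audio.is_file() and audio.suffix.lower() in SUPPORTED:
            transcript = out_dir / f"{audio.stem}.txt"
            commands.append(
                ["python3", "scripts/transcribe.py", str(audio), "-o", str(transcript)]
            )
    return commands


for cmd in build_commands(Path("."), Path(".")):
    print(" ".join(cmd))
```

Each returned list can be handed to `subprocess.run(cmd, check=True)`; building the commands separately makes it easy to review them before anything runs.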
Output with segments
$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱ Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?

📝 Segments:
[0.0s - 3.6s] Now I'm sending this voice message
[3.6s - 7.4s] to XiaoA, can you recognize what I said?
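The segment lines above follow a simple pattern that is easy to reproduce when post-processing results yourself. This sketch assumes each segment is a dict with `start`, `end` and `text` keys, which is the shape Whisper returns in `result["segments"]`; the `format_segments` helper is hypothetical and not taken from the script.

```python
def format_segments(segments: list[dict]) -> list[str]:
    """Render segments as '[start - end] text' lines with one decimal place."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()  # Whisper segment text often starts with a space
        lines.append(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {text}")
    return lines


example = [
    {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
    {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
]
for line in format_segments(example):
    print(line)
```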
Troubleshooting

| Problem | Solution |
|---|---|
| No module error | Use the venv Python: python3 scripts/transcribe.py or run scripts/install.py |
| Slow transcription | The first run downloads and caches the model (~139-461MB). Normal for first run. |
| Wrong language detected | Pass --language en or --language zh for a hint |
| Background noise | Use --model small or --model medium for noisy environments |

Token Savings Examples
Scenario Clou