Voice TTS
Local GPU-powered real-time text-to-speech pipeline with a WebSocket API, designed to voice AI agents that stream text in chunks.
Features
- Streaming input: accepts partial text as it is generated by the LLM/agent.
- Streaming output: returns PCM audio chunks over WebSocket as soon as they are synthesized.
- Voice cloning: single speaker cloned from reference audio, with optional per-emotion references.
- Interrupt / stop: agent can immediately stop playback when the user interrupts the AI.
- Emotion control: switch emotion on the fly (requires matching reference audio or supported backend).
- Local GPU: runs entirely on your NVIDIA GPU (RTX 3090 / 3060 compatible).
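One way to picture the streaming-input feature: partial text from the agent is buffered until a sentence boundary appears, and only complete sentences are handed to the synthesizer. The sketch below illustrates that idea in plain Python. It is not taken from this project's code; the function name `iter_sentences` and the punctuation-based splitting rule are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumption: sentence-ending punctuation followed by whitespace is a safe split point.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from an iterable of partial text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is followed by a boundary, so it is complete.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing; keep it for the next chunk.
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hello the", "re. How are", " you? I am", " fine"]
    for sentence in iter_sentences(stream):
        print(sentence)
```

Splitting on sentence boundaries trades a little latency for more natural prosody; a real pipeline might also split on commas or a maximum character count to bound the delay before the first audio chunk.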
Project status
- Working WebSocket server with streaming text input, streaming audio output, and instant stop/resume.
- F5-TTS backend installed, GPU-ready, and producing real audio (models/F5TTS_v1_Base/ downloaded).
- Dummy backend available for fast offline tests.
- Startup warm-up caches the default reference and primes CUDA.
- Next: multilingual evaluation, latency optimization, and client examples.
Quick start
# Create virtual environment (Python 3.10-3.12 recommended)
python3.11 -m venv .venv
source .venv/bin/activate
# Install PyTorch with CUDA 12.6 support first
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126
# Install remaining dependencies
pip install -r requirements.txt
# (Optional) Download the F5-TTS model beforehand
python scripts/download_f5_tts.py --model F5TTS_v1_Base
# Run the server
python -m voice_tts.main
# Or run in dummy test mode
TTS_BACKEND=dummy python -m voice_tts.main
The server listens on ws://localhost:8765/ws.
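If the F5-TTS backend fails to start or synthesis is unexpectedly slow, it can help to confirm that the CUDA build of PyTorch was installed and that it sees the GPU. The helper below is a small diagnostic sketch, not part of this repository; `describe_gpu` is a hypothetical name. It uses only `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.get_device_name()`, and it returns a report instead of failing when PyTorch is missing.

```python
import importlib.util


def describe_gpu() -> dict:
    """Report whether PyTorch is installed and whether it can see a CUDA device."""
    info = {
        "torch_installed": False,
        "cuda_build": None,
        "cuda_available": False,
        "device": None,
    }
    # Avoid an ImportError so the check also works in a fresh environment.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    # torch.version.cuda is None for CPU-only wheels.
    info["cuda_build"] = torch.version.cuda
    if torch.cuda.is_available():
        info["cuda_available"] = True
        info["device"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_gpu().items():
        print(f"{key}: {value}")
```

A `cuda_build` of `None` with `torch_installed` set to `True` usually means a CPU-only wheel was picked up, in which case reinstalling with the `--index-url` shown above is the likely fix.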
WebSocket protocol
See the full documentation in docs/03_websocket_protocol.md.
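Because the server returns raw PCM chunks, a client that wants to save or inspect the audio has to add a container itself. The sketch below collects chunks into an in-memory WAV file using only the standard library. The function name `pcm_chunks_to_wav` is hypothetical, and the defaults (24 kHz, mono, 16-bit signed samples) are assumptions chosen for illustration; the actual sample format should be taken from the protocol documentation.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(
    chunks: Iterable[bytes],
    sample_rate: int = 24000,  # assumed, check the protocol docs
    channels: int = 1,         # assumed mono
    sample_width: int = 2,     # assumed 16-bit samples (2 bytes)
) -> bytes:
    """Wrap raw PCM chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        # Chunks are appended in arrival order; the header is patched on close.
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()


if __name__ == "__main__":
    # Two fake chunks of silence stand in for audio received over the WebSocket.
    fake_chunks = [b"\x00\x00" * 100, b"\x00\x00" * 50]
    data = pcm_chunks_to_wav(fake_chunks)
    print(len(data), "bytes of WAV data")
```

For live playback a client would normally feed each chunk to an audio device as it arrives and discard any queued chunks when a stop is issued; writing a WAV file is mainly useful for debugging and offline tests.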
Architecture and roadmap