Voice TTS

Local GPU-powered real-time text-to-speech pipeline with a WebSocket API, designed to voice AI agents that stream text in chunks.

Features

Streaming input: accepts partial text as it is generated by the LLM/agent.
Streaming output: returns PCM audio chunks over WebSocket as soon as they are synthesized.
Voice cloning: single speaker cloned from reference audio, with optional per-emotion references.
Interrupt / stop: agent can immediately stop playback when the user interrupts the AI.
Emotion control: switch emotion on the fly (requires matching reference audio or supported backend).
Local GPU: runs entirely on your NVIDIA GPU (RTX 3090 / 3060 compatible).
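
Because the server returns raw PCM rather than a container format, a client usually has to wrap the chunks before saving or inspecting them. The following is a minimal sketch of that step using only the Python standard library. The sample rate, sample width, and channel count are assumptions made for illustration (this README does not state them), and the function names are hypothetical; check docs/03_websocket_protocol.md for the real audio format.

```python
import io
import wave
from typing import Iterable

# ASSUMED audio format, not confirmed by this README.
SAMPLE_RATE_HZ = 24_000   # assumption
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit signed PCM
CHANNELS = 1              # assumption: mono


def pcm_duration_seconds(num_bytes: int) -> float:
    """Playback time represented by num_bytes of raw PCM in the assumed format."""
    return num_bytes / (SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


def pcm_chunks_to_wav(chunks: Iterable[bytes]) -> bytes:
    """Concatenate raw PCM chunks and wrap them in an in-memory WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        for chunk in chunks:
            # Chunks are appended in arrival order; the header is fixed up on close.
            wav.writeframes(chunk)
    return buffer.getvalue()
```

Writing the returned bytes to a .wav file is a quick way to verify by ear that the chunks arrived in order and with the expected format; if playback sounds sped up or slowed down, the assumed sample rate is likely wrong.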

Project status

Working WebSocket server with streaming text, audio streaming, and instant stop/resume.
F5-TTS backend installed, GPU-ready, and producing real audio (models/F5TTS_v1_Base/ downloaded).
Dummy backend available for fast offline tests.
Startup warm-up caches the default reference and primes CUDA.
Next: multilingual evaluation, latency optimization, and client examples.

Quick start

# Create virtual environment (Python 3.10-3.12 recommended)
python3.11 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA 12.6 support first
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install remaining dependencies
pip install -r requirements.txt

# (Optional) Download the F5-TTS model beforehand
python scripts/download_f5_tts.py --model F5TTS_v1_Base

# Run the server
python -m voice_tts.main

# Or run in dummy test mode
TTS_BACKEND=dummy python -m voice_tts.main

The server listens on ws://localhost:8765/ws.

WebSocket protocol

See the full protocol documentation in docs/03_websocket_protocol.md.
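
To give a rough idea of what a client looks like, here is a minimal sketch that streams text chunks to the server and collects the binary audio frames it sends back. The JSON message shapes (`"type": "text"`, `"type": "stop"`) and the helper names are hypothetical placeholders, not the documented protocol; the file referenced above is authoritative. The sketch also assumes the third-party `websockets` package is installed and that audio arrives as binary frames.

```python
import asyncio
import json
from typing import Iterable

SERVER_URL = "ws://localhost:8765/ws"  # address from the Quick start section


def text_message(chunk: str) -> str:
    """Encode one partial-text chunk. HYPOTHETICAL schema, not the documented one."""
    return json.dumps({"type": "text", "text": chunk})


def stop_message() -> str:
    """Encode an interrupt request. HYPOTHETICAL schema, not the documented one."""
    return json.dumps({"type": "stop"})


async def speak(chunks: Iterable[str], idle_timeout: float = 2.0) -> bytes:
    """Send text chunks, then collect binary frames until the server goes quiet."""
    import websockets  # third-party; assumed installed (pip install websockets)

    audio = bytearray()
    async with websockets.connect(SERVER_URL) as ws:
        for chunk in chunks:
            await ws.send(text_message(chunk))
        while True:
            try:
                frame = await asyncio.wait_for(ws.recv(), timeout=idle_timeout)
            except asyncio.TimeoutError:
                break  # no frame within idle_timeout: treat synthesis as finished
            except websockets.ConnectionClosed:
                break  # server closed the connection
            if isinstance(frame, bytes):
                audio.extend(frame)  # ASSUMED: binary frames carry raw PCM
    return bytes(audio)
```

With the server running, `asyncio.run(speak(["Hello, ", "world."]))` would return the concatenated audio bytes. The idle timeout is a simplification for the sketch; a real client would rely on whatever end-of-utterance signal the protocol defines, and would send the stop message on the same connection when the user interrupts.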
