
Voice TTS

Local GPU-powered real-time text-to-speech pipeline with a WebSocket API, designed to voice AI agents that stream text in chunks.

Features

  • Streaming input: accepts partial text as it is generated by the LLM/agent.
  • Streaming output: returns PCM audio chunks over WebSocket as soon as they are synthesized.
  • Voice cloning: single speaker cloned from reference audio, with optional per-emotion references.
  • Interrupt / stop: agent can immediately stop playback when the user interrupts the AI.
  • Emotion control: switch emotion on the fly (requires matching reference audio or supported backend).
  • Local GPU: runs entirely on your NVIDIA GPU (RTX 3090 / 3060 compatible).
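
The streaming-input feature implies that partial text has to be held back until a speakable unit is available. The following is a minimal sketch of one way to do that, assuming sentence-final punctuation followed by whitespace is a good enough flush boundary. `SentenceBuffer` is a hypothetical name used for illustration and is not taken from the voice_tts code base, which may segment text differently.

```python
import re

# Hypothetical helper (not part of voice_tts): collects streamed text
# fragments and releases complete sentences for synthesis.
# A sentence is considered complete once its closing punctuation is
# followed by whitespace, so "3.14" is not split at the decimal point.
_SENTENCE = re.compile(r"(.+?[.!?]+)\s+", re.DOTALL)


class SentenceBuffer:
    def __init__(self) -> None:
        self._pending = ""

    def feed(self, fragment: str) -> list[str]:
        """Add a fragment and return any sentences that are now complete."""
        self._pending += fragment
        sentences = []
        while True:
            match = _SENTENCE.match(self._pending)
            if match is None:
                break
            sentences.append(match.group(1).strip())
            self._pending = self._pending[match.end():]
        return sentences

    def flush(self) -> str:
        """Return the remaining text, e.g. when the LLM stream ends."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

An agent would call `feed()` for every token or fragment it receives and send each returned sentence to the server, then call `flush()` once generation ends so a trailing sentence without following whitespace is not lost.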

Project status

  • Working WebSocket server with streaming text, audio streaming, and instant stop/resume.
  • F5-TTS backend installed, GPU-ready, and producing real audio (models/F5TTS_v1_Base/ downloaded).
  • Dummy backend available for fast offline tests.
  • Startup warm-up caches the default reference and primes CUDA.
  • Next: multilingual evaluation, latency optimization, and client examples.

Quick start

# Create virtual environment (Python 3.10-3.12 recommended)
python3.11 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA 12.6 support first
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install remaining dependencies
pip install -r requirements.txt

# (Optional) Download the F5-TTS model beforehand
python scripts/download_f5_tts.py --model F5TTS_v1_Base

# Run the server
python -m voice_tts.main

# Or run in dummy test mode
TTS_BACKEND=dummy python -m voice_tts.main

The server listens on ws://localhost:8765/ws.
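
As a rough illustration of how a client might talk to that endpoint, here is a sketch built on the third-party `websockets` package. The JSON message shapes (`"type": "text"`, `"type": "stop"`) and the convention that audio arrives as binary frames are assumptions made for this example only; the authoritative schema is the one in docs/03_websocket_protocol.md. The `speak` function and its `idle_timeout` parameter are likewise hypothetical.

```python
import asyncio
import json

SERVER_URL = "ws://localhost:8765/ws"  # address given in this README


def text_message(chunk: str) -> str:
    # ASSUMPTION: a JSON envelope with "type" and "text" keys.
    # Check docs/03_websocket_protocol.md for the real schema.
    return json.dumps({"type": "text", "text": chunk})


def stop_message() -> str:
    # ASSUMPTION: a bare "stop" message interrupts playback.
    return json.dumps({"type": "stop"})


async def speak(chunks, on_audio, idle_timeout: float = 2.0) -> None:
    """Send text chunks, then pass received binary frames to on_audio."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(SERVER_URL) as ws:
        for chunk in chunks:
            await ws.send(text_message(chunk))
        while True:
            try:
                frame = await asyncio.wait_for(ws.recv(), timeout=idle_timeout)
            except asyncio.TimeoutError:
                break  # no audio for a while: assume synthesis finished
            except websockets.ConnectionClosed:
                break  # server closed the connection
            if isinstance(frame, bytes):  # ASSUMPTION: binary frame = PCM
                on_audio(frame)
```

With a server running, something like `asyncio.run(speak(["Hello there. "], received.append))` would collect the audio frames into a list named `received`. Sending `stop_message()` on the same connection is how this sketch would model a user interruption.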

WebSocket protocol

See the full protocol documentation in docs/03_websocket_protocol.md.
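
For quick manual checks it can help to dump the received PCM chunks into a WAV file. The sketch below uses only the standard library. The defaults of 24 kHz, mono, 16-bit samples are assumptions, not values stated in this README; confirm the actual audio format in the protocol documentation before relying on them. `pcm_chunks_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate: int = 24000,
                      channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM chunks in a WAV container and return the file bytes.

    ASSUMPTION: 24 kHz, mono, 16-bit little-endian PCM. Adjust the
    parameters to whatever the server actually emits.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buffer.getvalue()
```

Writing the returned bytes to a file, for example with `pathlib.Path("out.wav").write_bytes(...)`, gives something any audio player can open. If the playback sounds too fast or too slow, the assumed sample rate is the first thing to check.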

Architecture and roadmap