Eugene Sukhodolskiy authored 12 days ago

Voice TTS

Local GPU-powered real-time text-to-speech pipeline with a WebSocket API, designed to voice AI agents that stream text in chunks.

Features

  • Streaming input: accepts partial text as it is generated by the LLM/agent.
  • Streaming output: returns PCM audio chunks over WebSocket as soon as they are synthesized.
  • Voice cloning: single speaker cloned from reference audio, with optional per-emotion references.
  • Interrupt / stop: the agent can stop playback immediately when the user interrupts it.
  • Emotion control: switch emotion on the fly (requires matching reference audio or supported backend).
  • Local GPU: runs entirely on your NVIDIA GPU (RTX 3090 / 3060 compatible).
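
The streaming-input feature means a client forwards text while the LLM is still generating it. The sketch below shows one way a client might group incoming fragments into sentence-sized pieces before sending them. `SentenceBuffer` is a hypothetical client-side helper and is not part of this repository; the server may accept arbitrary fragments, in which case this step is optional.

```python
import re

# Hypothetical client-side helper; NOT part of voice_tts.
# A sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


class SentenceBuffer:
    """Accumulates streamed text fragments and releases complete sentences."""

    def __init__(self) -> None:
        self._pending = ""

    def feed(self, fragment: str) -> list[str]:
        """Add a fragment and return every sentence that is now complete."""
        self._pending += fragment
        sentences = []
        while True:
            match = _SENTENCE_END.search(self._pending)
            if match is None:
                break
            end = match.end()
            sentences.append(self._pending[:end].strip())
            self._pending = self._pending[end:]
        return sentences

    def flush(self) -> str:
        """Return whatever is left when the LLM stream ends."""
        rest, self._pending = self._pending.strip(), ""
        return rest
```

Call `feed()` for each fragment the LLM emits, send each returned sentence to the server, and call `flush()` once the stream ends so the trailing text is not lost.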

Project status

  • Working WebSocket server with streaming text, audio streaming, and instant stop/resume.
  • F5-TTS backend installed, GPU-ready, and producing real audio (models/F5TTS_v1_Base/ downloaded).
  • Dummy backend available for fast offline tests.
  • Startup warm-up caches the default reference and primes CUDA.
  • Next: multilingual evaluation, latency optimization, and client examples.

Quick start

# Create virtual environment (Python 3.10-3.12 recommended)
python3.11 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA 12.6 support first
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install remaining dependencies
pip install -r requirements.txt

# (Optional) Download the F5-TTS model beforehand
python scripts/download_f5_tts.py --model F5TTS_v1_Base

# Run the server
python -m voice_tts.main

# Or run in dummy test mode
TTS_BACKEND=dummy python -m voice_tts.main

The server listens on ws://localhost:8765/ws.
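
To try the endpoint from Python, a client could look roughly like the sketch below. It is only an illustration under several assumptions: the JSON field names (`type`, `text`) are placeholders rather than the real message schema (see docs/03_websocket_protocol.md for that), the audio is assumed to be 16-bit mono PCM at 24 kHz, the end of a reply is guessed from an idle timeout, and the third-party `websockets` package is assumed to be installed.

```python
import asyncio
import io
import json
import wave

URI = "ws://localhost:8765/ws"  # address given in this README


def build_text_message(text: str) -> str:
    # ASSUMPTION: "type" and "text" are placeholder field names,
    # not the schema defined in docs/03_websocket_protocol.md.
    return json.dumps({"type": "text", "text": text})


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    # ASSUMPTION: the server sends 16-bit mono PCM; adjust the rate,
    # sample width and channel count to match the real output format.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


async def synthesize(text: str, idle_timeout: float = 2.0) -> bytes:
    # Imported lazily so the helpers above work without the package.
    import websockets

    pcm = bytearray()
    async with websockets.connect(URI) as ws:
        await ws.send(build_text_message(text))
        while True:
            try:
                message = await asyncio.wait_for(ws.recv(), timeout=idle_timeout)
            except asyncio.TimeoutError:
                break  # ASSUMPTION: silence on the socket means the reply ended
            except websockets.ConnectionClosed:
                break  # server closed the connection
            if isinstance(message, bytes):  # binary frames carry audio
                pcm.extend(message)
    return bytes(pcm)
```

With the server running, `asyncio.run(synthesize("Hello"))` returns the collected audio bytes, and writing `pcm_to_wav(...)` of that result to a file gives something a regular audio player can open.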

WebSocket protocol

See the full documentation in docs/03_websocket_protocol.md.

Architecture and roadmap