voice/AGENTS.md at 2bff5aa1f671efb665fe1455e279c788b65ea29f

General

Communicate with the user in Russian. All explanations, reasoning, and feedback should be written in Russian unless explicitly asked otherwise.

Quick Commands

Run server (Fish Speech default): python -m voice_tts.main
Dummy backend for fast local tests: TTS_BACKEND=dummy python -m voice_tts.main
XTTS-v2 backend: TTS_BACKEND=xtts_v2 python -m voice_tts.main
Console script (installed): voice-tts
Health check: curl http://localhost:8765/health
Browser test client: cd examples && python -m http.server 8080 → open http://localhost:8080/client_browser.html
Browser test (dummy): TTS_BACKEND=dummy python -m voice_tts.main + the HTTP server from examples/

Project Layout

scripts/           — standalone utilities (benchmark, download)
src/voice_tts/     — package entry points
  main.py          — uvicorn.run app (the console-script target)
  config.py        — pydantic-settings (Settings class); .env is auto-loaded here
  api/server.py    — FastAPI + WebSocket session loop; _create_engine() picks backend by TTS_BACKEND env var
  api/protocol.py  — Pydantic msg models for /ws protocol
  session/state.py — SessionState, VoiceProfile
  tts/engine.py    — TTSEngine ABC, DummyTTSEngine
  tts/fish_speech_backend.py — Fish Speech 1.5 implementation
  tts/f5_backend.py        — F5-TTS v1 implementation
  tts/xtts_backend.py      — XTTS-v2 implementation (auto-downloads from Coqui)
  tts/segmenter.py         — sentence-break + comma fallback segmentation
  tts/utils.py              — preprocess_text_for_tts()
  audio/formats.py          — float32→PCM16→base64, WAV header generation
tests/               — pytest files
models/              — local model checkpoints (gitignored)
voices/              — reference audio (wavs/flac); .wav files gitignored but .lab files are kept and used by Fish Speech
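
The segmentation that tts/segmenter.py is described as doing (sentence breaks first, commas as a fallback) might look roughly like the sketch below. The function name `segment_text`, the `max_chars` threshold, and the regular expressions are assumptions made for illustration, not the module's actual API.

```python
import re

# Hypothetical threshold: sentences longer than this are split further at commas.
MAX_CHARS = 120

_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")
_COMMA = re.compile(r"(?<=[,;:])\s+")


def segment_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text at sentence boundaries; fall back to commas for long sentences."""
    segments: list[str] = []
    for sentence in _SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        if len(sentence) <= max_chars:
            segments.append(sentence)
            continue
        # Comma fallback: greedily pack comma-delimited clauses up to max_chars.
        current = ""
        for clause in _COMMA.split(sentence):
            candidate = f"{current} {clause}".strip()
            if current and len(candidate) > max_chars:
                segments.append(current)
                current = clause
            else:
                current = candidate
        if current:
            segments.append(current)
    return segments
```

Short sentences pass through unchanged; only a sentence that exceeds the threshold is broken at clause boundaries, which keeps each synthesis call small without cutting mid-phrase.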

Python & Dependencies

Python 3.10–3.12 is required (set in pyproject.toml). PyTorch must be installed with CUDA support before other deps:

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
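
Before starting the server, it can help to confirm that the interpreter and PyTorch match the requirements above. The following is a minimal sketch; the helper names `python_supported` and `cuda_status` are hypothetical and not part of the project.

```python
import importlib.util
import sys


def python_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True for Python 3.10 through 3.12, the range stated above."""
    return (3, 10) <= version <= (3, 12)


def cuda_status() -> str:
    """Describe the PyTorch/CUDA state without failing if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        return f"CUDA available (torch {torch.__version__}, CUDA {torch.version.cuda})"
    return f"torch {torch.__version__} installed, but CUDA is not available"


if __name__ == "__main__":
    print("Python OK:", python_supported())
    print(cuda_status())
```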

Configuration

All settings live in config.py; the Settings class auto-loads from .env via pydantic-settings.

Key variables:

| Variable | Default | Notes |
| --- | --- | --- |
| `TTS_BACKEND` | `fish_speech` | One of: dummy / f5_tts / xtts_v2 / fish_speech. Switching backends requires a clean restart (engine is built lazily on first connection). |
| `TTS_MODEL_PATH` | — | Fish Speech checkpoint folder (contains model.pth, firefly-gan-vq-fsq-8x1024-21hz-generator.pth, tokenizer.tiktoken, config.json) |
| `TTS_VOCAB_PATH` | — | Fish Speech v1.5 source tree path (used to import firefly_gan / FSQ modules) |
| `TTS_MODEL_NAME` | `tts_models/multilingual/multi-dataset/xtts_v2` | Coqui model manager path; xtts_v2 downloads this on first use |
| `FISH_COMPILE` | `false` | Avoid setting to true. Enables torch.compile but causes CUDAGraphs tensor-overwrite errors on repeated inference. |
| `FISH_CHUNK_LENGTH` | 200 | Chunk length for Fish Speech (100–300). Higher = more GPU work per call, higher latency. |
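
To illustrate the constraints in the table, here is a standard-library stand-in that reads and validates three of the variables. It is not the project's pydantic-settings `Settings` class; `SettingsSketch` and `load_settings` are hypothetical names, and the accepted truthy strings for `FISH_COMPILE` are an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

BACKENDS = ("dummy", "f5_tts", "xtts_v2", "fish_speech")


@dataclass(frozen=True)
class SettingsSketch:
    tts_backend: str = "fish_speech"
    fish_compile: bool = False
    fish_chunk_length: int = 200


def load_settings(env: Mapping[str, str] = os.environ) -> SettingsSketch:
    """Read and validate a few of the variables from the table above."""
    backend = env.get("TTS_BACKEND", "fish_speech")
    if backend not in BACKENDS:
        raise ValueError(f"TTS_BACKEND must be one of {BACKENDS}, got {backend!r}")

    chunk = int(env.get("FISH_CHUNK_LENGTH", "200"))
    if not 100 <= chunk <= 300:
        raise ValueError(f"FISH_CHUNK_LENGTH must be within 100-300, got {chunk}")

    # Assumed truthy spellings; the real Settings class may parse booleans differently.
    compile_flag = env.get("FISH_COMPILE", "false").strip().lower() in ("1", "true", "yes")
    return SettingsSketch(backend, compile_flag, chunk)
```

Passing a plain dict instead of `os.environ` makes the validation easy to exercise in tests, for example `load_settings({"TTS_BACKEND": "dummy"})`.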

WebSocket Protocol (`/ws`)

Server at ws://localhost:8765/ws
Messages are JSON. Client-sent types: init, text, flush, stop, emotion, config.
Server sends back: status (session_ready / segment_started / stopped / config_updated), audio (sample_rate + base64 data), plus error messages on failure.
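
The message shapes above can be sketched with the standard library alone. Only the message type names and the audio fields (sample_rate, base64 data) come from this document. The `"type"` discriminator key, the `text` field name, and the 16-bit little-endian PCM layout of the decoded payload are assumptions; api/protocol.py is the authoritative definition.

```python
import base64
import json
import struct

CLIENT_TYPES = {"init", "text", "flush", "stop", "emotion", "config"}


def client_message(msg_type: str, **fields) -> str:
    """Serialize a client message; the "type" key is an assumed discriminator."""
    if msg_type not in CLIENT_TYPES:
        raise ValueError(f"unknown client message type: {msg_type!r}")
    return json.dumps({"type": msg_type, **fields}, ensure_ascii=False)


def decode_audio(raw: str) -> tuple[int, list[int]]:
    """Return (sample_rate, PCM16 samples) from a server "audio" message."""
    msg = json.loads(raw)
    if msg.get("type") != "audio":
        raise ValueError("not an audio message")
    pcm = base64.b64decode(msg["data"])
    count = len(pcm) // 2  # two bytes per 16-bit sample; a trailing odd byte is ignored
    samples = list(struct.unpack(f"<{count}h", pcm[: count * 2]))
    return msg["sample_rate"], samples
```

A client would send something like `client_message("text", text="Hello")` followed by `client_message("flush")`, then pass each incoming audio frame to `decode_audio`.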

Testing

pytest tests/        # asyncio_mode = auto, paths in tests/

Fixtures and reference audio live in tests/. No external services are required: the dummy backend works for unit-level tests without a GPU. The Fish Speech backend needs the local checkpoint in models/fishaudio_fish-speech-1.5/ (gitignored).

Scripts

scripts/benchmark_backends.py — compare inference times across backends
scripts/download_f5_tts.py — downloads F5-TTS v1 model files into models/F5TTS_v1_Base/
scripts/benchmark_compile.py — torch.compile benchmarking utility

Important Gotchas

Engine is built lazily on first /ws connection in _create_engine() inside api/server.py. Changing TTS_BACKEND requires a full server restart, not just a message-level config change.
All GPU calls are serialized through one _synth_lock, so concurrent sessions share a single inference thread. This avoids CUDA contention and OOM on multi-GPU setups.
.env is gitignored but .env.example is the source of truth for supported variables. config.py line 50 sets env_file = ".env".
The dummy backend runs in a transient event loop (see _sync_synthesize in server.py:291). A test that modifies global asyncio state can therefore break other tests; run such tests independently or set asyncio_mode=auto.
Space insertion between text payloads. In _handle_text (server.py:152–157), a space is automatically inserted between consecutive payloads if neither side has whitespace at the join point. This prevents word merging when clients send word-by-word without trailing spaces (e.g. the browser client). Clients should not include leading/trailing spaces in payloads — the server handles spacing.
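
The join rule described in the last item can be sketched as a small pure function. This is an illustration of the stated behavior only; `join_payload` is a hypothetical name, and the actual logic lives in _handle_text.

```python
def join_payload(buffer: str, payload: str) -> str:
    """Append payload to buffer, adding one space only when neither side
    already has whitespace at the join point."""
    if buffer and payload and not buffer[-1].isspace() and not payload[0].isspace():
        return f"{buffer} {payload}"
    return buffer + payload


if __name__ == "__main__":
    text = ""
    for word in ["Hello", "world", "again"]:  # word-by-word, no trailing spaces
        text = join_payload(text, word)
    print(text)
```

When either side already carries whitespace at the boundary, the payload is appended unchanged, so clients that do send spaces do not end up with doubled ones.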