Commands:

- `python -m voice_tts.main`: start the server with the configured backend
- `TTS_BACKEND=dummy python -m voice_tts.main`: start with the dummy backend
- `TTS_BACKEND=xtts_v2 python -m voice_tts.main`: start with the XTTS-v2 backend
- `voice-tts`: console script
- `curl http://localhost:8765/health`: health check
- `cd examples && python -m http.server 8080`, then open http://localhost:8080/client_browser.html
- `TTS_BACKEND=dummy python -m voice_tts.main` plus the HTTP server from `examples/`

Project layout:

- `scripts/`: standalone utilities (benchmark, download)
- `src/voice_tts/`: package entry points
  - `main.py`: `uvicorn.run` app (the console-script target)
  - `config.py`: pydantic-settings (`Settings` class); `.env` is auto-loaded here
  - `api/server.py`: FastAPI + WebSocket session loop; `_create_engine()` picks the backend by the `TTS_BACKEND` env var
  - `api/protocol.py`: Pydantic message models for the `/ws` protocol
  - `session/state.py`: `SessionState`, `VoiceProfile`
  - `tts/engine.py`: `TTSEngine` ABC, `DummyTTSEngine`
  - `tts/fish_speech_backend.py`: Fish Speech 1.5 implementation
  - `tts/f5_backend.py`: F5-TTS v1 implementation
  - `tts/xtts_backend.py`: XTTS-v2 implementation (auto-downloads from Coqui)
  - `tts/segmenter.py`: sentence-break + comma fallback segmentation
  - `tts/utils.py`: `preprocess_text_for_tts()`
  - `audio/formats.py`: float32→PCM16→base64, WAV header generation
- `tests/`: pytest files
- `models/`: local model checkpoints (gitignored)
- `voices/`: reference audio (WAV/FLAC); `.wav` files are gitignored, but `.lab` files are kept and used by Fish Speech
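To illustrate how a backend can be picked from `TTS_BACKEND` and built only once, here is a minimal sketch. It is not the project's `_create_engine()`: the `synthesize` signature, the `_BACKENDS` registry, and the public `create_engine` name are assumptions made for illustration, and only a dummy engine is registered.

```python
import os
from abc import ABC, abstractmethod
from typing import Optional


class TTSEngine(ABC):
    """Stand-in for the engine base class; the method name is hypothetical."""

    @abstractmethod
    def synthesize(self, text: str) -> list:
        """Return mono float samples for `text`."""


class DummyTTSEngine(TTSEngine):
    """Produces silence, so it needs neither a GPU nor a model."""

    def synthesize(self, text: str) -> list:
        # Hypothetical behavior: 10 silent samples per input character.
        return [0.0] * (10 * len(text))


# Only the dummy backend is registered in this sketch; real backends
# (fish_speech, f5_tts, xtts_v2) would be registered here as well.
_BACKENDS = {"dummy": DummyTTSEngine}

_engine: Optional[TTSEngine] = None


def create_engine(backend: Optional[str] = None) -> TTSEngine:
    """Build the engine on first use and reuse it afterwards."""
    global _engine
    if _engine is None:
        name = backend or os.environ.get("TTS_BACKEND", "fish_speech")
        if name not in _BACKENDS:
            raise ValueError(f"unknown or unregistered TTS_BACKEND: {name!r}")
        _engine = _BACKENDS[name]()
    return _engine
```

Because the instance is cached, a later call with a different backend name still returns the first engine. This is the reason a backend switch needs a process restart in a design like this.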
Python 3.10–3.12 is required (set in pyproject.toml). PyTorch must be installed with CUDA support before other deps:
1. `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126`
2. `pip install -r requirements.txt`
All settings live in config.py; the Settings class auto-loads from .env via pydantic-settings.
Key variables:
| Variable | Default | Notes |
|---|---|---|
| `TTS_BACKEND` | `fish_speech` | One of: `dummy` / `f5_tts` / `xtts_v2` / `fish_speech`. Switching backends requires a clean restart (engine is built lazily on first connection). |
| `TTS_MODEL_PATH` | (none) | Fish Speech checkpoint folder (contains `model.pth`, `firefly-gan-vq-fsq-8x1024-21hz-generator.pth`, `tokenizer.tiktoken`, `config.json`) |
| `TTS_VOCAB_PATH` | (none) | Fish Speech v1.5 source tree path (used to import firefly_gan / FSQ modules) |
| `TTS_MODEL_NAME` | `tts_models/multilingual/multi-dataset/xtts_v2` | Coqui model manager path; `xtts_v2` downloads this on first use |
| `FISH_COMPILE` | `false` | Avoid setting to `true`. Enables `torch.compile` but causes CUDAGraphs tensor-overwrite errors on repeated inference. |
| `FISH_CHUNK_LENGTH` | `200` | Chunk length for Fish Speech (100–300). Higher = more GPU work per call, higher latency. |
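As a rough illustration of how these variables and defaults fit together, the sketch below reads them from a mapping such as `os.environ`. It deliberately uses only the standard library instead of pydantic-settings, so it is not the project's `Settings` class. The lowercase field names, the `load_settings` helper, and the strict range check on `FISH_CHUNK_LENGTH` are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_BACKENDS = ("dummy", "f5_tts", "xtts_v2", "fish_speech")


@dataclass(frozen=True)
class Settings:
    # Defaults are taken from the variable table; field names are hypothetical.
    tts_backend: str = "fish_speech"
    tts_model_path: Optional[str] = None
    tts_vocab_path: Optional[str] = None
    tts_model_name: str = "tts_models/multilingual/multi-dataset/xtts_v2"
    fish_compile: bool = False
    fish_chunk_length: int = 200


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment-style key/value pairs."""
    env = os.environ if env is None else env

    backend = env.get("TTS_BACKEND", "fish_speech")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"TTS_BACKEND must be one of {VALID_BACKENDS}")

    chunk = int(env.get("FISH_CHUNK_LENGTH", "200"))
    # Assumption: the documented 100-300 range is treated as a hard limit.
    if not 100 <= chunk <= 300:
        raise ValueError("FISH_CHUNK_LENGTH must be between 100 and 300")

    compile_flag = env.get("FISH_COMPILE", "false").strip().lower()

    return Settings(
        tts_backend=backend,
        tts_model_path=env.get("TTS_MODEL_PATH"),
        tts_vocab_path=env.get("TTS_VOCAB_PATH"),
        tts_model_name=env.get("TTS_MODEL_NAME", Settings.tts_model_name),
        fish_compile=compile_flag in ("1", "true", "yes"),
        fish_chunk_length=chunk,
    )
```

Calling `load_settings({})` yields the defaults, while `load_settings({"TTS_BACKEND": "dummy"})` selects the GPU-free backend.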
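WebSocket protocol (`/ws`):

- Endpoint: `ws://localhost:8765/ws`
- Client → server messages: `init`, `text`, `flush`, `stop`, `emotion`, `config`
- Server → client messages: `status` (`session_ready` / `segment_started` / `stopped` / `config_updated`), `audio` (`sample_rate` + base64 `data`), plus `error` messages on failure.

Run the tests with `pytest tests/` (`asyncio_mode = auto`; test paths are in `tests/`).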
Fixtures and reference audio live in tests/. No external services required — dummy backend works for unit-level tests without GPU. Fish Speech backends need the local checkpoint in models/fishaudio_fish-speech-1.5/ (gitignored).
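The `audio` messages of the `/ws` protocol carry a sample rate and base64 data, and `audio/formats.py` is described as converting float32 to PCM16 to base64. The sketch below shows one way such a round trip can work. The `type` key, mono little-endian 16-bit layout, scaling by 32767, and both helper names are assumptions, and 24000 Hz in the usage note is an arbitrary example value.

```python
import base64
import json
import struct


def encode_audio_message(samples, sample_rate):
    """Clip float samples to [-1, 1], pack as PCM16, and wrap in JSON."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    pcm = struct.pack(f"<{len(ints)}h", *ints)  # little-endian int16
    return json.dumps(
        {
            "type": "audio",  # assumed discriminator key
            "sample_rate": sample_rate,
            "data": base64.b64encode(pcm).decode("ascii"),
        }
    )


def decode_audio_message(raw):
    """Return (sample_rate, float samples) from an audio message."""
    msg = json.loads(raw)
    if msg.get("type") != "audio":
        raise ValueError("not an audio message")
    pcm = base64.b64decode(msg["data"])
    count = len(pcm) // 2  # two bytes per PCM16 sample
    ints = struct.unpack(f"<{count}h", pcm[: count * 2])
    return msg["sample_rate"], [i / 32767 for i in ints]
```

For example, `decode_audio_message(encode_audio_message([0.0, 1.0, -1.0], 24000))` returns the rate together with the same three sample values; out-of-range inputs are clipped before packing.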
Scripts:

- `scripts/benchmark_backends.py`: compares inference times across backends
- `scripts/download_f5_tts.py`: downloads F5-TTS v1 model files into `models/F5TTS_v1_Base/`
- `scripts/benchmark_compile.py`: `torch.compile` benchmarking utility

Notes:

- The engine is built lazily on the first `/ws` connection in `_create_engine()` inside `api/server.py`. Changing `TTS_BACKEND` requires a full server restart, not just a message-level config change.
- Synthesis is serialized by `_synth_lock`. Concurrent sessions share a single inference thread; this exists to avoid CUDA contention and OOM on multi-GPU setups.
- `.env` is gitignored, but `.env.example` is the source of truth for supported variables. `config.py` line 50 sets `env_file = ".env"`.
- Synthesis runs through `_sync_synthesize` (`server.py:291`), which means that a test which modifies global asyncio state can break other tests; run tests independently or set `asyncio_mode=auto`.
- In `_handle_text` (`server.py:152–157`), a space is automatically inserted between consecutive payloads if neither side has whitespace at the join point. This prevents word merging when clients send text word by word without trailing spaces (e.g. the browser client). Clients should not include leading or trailing spaces in payloads; the server handles spacing.
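The spacing rule for consecutive text payloads can be sketched as a small pure function. This is an illustration of the stated rule only, not the code in `_handle_text`; the `append_payload` name is hypothetical, and the real handler may treat cases such as punctuation differently.

```python
def append_payload(buffer: str, payload: str) -> str:
    """Append `payload` to `buffer`, adding one space at the join
    only when neither side already has whitespace there."""
    if (
        buffer
        and payload
        and not buffer[-1].isspace()
        and not payload[0].isspace()
    ):
        return buffer + " " + payload
    return buffer + payload


# Word-by-word input, as a browser client might send it.
text = ""
for word in ["Hello", "from", "the", "client"]:
    text = append_payload(text, word)
```

After the loop, `text` holds the four words separated by single spaces, even though no payload carried any whitespace.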