MOSS TTS: The Local Text-to-Speech Engine Behind VidNo
MOSS (Modular Open Speech Synthesis) is the text-to-speech engine that powers VidNo's voice cloning and narration pipeline. It runs entirely on your local GPU, generates speech that sounds like you, and processes narration at 2-5x real-time speed. No cloud services, no API calls, no data leaving your machine.
What MOSS Is
MOSS is a neural TTS system designed specifically for developer workflows. Unlike general-purpose TTS engines optimized for assistant-style speech (short, conversational phrases), MOSS is optimized for:
- Long-form narration: Tutorial narration runs 5-20 minutes. MOSS handles extended generation without quality degradation or repetition artifacts that plague some TTS models on long sequences.
- Technical vocabulary: Function names, framework terms, error messages, CLI commands. MOSS is trained on developer content and correctly pronounces `kubectl`, `nginx`, `sudo`, and hundreds of other technical terms that generic TTS systems mangle.
- Consistent pacing: Code tutorials require a measured, explanatory pace. MOSS generates speech with consistent cadence suitable for educational content, avoiding the overly expressive prosody that sounds wrong for technical material.
- Voice cloning: Train on 20-30 minutes of your voice and MOSS generates narration that sounds like you, not a stock AI voice.
How It Works
The MOSS pipeline has three stages:
- Text processing: The input script is broken into sentences, with special handling for code elements (inline code, function names, file paths). Technical terms are mapped to their phonetic representations using a developer-specific pronunciation dictionary.
- Acoustic model: A neural network converts the processed text into a mel spectrogram -- a visual representation of the audio's frequency content over time. This is where the voice character lives: your cloned voice model produces spectrograms that match your vocal characteristics.
- Vocoder: A second neural network (the vocoder) converts the mel spectrogram into actual audio waveforms. This is what produces the final high-fidelity audio file.
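The three stages above can be sketched as a chain of function calls. This is a minimal illustration, not MOSS's implementation: the real acoustic model and vocoder are neural networks, and the pronunciation dictionary here is a three-entry made-up excerpt.

```python
# Sketch of the MOSS pipeline: text processing -> acoustic model -> vocoder.
# All function bodies are placeholders standing in for neural components.

PRONUNCIATIONS = {  # tiny, illustrative slice of a developer pronunciation dictionary
    "kubectl": "koob-control",
    "nginx": "engine-x",
    "sudo": "soo-doo",
}

def process_text(script: str) -> list[str]:
    """Stage 1: split into sentences and map technical terms to phonetics."""
    sentences = [s.strip() for s in script.split(".") if s.strip()]
    return [
        " ".join(PRONUNCIATIONS.get(word, word) for word in s.split())
        for s in sentences
    ]

def acoustic_model(sentence: str) -> list[float]:
    """Stage 2: stand-in for the network that emits a mel spectrogram.

    A real model outputs a 2-D array of frequency energies over time;
    here each word just becomes one fake 'frame'."""
    return [float(len(word)) for word in sentence.split()]

def vocoder(mel: list[float]) -> bytes:
    """Stage 3: stand-in for the network that renders the waveform."""
    return bytes(int(frame) % 256 for frame in mel)  # fake audio samples

def synthesize(script: str) -> list[bytes]:
    """Run the full pipeline, one audio clip per sentence."""
    return [vocoder(acoustic_model(s)) for s in process_text(script)]
```

The key design point the sketch preserves: voice identity lives entirely in stage 2, so swapping in a different fine-tuned acoustic model changes the voice without touching text processing or the vocoder.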
Voice Cloning Process
Training a voice clone with MOSS:
- Record samples: Record 20-30 minutes of yourself reading the supplied reference text. Use clean audio from a decent microphone in a quiet room.
- Fine-tune the model: MOSS fine-tunes its base model on your voice data. This adapts the acoustic model to match your vocal characteristics -- pitch, timbre, speaking rhythm, and accent.
- Save the model: The fine-tuned model is saved as a local file (typically 200-500MB). This is your personal voice model.
- Generate: Point MOSS at your voice model and provide any text. It generates audio in your voice.
Training takes 15-30 minutes on a mid-range GPU. You do this once, and the model works indefinitely.
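The clone-once, generate-forever flow can be summarized in a short sketch. Everything here is a placeholder: `fine_tune` and `generate` stand in for MOSS's actual gradient-based training and synthesis, and the dictionary "model" stands in for the 200-500MB file on disk.

```python
# Hypothetical sketch of the voice-cloning workflow: train once, reuse forever.

def fine_tune(base_model: dict, sample_minutes: float) -> dict:
    """Step 2: adapt the base acoustic model to one speaker (placeholder)."""
    if sample_minutes < 20:
        raise ValueError("MOSS expects roughly 20-30 minutes of clean audio")
    return {**base_model, "speaker_adapted": True}

def generate(voice_model: dict, text: str) -> str:
    """Step 4: reuse the saved voice model for any script."""
    assert voice_model.get("speaker_adapted"), "train the voice model first"
    return f"audio<{text}>"  # stands in for a rendered waveform

voice = fine_tune({"name": "moss-base"}, sample_minutes=25)  # one-time step
clip = generate(voice, "Welcome to the tutorial.")           # repeatable step
```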
Quality Benchmarks
MOSS performance on standard TTS evaluation metrics:
- Mean Opinion Score (MOS): 4.1/5.0 for voice-cloned output (human speech baseline is ~4.5, cloud services average 4.2-4.4)
- Word Error Rate: Less than 1% on technical developer vocabulary (compared to 3-5% for generic TTS systems on the same content)
- Speaker similarity: 0.87 cosine similarity between cloned and original voice (above 0.85 is generally indistinguishable in blind tests for non-expert listeners)
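The speaker-similarity figure is a cosine similarity between embedding vectors, which is straightforward to compute. The embeddings below are made up for illustration; real speaker embeddings come from a speaker-verification network and have hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 3-D "embeddings" for a cloned and an original voice.
cloned = [0.9, 0.3, 0.3]
original = [1.0, 0.2, 0.25]
similarity = cosine_similarity(cloned, original)
```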
GPU Requirements
| GPU Tier | VRAM | Training Time | Generation Speed |
|---|---|---|---|
| Minimum (RTX 3060) | 6GB | ~30 min | ~2x real-time |
| Recommended (RTX 4070) | 12GB | ~15 min | ~4x real-time |
| Optimal (RTX 4090) | 24GB | ~8 min | ~8x real-time |
"Real-time" means: a 10-minute narration at 4x real-time generates in 2.5 minutes.
MOSS runs on NVIDIA GPUs only due to CUDA dependency. AMD and Intel GPU support is not available.
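The real-time multiplier is just a division, which makes it easy to estimate wall-clock time for any narration length against the table above. A minimal helper (the function name is my own, not part of MOSS):

```python
def generation_minutes(narration_minutes: float, speed_factor: float) -> float:
    """Wall-clock minutes to synthesize narration at speed_factor x real-time."""
    return narration_minutes / speed_factor
```

For example, a 10-minute narration takes about 5 minutes on the RTX 3060 tier (2x) and about 1.25 minutes on the RTX 4090 tier (8x).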
Compared to Cloud TTS
| Feature | MOSS (Local) | Cloud TTS Services |
|---|---|---|
| Privacy | Complete -- nothing leaves your machine | Voice data and text sent to cloud |
| Cost per generation | Electricity only (~$0.01-0.05 per video) | $0.50-5.00 per video (varies by length) |
| Quality (technical content) | Excellent -- trained for developer vocabulary | Good to excellent, varies by service |
| Latency | 2-5 min for 10 min narration | 30 sec - 2 min (faster for short content) |
| Voice cloning | Full local training | Upload voice samples to cloud |
| Offline capability | Fully offline after model training | Requires internet |
| Ongoing cost | None (one-time GPU investment) | Per-character or per-minute pricing |
Integration With VidNo
MOSS is integrated directly into VidNo's pipeline. When you run `bash make-video.sh recording.mp4`, the script generation step produces text, and MOSS immediately converts that text to narration using your voice model. The audio is then synchronized with the video during the rendering pipeline step. No manual intervention needed.
For developers who want to use MOSS independently outside of VidNo, it is available as a standalone CLI tool: `moss generate --voice your-model.pt --text script.txt --output narration.wav`.
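When scripting the standalone CLI from Python, building the argument list explicitly (rather than a shell string) avoids quoting problems with paths that contain spaces. The flags below come from the invocation above; the wrapper function itself is a hypothetical convenience, and it only assembles the command for `subprocess.run` rather than invoking `moss` itself.

```python
def moss_command(voice_model: str, script: str, output: str) -> list[str]:
    """Build the argv list for the standalone MOSS CLI (does not run it)."""
    return [
        "moss", "generate",
        "--voice", voice_model,
        "--text", script,
        "--output", output,
    ]

# Usage: subprocess.run(moss_command("your-model.pt", "script.txt", "narration.wav"))
```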