VidNo uses MOSS TTS, a local text-to-speech model that runs entirely on your GPU, to clone your voice and generate narration for your videos. The process starts with a 60-second voice sample and produces a personal voice model that captures not just how you sound, but how you speak.
During setup, you run vidno voice-setup and read a provided paragraph out loud. The paragraph is specifically designed to capture a wide range of phonemes, intonation patterns, and speech rhythms. Speak naturally — the model works best when it learns your actual conversational tone rather than a stiff "recording voice."
MOSS TTS analyzes your sample to extract several layers of vocal characteristics. The obvious ones are pitch, timbre, and tone — what makes your voice sound like you rather than someone else. But it also captures speaking rhythm: where you tend to pause, how you handle emphasis, your pace when explaining something versus when you are running through setup steps. This rhythm layer is what separates VidNo's voice output from generic TTS that sounds robotic even when it mimics your timbre.
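To make the layers concrete, the sketch below shows the general shape of this kind of analysis using nothing but NumPy: mean pitch estimated per frame by autocorrelation, and a crude "pause ratio" derived from frame energy as a stand-in for the rhythm layer. This is a toy illustration of the idea, not MOSS TTS's actual feature extractor, whose internals VidNo does not expose; the function names and thresholds are assumptions.

```python
import numpy as np

def estimate_pitch(frame, sr):
    """Estimate the fundamental frequency of one frame via autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search lags corresponding to 60-400 Hz, a typical speech F0 range.
    lo, hi = int(sr / 400), int(sr / 60)
    lag = lo + np.argmax(corr[lo:hi])
    return sr / lag

def analyze_sample(audio, sr, frame_ms=40):
    """Return mean pitch and pause ratio for a mono float signal."""
    n = int(sr * frame_ms / 1000)
    frames = [audio[i:i + n] for i in range(0, len(audio) - n, n)]
    energies = np.array([float(np.mean(f ** 2)) for f in frames])
    voiced = energies > 0.1 * energies.max()  # crude energy gate
    pitches = [estimate_pitch(f, sr) for f, v in zip(frames, voiced) if v]
    return {
        "mean_pitch_hz": float(np.mean(pitches)),
        "pause_ratio": float(1 - voiced.mean()),  # fraction of silent frames
    }
```

A real extractor also models timbre (spectral envelope) and emphasis, but pitch plus pause statistics are enough to see why a 60-second sample carries usable rhythm information.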
The resulting voice model is a set of weights stored locally at ~/.vidno/voices/. The model file is typically 200–400 MB, depending on your GPU's precision settings. It is never uploaded anywhere.
When VidNo generates narration for a video, it feeds the script text through your voice model on your GPU. The synthesis happens in chunks — sentence by sentence — so VidNo can apply appropriate pauses between sections and adjust pacing based on the content type. Code explanations get slightly slower pacing; transitional phrases move quicker.
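The chunk-and-pace flow might look roughly like this sketch. The sentence splitter, the classify() heuristic, and the PACING multipliers are all hypothetical stand-ins; the text above only says that code explanations get slightly slower pacing and transitional phrases move quicker.

```python
import re

# Hypothetical speed multipliers; VidNo's real values are internal.
PACING = {"code": 0.85, "transition": 1.15, "default": 1.0}

TRANSITIONS = ("next", "now", "so", "with that done", "moving on")

def classify(sentence):
    """Crudely tag a sentence so synthesis can adjust its pace."""
    if "`" in sentence or re.search(r"\(\)|\bdef\b|\bclass\b", sentence):
        return "code"
    if sentence.lower().startswith(TRANSITIONS):
        return "transition"
    return "default"

def plan_narration(script, synthesize):
    """Split a script into sentences, synthesize each chunk at a
    content-appropriate speed, and interleave pauses (in seconds)."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    audio = []
    for s in sentences:
        kind = classify(s)
        audio.append(synthesize(s, speed=PACING[kind]))
        audio.append(0.35 if kind == "code" else 0.25)  # inter-sentence pause
    return audio
```

Working sentence by sentence is also what keeps GPU memory use bounded: each chunk is synthesized and released before the next begins.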
Voice quality depends on three factors: GPU capability (more VRAM allows higher-fidelity synthesis), sample quality (quiet room, clear speech, no background noise), and script quality (natural-sounding sentences produce more natural-sounding speech). On an RTX 4070 or above with a clean voice sample, most listeners cannot distinguish the AI narration from a real human recording.