Voice & Audio

How does voice cloning work in VidNo?

VidNo uses MOSS TTS, a local text-to-speech model that runs entirely on your GPU, to clone your voice and generate narration for your videos. The process starts with a 60-second voice sample and produces a personal voice model that captures not just how you sound, but how you speak.

During setup, you run vidno voice-setup and read a provided paragraph out loud. The paragraph is specifically designed to capture a wide range of phonemes, intonation patterns, and speech rhythms. Speak naturally — the model works best when it learns your actual conversational tone rather than a stiff "recording voice."
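The FAQ doesn't describe how (or whether) VidNo validates the recording, but a rough pre-flight check on a 60-second WAV sample is easy to sketch. The function name and thresholds below are illustrative assumptions, not VidNo's actual checks:

```python
import wave

def check_voice_sample(path, min_seconds=60):
    """Rough pre-flight check on a recorded voice sample.
    Verifies duration and that the audio is mono; the thresholds
    are illustrative, not VidNo's actual requirements."""
    with wave.open(path, "rb") as wf:
        duration = wf.getnframes() / wf.getframerate()
        channels = wf.getnchannels()
    problems = []
    if duration < min_seconds:
        problems.append(f"sample is {duration:.1f}s, need at least {min_seconds}s")
    if channels != 1:
        problems.append(f"expected mono audio, got {channels} channels")
    return problems  # empty list means the sample looks usable
```

A check like this catches the two most common setup mistakes, a cut-off recording and a stereo source, before any model training starts.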

MOSS TTS analyzes your sample to extract several layers of vocal characteristics. The obvious ones are pitch, timbre, and tone — what makes your voice sound like you rather than someone else. But it also captures speaking rhythm: where you tend to pause, how you handle emphasis, your pace when explaining something versus when you are running through setup steps. This rhythm layer is what separates VidNo's voice output from generic TTS that sounds robotic even when it mimics your timbre.

The resulting voice model is a set of weights stored locally on your machine at ~/.vidno/voices/. The model file is typically 200–400 MB depending on your GPU's precision settings. It never gets uploaded anywhere.
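Because the weights are ordinary files, you can inspect them yourself. Here is a small sketch that lists each model and its size in MB; the helper is illustrative, not part of the VidNo CLI:

```python
from pathlib import Path

def list_voice_models(voices_dir="~/.vidno/voices"):
    """List voice model files and their sizes in MB.
    Illustrative helper, not part of the VidNo CLI."""
    root = Path(voices_dir).expanduser()
    if not root.exists():
        return []
    return [(p.name, p.stat().st_size / 1_000_000)
            for p in sorted(root.iterdir()) if p.is_file()]

for name, size_mb in list_voice_models():
    print(f"{name}: {size_mb:.0f} MB")
```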

When VidNo generates narration for a video, it feeds the script text through your voice model on your GPU. The synthesis happens in chunks — sentence by sentence — so VidNo can apply appropriate pauses between sections and adjust pacing based on the content type. Code explanations get slightly slower pacing; transitional phrases move quicker.
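The chunking idea can be sketched as follows. The pacing multipliers and the heuristic for spotting code-heavy sentences are assumptions for illustration, not VidNo's actual logic:

```python
import re

def chunk_script(script):
    """Split a narration script into sentence chunks, each tagged with a
    pacing multiplier. The multipliers and the code-detection heuristic
    are illustrative guesses, not VidNo's actual values."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", script.strip())
                 if s.strip()]
    chunks = []
    for s in sentences:
        # Sentences mentioning identifiers or calls get slower pacing.
        looks_like_code = bool(re.search(r"[_()`]|--|\w+\.\w+\(", s))
        pace = 0.9 if looks_like_code else 1.0
        chunks.append({"text": s, "pace": pace})
    return chunks
```

Each chunk would then be synthesized independently, with a short pause inserted between them, which is what lets per-sentence pacing adjustments work at all.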

Voice quality depends on three factors: GPU capability (more VRAM allows higher-fidelity synthesis), sample quality (quiet room, clear speech, no background noise), and script quality (natural-sounding sentences produce more natural-sounding speech). On an RTX 4070 or above with a clean voice sample, most listeners cannot distinguish the AI narration from a real human recording.

Related Questions

How good is the AI voice quality?
Can I use my own voice or do I have to use AI?
Can I create multiple voice profiles?

Learn More

/blog/voice-cloning-guide
/blog/moss-tts-explained

Ready to try VidNo?

One command turns your screen recordings into scripted, narrated, edited YouTube videos — thumbnailed, uploaded, and published. You code, VidNo handles everything else.

Get Started Free