VidNo's voice cloning takes a 60-second sample of your voice and produces a model that narrates all your future videos. The process runs locally on your GPU. Your voice data stays on your machine.
What You Need
- A quiet room (closet works great, your car is fine too)
- Any microphone -- laptop mic works, USB mic is better
- 60 seconds of natural speech
- An NVIDIA GPU with 12+ GB VRAM
You do not need a professional microphone or a treated studio. The model extracts vocal characteristics (pitch, cadence, emphasis patterns) rather than recording quality. A $30 USB mic in a quiet room produces a voice clone indistinguishable from a $300 mic in a booth.
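If you are not sure how much VRAM your GPU has, `nvidia-smi` can tell you. The helper below is a sketch, not a VidNo command: `vram_ok` just compares a MiB value against the 12 GB requirement, and the actual `nvidia-smi` query is shown in a comment.

```shell
# vram_ok: succeed if the given total VRAM (in MiB) meets the 12 GB requirement.
# 12 GB = 12288 MiB.
vram_ok() {
  [ "$1" -ge 12288 ]
}

# Query the first GPU's total memory with nvidia-smi, then check it:
#   total=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n1)
#   vram_ok "$total" && echo "GPU OK" || echo "Need 12+ GB VRAM"
```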
Step 1: Record Your Sample
VidNo includes a built-in recording utility:
vidno voice record
This opens a simple recorder that captures 60 seconds of audio. Talk naturally. Do not read a script -- just explain something technical you know well. The model learns better from natural speech patterns than from read-aloud text.
Good sample content:
- Explain how a tool you use daily works
- Walk through a recent debugging session from memory
- Describe your development setup and why you chose each tool
Bad sample content:
- Reading a blog post aloud (too monotone, unnatural cadence)
- Reciting a script (loses your natural speech rhythm)
- Speaking in a different register than you normally use (the model learns whatever you give it)
Step 2: Train the Model
vidno voice train
# Output:
# Processing sample... ████████████████ 100%
# Extracting vocal features...
# Training voice model...
#
# Voice profile saved: ~/.vidno/voices/default.bin
# Training time: 45 seconds
# Quality score: 94/100
Training takes 30-90 seconds depending on your GPU. The quality score reflects how well the model captured your vocal characteristics. Anything above 80 produces natural-sounding output. Below 70, re-record in a quieter environment.
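If you script your setup, you can parse the quality score out of the training output and act on it. `score_of` and `verdict` below are illustrative helpers, not VidNo commands; they assume the `Quality score: NN/100` line shown above, and the thresholds mirror the guidance in this section (the 70-79 band is a judgment call).

```shell
# score_of: extract NN from a "Quality score: NN/100" line on stdin.
score_of() {
  sed -n 's|.*Quality score: \([0-9][0-9]*\)/100.*|\1|p'
}

# verdict: map a score to a recommendation (thresholds from the text above).
verdict() {
  if   [ "$1" -ge 80 ]; then echo "keep"
  elif [ "$1" -ge 70 ]; then echo "usable; consider re-recording"
  else                       echo "re-record in a quieter room"
  fi
}

# Usage:
#   score=$(vidno voice train | score_of)
#   verdict "$score"
```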
Step 3: Test It
vidno voice test "This is a test of my cloned voice.
Let me explain how React hooks work under the hood."
This generates a short audio clip. Listen for:
- Pitch accuracy: Does it sound like you, or a robotic version of you?
- Cadence: Does it pause and emphasize like you do?
- Technical terms: Does it pronounce framework names, language features, and acronyms correctly?
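A small battery of test phrases, one per check above, makes the listen-through systematic. The phrases below are only examples (substitute names you actually say in your videos); each one is fed to the documented `vidno voice test` command.

```shell
# Three example phrases: pacing, tool names, acronyms (one line each).
phrases='This longer sentence checks pacing, pauses, and natural emphasis.
Kubernetes, PostgreSQL, and nginx cover tricky framework and tool names.
Acronyms like API, JSON, and SQL should sound the way you actually say them.'

# Run each phrase through the documented test command:
#   printf '%s\n' "$phrases" | while IFS= read -r p; do
#     vidno voice test "$p"
#   done
```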
Multiple Voice Profiles
You can create multiple voice profiles for different contexts:
# Create a named profile
vidno voice record --name tutorial-voice
vidno voice train --name tutorial-voice
# Use a specific profile
vidno process recording.mp4 --voice tutorial-voice
# List all profiles
vidno voice list
This is useful for teams where multiple developers create content, or if you want different tones for different content types (casual for shorts, more measured for long tutorials).
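Once you settle on a naming convention for your recordings, profile selection can be scripted. The `short-` filename prefix below is purely an illustrative convention, not a VidNo feature; the `--voice` flag is the one documented above.

```shell
# profile_for: map a recording's filename to a voice profile.
# The "short-" prefix convention is an example, not a VidNo feature.
profile_for() {
  case "$1" in
    short-*) echo "casual-voice" ;;
    *)       echo "tutorial-voice" ;;
  esac
}

# Batch-process each video with its matching profile:
#   for f in *.mp4; do
#     vidno process "$f" --voice "$(profile_for "$f")"
#   done
```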
Improving Voice Quality
If your first attempt does not sound right:
- Background noise: Record in a quieter space. The model can handle some noise, but silence between words is where it picks up room characteristics.
- Speaking style: Talk like you would on a video call, not like you are giving a keynote. Conversational delivery clones better.
- Sample length: While 60 seconds is the minimum, you can provide up to 5 minutes of audio. More data means better results, especially for unusual vocal patterns.
- Multiple samples: You can train on multiple recordings. Run vidno voice record several times, then vidno voice train --all.
Privacy and Data
Voice cloning runs entirely on your local GPU. The voice model file (~/.vidno/voices/*.bin) is a mathematical representation of your vocal characteristics, not an audio recording. It cannot be reverse-engineered into your original voice sample.
No voice data is sent to any server. This is fundamentally different from cloud-based voice cloning services, which store your voice on their infrastructure. With VidNo, your voice model lives on your disk and nowhere else.
For the technical details on local vs cloud voice processing, see local vs cloud processing.