# Two AI Systems, One Pipeline
Automated video scripting and voiceover are often discussed together, but they are fundamentally different problems solved by different models. The scripting system needs to understand code, context, and pedagogy. The voiceover system needs to produce natural-sounding speech from text. Connecting them well is what separates a useful pipeline from a gimmick.
## The Scripting Pipeline
Script generation from a screen recording requires three inputs:
- OCR transcript -- what appeared on screen, timestamped
- Git diff data -- what code actually changed during the session (if available)
- Recording metadata -- duration, resolution, detected applications
These inputs feed a language model with a carefully structured prompt. The prompt matters enormously. A bad prompt produces scripts that read like documentation. A good prompt produces scripts that sound like a knowledgeable developer explaining something to a colleague.
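The assembly step can be sketched in a few lines. This is an illustrative shape, not VidNo's actual code -- the field names (`ocr_lines`, `diff_text`, `meta`) are mine:

```python
def build_script_prompt(ocr_lines, diff_text, meta):
    """Combine the three inputs -- OCR transcript, git diff, and
    recording metadata -- into a single structured prompt."""
    # ocr_lines: list of (timestamp, text) pairs from the OCR pass
    transcript = "\n".join(f"[{t}] {text}" for t, text in ocr_lines)
    return (
        "You are writing a voiceover script for a coding tutorial.\n\n"
        f"Recording: {meta['duration_s']}s at {meta['resolution']}\n\n"
        f"On-screen transcript:\n{transcript}\n\n"
        f"Code changes during the session:\n{diff_text or '(none available)'}\n"
    )
```

The rules block described below would be appended to this prompt before it is sent to the language model.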
### Prompt Engineering for Video Scripts
Key constraints that improve output quality:
You are writing a voiceover script for a coding tutorial.
Rules:
- Use contractions (it's, we'll, don't)
- Maximum 15 words per sentence
- Refer to "we" not "I" or "you"
- After explaining a concept, add [PAUSE 1.5s]
- Never say "as you can see" -- the viewer can see
- Include timestamp markers matching the OCR data
- Target 150 words per minute of video
The 150 WPM target is critical. YouTube tutorials narrated faster than 170 WPM see measurably higher drop-off rates; slower than 130 WPM and viewers get bored. The script length also needs to match the video duration after editing.
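The pacing math is simple enough to encode directly. A quick sketch (the function names and the 130-170 WPM window come from the figures above; nothing else is assumed):

```python
def word_budget(video_seconds, wpm=150):
    """Target script word count for an edited video of this length."""
    return round(video_seconds / 60 * wpm)

def pace_ok(word_count, video_seconds, low=130, high=170):
    """True if the narration pace stays inside the retention sweet spot."""
    wpm = word_count / (video_seconds / 60)
    return low <= wpm <= high
```

A ten-minute video gets a budget of roughly 1,500 words; a draft that comes back at 1,800 fails the pace check and goes back for trimming.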
## The Voiceover Pipeline
Once the script exists, the voiceover system takes over. Modern TTS for developer content typically uses one of these approaches:
| Approach | Latency | Quality | Privacy |
|---|---|---|---|
| Cloud TTS (ElevenLabs, Play.ht) | Fast | Excellent | Low -- audio sent to servers |
| Local XTTS v2 | Moderate | Very Good | High -- runs on your GPU |
| Local Piper TTS | Very Fast | Good | High -- CPU only |
| Local F5-TTS | Moderate | Excellent | High -- requires NVIDIA GPU |
For developers recording proprietary code, the privacy column matters most. Sending your screen recording content to a cloud API means your codebase is visible to a third party. Local TTS eliminates that risk entirely.
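As a concrete local option, Piper exposes a simple CLI: a voice model file, an output path, and text on stdin. A minimal sketch, assuming Piper is installed and a voice model has been downloaded -- the model filename here is just an example:

```python
import subprocess

def piper_command(model_path, out_path):
    """Build the Piper CLI invocation (text is piped in via stdin)."""
    return ["piper", "--model", model_path, "--output_file", out_path]

def synthesize_local(text, model_path="en_US-lessac-medium.onnx",
                     out_path="narration.wav"):
    """Synthesize speech entirely on-device; nothing leaves the machine."""
    subprocess.run(piper_command(model_path, out_path),
                   input=text.encode("utf-8"), check=True)
    return out_path
```

Because the whole exchange happens over local pipes, the script text -- and therefore anything the OCR pass lifted from your screen -- never touches a network socket.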
### Voice Cloning vs. Stock Voices
Stock voices are generic. They work, but they sound like every other AI-narrated video on YouTube. Voice cloning trains a model on your own speech patterns -- typically 10-30 minutes of clean audio -- and produces output that sounds like you. Viewers who know your voice from live streams or conference talks will recognize it.
The training process is straightforward:
1. Record yourself reading diverse text for 15-30 minutes in a quiet room
2. Clean the audio -- remove background noise, normalize levels
3. Train the voice model (takes 30-60 minutes on a modern GPU)
4. The model file is typically 50-200MB and runs locally from that point forward
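Before kicking off a 30-60 minute training run, it's worth sanity-checking the recording. A small sketch using only the standard library; the 15-30 minute window comes from the steps above, and the function name is mine:

```python
import wave

def check_training_audio(path, min_minutes=15, max_minutes=30):
    """Verify a WAV recording is long enough for voice-model training."""
    with wave.open(path, "rb") as w:
        minutes = w.getnframes() / w.getframerate() / 60
    return min_minutes <= minutes <= max_minutes
```

This only checks duration; noise removal and level normalization from step 2 still need a proper audio tool.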
## Synchronizing Script to Screen
The hardest part of running two pipelines is synchronization. The voiceover needs to match what is happening on screen. If the narration says "now we install the dependencies" but the terminal shows a completely different command, the video feels broken.
The solution is timestamp-aware script generation. The script includes markers tied to specific moments in the recording:
[00:00:05] "We start by creating a new project directory."
[00:00:12] "Next, we initialize the package.json file."
[00:00:20] [PAUSE 2s] -- wait for npm init to complete
[00:00:25] "With the project initialized, we can add our dependencies."
VidNo combines the two pipelines -- the Claude API for script generation and a local TTS engine for voiceover. The synchronization layer uses the OCR timestamps to align narration with on-screen activity, producing output where the voice and the visuals stay in lockstep throughout the video.