I get asked "how does it actually work?" more than any other question about AI video pipelines. So here is the technical walkthrough -- every stage of the pipeline, what happens at each step, and what the data flow looks like from raw .mp4 to published YouTube video.
Stage 1: Ingestion
Input: a screen recording file (typically MP4, H.264 encoded, 1920x1080 or higher).
The pipeline first extracts metadata: duration, frame rate, resolution, audio codec. It then creates two parallel work streams:
- Visual stream: Frames are extracted at 1 FPS for OCR analysis. Full-rate frames are preserved for final rendering.
- Audio stream: The audio track is separated for any speech detection or noise analysis. For developer recordings, this is often minimal -- keyboard sounds, occasional muttering.
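Both streams can be produced with stock ffmpeg invocations. A minimal sketch of the command builders (assumes `ffmpeg`/`ffprobe` on PATH; the file names are placeholders, not the pipeline's actual paths):

```python
def ffprobe_cmd(path):
    # Probe container metadata as JSON: duration, frame rate,
    # resolution, audio codec
    return ["ffprobe", "-v", "quiet", "-print_format", "json",
            "-show_format", "-show_streams", path]

def extract_frames_cmd(path, out_dir, fps=1):
    # Sample one frame per second for OCR; the full-rate source
    # file is kept untouched for final rendering
    return ["ffmpeg", "-i", path, "-vf", f"fps={fps}",
            f"{out_dir}/frame_%04d.png"]

def extract_audio_cmd(path, out_wav):
    # Demux the audio track to WAV for speech/noise analysis
    return ["ffmpeg", "-i", path, "-vn", "-acodec", "pcm_s16le", out_wav]

# Run any of them with subprocess, e.g.:
# subprocess.run(extract_frames_cmd("recording.mp4", "frames"), check=True)
```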
Stage 2: Content Analysis
This is the stage that differentiates a smart pipeline from a dumb one.
OCR Processing
Each extracted frame (1 per second) is processed through OCR. The output is a time-indexed stream of text content:
Frame 001 (0:00): "import React from 'react'..."
Frame 002 (0:01): "import React from 'react'..." [no change]
Frame 003 (0:02): "import React from 'react'..." [no change]
Frame 015 (0:14): "function UserProfile({ userId })..." [new content]
Frame 016 (0:15): "function UserProfile({ userId }) {..." [typing]
Change detection identifies when new content appears vs. when the screen is static. This produces a "content activity map" -- a timeline showing where meaningful changes happen.
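The change detector can be sketched as a similarity comparison between consecutive OCR strings; here is a minimal version using Python's difflib (the 0.9 threshold is an illustrative assumption):

```python
from difflib import SequenceMatcher

def activity_map(ocr_frames, threshold=0.9):
    """Label each 1 FPS frame as 'change' or 'static' by comparing
    its OCR text to the previous frame's.

    ocr_frames: list of (timestamp_seconds, ocr_text) tuples."""
    events = []
    prev_text = ""
    for ts, text in ocr_frames:
        similarity = SequenceMatcher(None, prev_text, text).ratio()
        events.append((ts, "static" if similarity >= threshold else "change"))
        prev_text = text
    return events
```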
Git Diff Integration
If git commit data is available (VidNo can watch for commits during recording), the diffs provide ground truth about what changed. OCR catches what is visible; git diffs catch what changed in the file system, including files not on screen.
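The diff side reduces to "which files changed" with a small parser over unified-diff text (assuming the diffs are pulled with something like `git log -p` for the recording window; this is a sketch, not the pipeline's parser):

```python
import re

def changed_files(diff_text):
    """Extract the set of file paths touched by a unified diff
    (the `git diff` / `git log -p` output format)."""
    paths = set()
    for m in re.finditer(r"^diff --git a/(\S+) b/(\S+)", diff_text, re.M):
        paths.add(m.group(2))  # the b/ side is the post-change path
    return paths
```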
Event Classification
The combined OCR and diff data feeds an event classifier that labels each time segment:
- CODING: active code writing
- DEBUGGING: error messages visible, followed by code changes
- TESTING: test runner output visible
- CONFIGURING: config files being edited
- IDLE: no meaningful changes
- RESULT: UI output, build success, deployment confirmation
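A toy rule-based version of such a classifier might look like the following; the keyword rules and input signals are illustrative assumptions, not the production logic:

```python
def classify_segment(ocr_text, has_diff, is_static):
    """Label one time segment from its OCR text, whether a git diff
    landed during it, and whether the screen was static."""
    text = ocr_text.lower()
    if is_static and not has_diff:
        return "IDLE"
    if "error" in text or "traceback" in text:
        return "DEBUGGING"
    if "passed" in text or "failed" in text or "test" in text:
        return "TESTING"
    if any(ext in text for ext in (".json", ".yaml", ".toml", ".env")):
        return "CONFIGURING"
    if "build succeeded" in text or "deployed" in text:
        return "RESULT"
    return "CODING"
```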
Stage 3: Script Generation
The event timeline and code diffs are sent to Claude API. The prompt is structured:
{
  "events": [...],          // classified time segments
  "diffs": [...],           // git diffs per commit
  "ocr_highlights": [...]   // key text changes detected
}
Claude generates a narration script segmented by event, with timing markers. The script is technically specific -- it names functions, describes patterns, explains reasoning.
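A sketch of how the payload above might be assembled into a single prompt string. The instruction wording here is an assumption; the commented call shows the general shape of a request with the anthropic Python SDK:

```python
import json

def build_script_prompt(events, diffs, ocr_highlights):
    """Wrap the structured analysis payload in a narration request."""
    payload = json.dumps({
        "events": events,
        "diffs": diffs,
        "ocr_highlights": ocr_highlights,
    }, indent=2)
    return (
        "Write a narration script for a coding screencast. Produce one "
        "segment per event with start/end timing markers, naming the "
        "functions and patterns involved and explaining the reasoning.\n\n"
        + payload
    )

# Sent to the Claude API roughly like:
# client.messages.create(model=..., max_tokens=4096,
#                        messages=[{"role": "user", "content": prompt}])
```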
Stage 4: Voice Synthesis
The script is fed to a local TTS engine using a cloned voice model. Output: WAV audio with word-level timestamps. This runs entirely on-device -- no cloud TTS service.
Stage 5: Editing and Composition
FFmpeg receives instructions from the content analysis:
- Cut segments classified as IDLE
- Speed-ramp CONFIGURING segments to 2x
- Maintain 1x speed for CODING, DEBUGGING, RESULT segments
- Insert narration audio aligned to remaining segments
- Add transition effects between major section boundaries
- Render chapter markers based on event boundaries
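The cut-and-speed-ramp instructions map naturally onto ffmpeg's trim and setpts filters. A sketch that emits one filter chain per kept segment, using the labels and 2x factor from the list above (joining the chains with a concat filter is omitted):

```python
def segment_filters(segments):
    """Build per-segment ffmpeg video filter chains.

    segments: list of (start_s, end_s, label). IDLE segments are cut
    entirely; CONFIGURING plays at 2x; everything else at 1x."""
    parts = []
    for i, (start, end, label) in enumerate(segments):
        if label == "IDLE":
            continue  # dropped from the timeline
        speed = 2.0 if label == "CONFIGURING" else 1.0
        # trim the source, rebase timestamps, then scale them for speed
        parts.append(
            f"[0:v]trim={start}:{end},setpts=(PTS-STARTPTS)/{speed}[v{i}]"
        )
    return parts
```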
Stage 6: Output Generation
From the edited timeline, the pipeline produces:
- Full video (H.264, AAC audio, 1080p)
- Thumbnail (PNG, 1280x720, content-aware composition)
- YouTube Shorts (H.264, 1080x1920, burned-in captions)
- Metadata file (title, description, tags, chapters in YouTube API format)
Stage 7: Upload
YouTube Data API v3 receives the video file, thumbnail, and metadata in a single resumable upload. The video is set to public or scheduled based on configuration. Shorts are uploaded as separate videos with the #Shorts hashtag.
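A sketch of the `videos.insert` request body and the surrounding calls with google-api-python-client. Variable names and the category ID are placeholders; note that the API expresses "scheduled" as private plus a `publishAt` timestamp:

```python
def upload_body(title, description, tags, publish_at=None):
    """Build the YouTube Data API v3 videos.insert request body.

    publish_at (RFC 3339 string) switches the video from immediately
    public to scheduled, which the API models as private + publishAt."""
    status = {"privacyStatus": "public"}
    if publish_at:
        status = {"privacyStatus": "private", "publishAt": publish_at}
    return {
        "snippet": {"title": title, "description": description,
                    "tags": tags, "categoryId": "28"},  # Science & Technology
        "status": status,
    }

# With google-api-python-client, the resumable upload looks roughly like:
# media = MediaFileUpload("video.mp4", chunksize=-1, resumable=True)
# request = youtube.videos().insert(part="snippet,status",
#                                   body=body, media_body=media)
# response = request.execute()
# youtube.thumbnails().set(videoId=response["id"],
#                          media_body="thumbnail.png").execute()
```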
Performance Optimization
The pipeline is CPU- and GPU-bound at different stages. OCR is CPU-heavy, voice synthesis is GPU-heavy, and FFmpeg encoding uses both. On a machine with a modern GPU (RTX 3060 or better) and 8+ CPU cores, the stages overlap well: OCR for one segment runs on the CPU while the previous segment's voice synthesis runs on the GPU. The result is roughly linear scaling -- a 40-minute recording takes about 5-8 minutes end to end, and an 80-minute recording takes 10-15, with the exact time depending on GPU speed for voice synthesis and FFmpeg encoding. On a machine without a discrete GPU, voice synthesis falls back to CPU and the total time roughly doubles.
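The CPU/GPU overlap can be expressed as a two-worker pipeline. A sketch, assuming each stage is a per-segment function whose real work (subprocess calls, GPU inference) releases the GIL:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(segments, ocr_cpu, tts_gpu):
    """While segment i sits in GPU voice synthesis, segment i+1
    runs CPU OCR on the other worker."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        tts_future = None
        for seg in segments:
            ocr_future = pool.submit(ocr_cpu, seg)   # CPU-bound stage
            if tts_future is not None:
                results.append(tts_future.result())  # drain previous GPU stage
            tts_future = pool.submit(tts_gpu, ocr_future.result())
        if tts_future is not None:
            results.append(tts_future.result())
    return results
```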