I get asked "how does it actually work?" more than any other question about AI video pipelines. So here is the technical walkthrough -- every stage of the pipeline, what happens at each step, and what the data flow looks like from raw .mp4 to published YouTube video.
Stage 1: Ingestion
Input: a screen recording file (typically MP4, H.264 encoded, 1920x1080 or higher).
The pipeline first extracts metadata: duration, frame rate, resolution, audio codec. It then creates two parallel work streams:
- Visual stream: Frames are extracted at 1 FPS for OCR analysis. Full-rate frames are preserved for final rendering.
- Audio stream: The audio track is separated for any speech detection or noise analysis. For developer recordings, this is often minimal -- keyboard sounds, occasional muttering.
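Both streams can be produced with stock ffmpeg invocations. A minimal sketch of the command builders (assumes `ffmpeg`/`ffprobe` on PATH; the file names are placeholders, not the pipeline's actual paths):

```python
def ffprobe_cmd(path):
    # Probe container metadata as JSON: duration, frame rate,
    # resolution, audio codec
    return ["ffprobe", "-v", "quiet", "-print_format", "json",
            "-show_format", "-show_streams", path]

def extract_frames_cmd(path, out_dir, fps=1):
    # Sample one frame per second for OCR; the full-rate source
    # file is kept untouched for final rendering
    return ["ffmpeg", "-i", path, "-vf", f"fps={fps}",
            f"{out_dir}/frame_%04d.png"]

def extract_audio_cmd(path, out_wav):
    # Demux the audio track to WAV for speech/noise analysis
    return ["ffmpeg", "-i", path, "-vn", "-acodec", "pcm_s16le", out_wav]

# Run any of them with subprocess, e.g.:
# subprocess.run(extract_frames_cmd("recording.mp4", "frames"), check=True)
```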
Stage 2: Content Analysis
This is the stage that differentiates a smart pipeline from a dumb one.
OCR Processing
Each extracted frame (1 per second) is processed through OCR. The output is a time-indexed stream of text content:
Frame 001 (0:00): "import React from 'react'..."
Frame 002 (0:01): "import React from 'react'..." [no change]
Frame 003 (0:02): "import React from 'react'..." [no change]
Frame 015 (0:14): "function UserProfile({ userId })..." [new content]
Frame 016 (0:15): "function UserProfile({ userId }) {..." [typing]
Change detection identifies when new content appears vs. when the screen is static. This produces a "content activity map" -- a timeline showing where meaningful changes happen.
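The change detector can be sketched as a similarity comparison between consecutive OCR strings; here is a minimal version using Python's difflib (the 0.9 threshold is an illustrative assumption):

```python
from difflib import SequenceMatcher

def activity_map(ocr_frames, threshold=0.9):
    """Label each 1 FPS frame as 'change' or 'static' by comparing
    its OCR text to the previous frame's.

    ocr_frames: list of (timestamp_seconds, ocr_text) tuples."""
    events = []
    prev_text = ""
    for ts, text in ocr_frames:
        similarity = SequenceMatcher(None, prev_text, text).ratio()
        events.append((ts, "static" if similarity >= threshold else "change"))
        prev_text = text
    return events
```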
Git Diff Integration
If git commit data is available (VidNo can watch for commits during recording), the diffs provide ground truth about what changed. OCR catches what is visible; git diffs catch what changed in the file system, including files not on screen.
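The diff side reduces to "which files changed" with a small parser over unified-diff text (assuming the diffs are pulled with something like `git log -p` for the recording window; this is a sketch, not the pipeline's parser):

```python
import re

def changed_files(diff_text):
    """Extract the set of file paths touched by a unified diff
    (the `git diff` / `git log -p` output format)."""
    paths = set()
    for m in re.finditer(r"^diff --git a/(\S+) b/(\S+)", diff_text, re.M):
        paths.add(m.group(2))  # the b/ side is the post-change path
    return paths
```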
Event Classification
The combined OCR and diff data feeds an event classifier that labels each time segment:
- CODING: active code writing
- DEBUGGING: error messages visible, followed by code changes
- TESTING: test runner output visible
- CONFIGURING: config files being edited
- IDLE: no meaningful changes
- RESULT: UI output, build success, deployment confirmation
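A toy rule-based version of such a classifier might look like the following; the keyword rules and input signals are illustrative assumptions, not the production logic:

```python
def classify_segment(ocr_text, has_diff, is_static):
    """Label one time segment from its OCR text, whether a git diff
    landed during it, and whether the screen was static."""
    text = ocr_text.lower()
    if is_static and not has_diff:
        return "IDLE"
    if "error" in text or "traceback" in text:
        return "DEBUGGING"
    if "passed" in text or "failed" in text or "test" in text:
        return "TESTING"
    if any(ext in text for ext in (".json", ".yaml", ".toml", ".env")):
        return "CONFIGURING"
    if "build succeeded" in text or "deployed" in text:
        return "RESULT"
    return "CODING"
```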
Stage 3: Script Generation
The event timeline and code diffs are sent to Claude API. The prompt is structured:
{
  "events": [...],          // classified time segments
  "diffs": [...],           // git diffs per commit
  "ocr_highlights": [...]   // key text changes detected
}
Claude generates a narration script segmented by event, with timing markers. The script is technically specific -- it names functions, describes patterns, explains reasoning.
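A sketch of how the payload above might be assembled into a single prompt string. The instruction wording here is an assumption; the commented call shows the general shape of a request with the anthropic Python SDK:

```python
import json

def build_script_prompt(events, diffs, ocr_highlights):
    """Wrap the structured analysis payload in a narration request."""
    payload = json.dumps({
        "events": events,
        "diffs": diffs,
        "ocr_highlights": ocr_highlights,
    }, indent=2)
    return (
        "Write a narration script for a coding screencast. Produce one "
        "segment per event with start/end timing markers, naming the "
        "functions and patterns involved and explaining the reasoning.\n\n"
        + payload
    )

# Sent to the Claude API roughly like:
# client.messages.create(model=..., max_tokens=4096,
#                        messages=[{"role": "user", "content": prompt}])
```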
Stage 4: Voice Synthesis
The script is fed to a local TTS engine using a cloned voice model. Output: WAV audio with word-level timestamps. This runs entirely on-device -- no cloud TTS service.
Stage 5: Editing and Composition
FFmpeg receives instructions from the content analysis:
- Cut segments classified as IDLE
- Speed-ramp CONFIGURING segments to 2x
- Maintain 1x speed for CODING, DEBUGGING, RESULT segments
- Insert narration audio aligned to remaining segments
- Add transition effects between major section boundaries
- Render chapter markers based on event boundaries
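The cut-and-speed-ramp instructions map naturally onto ffmpeg's trim and setpts filters. A sketch that emits one filter chain per kept segment, using the labels and 2x factor from the list above (joining the chains with a concat filter is omitted):

```python
def segment_filters(segments):
    """Build per-segment ffmpeg video filter chains.

    segments: list of (start_s, end_s, label). IDLE segments are cut
    entirely; CONFIGURING plays at 2x; everything else at 1x."""
    parts = []
    for i, (start, end, label) in enumerate(segments):
        if label == "IDLE":
            continue  # dropped from the timeline
        speed = 2.0 if label == "CONFIGURING" else 1.0
        # trim the source, rebase timestamps, then scale them for speed
        parts.append(
            f"[0:v]trim={start}:{end},setpts=(PTS-STARTPTS)/{speed}[v{i}]"
        )
    return parts
```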
Stage 6: Output Generation
From the edited timeline, the pipeline produces:
- Full video (H.264, AAC audio, 1080p)
- Thumbnail (PNG, 1280x720, content-aware composition)
- YouTube Shorts (H.264, 1080x1920, burned-in captions)
- Metadata file (title, description, tags, chapters in YouTube API format)
Stage 7: Upload
YouTube Data API v3 receives the video file, thumbnail, and metadata in a single resumable upload. The video is set to public or scheduled based on configuration. Shorts are uploaded as separate videos with the #Shorts hashtag.
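A sketch of the `videos.insert` request body and the surrounding calls with google-api-python-client. Variable names and the category ID are placeholders; note that the API expresses "scheduled" as private plus a `publishAt` timestamp:

```python
def upload_body(title, description, tags, publish_at=None):
    """Build the YouTube Data API v3 videos.insert request body.

    publish_at (RFC 3339 string) switches the video from immediately
    public to scheduled, which the API models as private + publishAt."""
    status = {"privacyStatus": "public"}
    if publish_at:
        status = {"privacyStatus": "private", "publishAt": publish_at}
    return {
        "snippet": {"title": title, "description": description,
                    "tags": tags, "categoryId": "28"},  # Science & Technology
        "status": status,
    }

# With google-api-python-client, the resumable upload looks roughly like:
# media = MediaFileUpload("video.mp4", chunksize=-1, resumable=True)
# request = youtube.videos().insert(part="snippet,status",
#                                   body=body, media_body=media)
# response = request.execute()
# youtube.thumbnails().set(videoId=response["id"],
#                          media_body="thumbnail.png").execute()
```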
Performance Optimization
The pipeline is CPU- and GPU-bound at different stages. OCR is CPU-heavy, voice synthesis is GPU-heavy, and FFmpeg encoding uses both. On a machine with a modern GPU (RTX 3060 or better) and 8+ CPU cores, the stages overlap well: OCR for one segment runs on the CPU while the previous segment's voice synthesis runs on the GPU. The result is roughly linear scaling -- a 40-minute recording takes about 5-8 minutes end to end, and an 80-minute recording takes 10-15, with the exact time depending on GPU speed for voice synthesis and FFmpeg encoding. On a machine without a discrete GPU, voice synthesis falls back to CPU and the total time roughly doubles.
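The CPU/GPU overlap can be expressed as a two-worker pipeline. A sketch, assuming each stage is a per-segment function whose real work (subprocess calls, GPU inference) releases the GIL:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(segments, ocr_cpu, tts_gpu):
    """While segment i sits in GPU voice synthesis, segment i+1
    runs CPU OCR on the other worker."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        tts_future = None
        for seg in segments:
            ocr_future = pool.submit(ocr_cpu, seg)   # CPU-bound stage
            if tts_future is not None:
                results.append(tts_future.result())  # drain previous GPU stage
            tts_future = pool.submit(tts_gpu, ocr_future.result())
        if tts_future is not None:
            results.append(tts_future.result())
    return results
```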