Self-Hosting VidNo: The Complete Architecture Guide

VidNo runs entirely on your local machine. No cloud services process your video or audio. The only external calls are to the Claude API for script generation and the YouTube Data API for uploading the finished video. Here is the complete architecture: every component, how they connect, and what each one does.

Architecture Overview


Recording.mp4
    |
    v
[Frame Sampler] ──> Sampled frames (PNG)
    |
    v
[Region Detector] ──> Labeled screen regions
    |
    v
[OCR Engine] ──> Extracted text per region
    |                    |
    v                    v
[Git Diff Analyzer]  [AST Parser]
    |                    |
    v                    v
[Context Assembler] ──> Structured context JSON
    |
    v
[Claude API] ──> Generated script (3 versions)
    |
    v
[MOSS TTS] ──> Narration audio (WAV)
    |
    v
[Smart Cut Engine] ──> Edit decision list
    |
    v
[FFmpeg Renderer] ──> 4 output videos (MP4)
    |
    v
[Thumbnail Generator] ──> Thumbnail (PNG)
    |
    v
[YouTube Uploader] ──> Published on YouTube

Component Breakdown

Frame Sampler

  • Input: Raw MP4 recording
  • Output: PNG frames at adaptive intervals
  • Logic: Samples more frequently during active screen changes (typing, terminal output) and less during idle periods. Typical: 1-2 frames per second during activity, 1 frame per 5 seconds during idle.
  • Technology: FFmpeg for frame extraction, custom Python for adaptive timing
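The adaptive timing logic can be sketched as follows. This is an illustrative version, not VidNo's actual code: the `activity` signal, thresholds, and intervals are assumptions.

```python
# Hypothetical sketch of adaptive frame sampling. The activity scores,
# threshold, and intervals below are illustrative assumptions.

def sample_times(activity, active_interval=0.5, idle_interval=5.0, threshold=0.2):
    """Given per-second screen-activity scores in [0, 1], return timestamps
    to sample: every active_interval seconds while activity exceeds the
    threshold (typing, terminal output), every idle_interval otherwise."""
    times, t, duration = [], 0.0, float(len(activity))
    while t < duration:
        times.append(round(t, 2))
        busy = activity[int(t)] > threshold
        t += active_interval if busy else idle_interval
    return times

# Example: a 10-second clip, active in seconds 0-2, idle afterwards.
print(sample_times([0.9, 0.8, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
# → [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 8.0]
```

The dense timestamps during the active seconds and the single sample during the idle stretch mirror the "1-2 frames per second during activity, 1 frame per 5 seconds during idle" behavior described above.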

Region Detector

  • Input: Sampled frames
  • Output: Bounding boxes with labels (editor, terminal, browser, etc.)
  • Technology: YOLO-based detection model fine-tuned on developer screen layouts
  • GPU requirement: Runs on CUDA, ~1GB VRAM

OCR Engine

  • Input: Labeled screen regions
  • Output: Extracted text per region with metadata
  • Technology: Custom model based on PaddleOCR, fine-tuned for monospace fonts and dark themes
  • Details: See terminal detection deep dive
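OCR models emit word boxes, which must be merged back into readable lines per region. A simplified sketch of that merge step (the coordinate format and tolerance are assumptions):

```python
# Illustrative line grouping for OCR word boxes; the real engine's merge
# logic and metadata format are internal to VidNo.

def group_lines(words, y_tol=8):
    """words: (text, x, y) tuples with y the baseline in pixels.
    Groups words whose baselines fall within y_tol into one line,
    then orders each line left to right."""
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(y - lines[-1]["y"]) <= y_tol:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    return [" ".join(t for _, t in sorted(l["words"])) for l in lines]

print(group_lines([("world", 60, 12), ("hello", 10, 10), ("line2", 10, 40)]))
# → ['hello world', 'line2']
```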

Git Diff Analyzer

  • Input: OCR-extracted code at different timestamps
  • Output: Classified diffs with significance scores
  • Technology: libgit2 for diff generation, custom classifier for change categorization
  • Optional: If a git repository is detected, reads actual commits for higher accuracy
  • Details: See git diff to video script
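The OCR fallback path (no git repository detected) amounts to diffing text snapshots and scoring the change. A sketch using Python's stdlib difflib rather than libgit2, with an invented significance heuristic:

```python
import difflib

def classify_diff(before, after):
    """Diff two OCR snapshots of the same file and score significance.
    The weighting (new function defs count triple) is an illustrative
    heuristic, not VidNo's actual classifier."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""))
    added   = [l[1:] for l in diff if l.startswith("+") and not l.startswith("+++")]
    removed = [l[1:] for l in diff if l.startswith("-") and not l.startswith("---")]
    score = sum(3 if l.lstrip().startswith("def ") else 1 for l in added) + len(removed)
    return {"added": added, "removed": removed, "significance": score}

result = classify_diff("x = 1\n", "x = 1\ndef f():\n    return x\n")
print(result["significance"])  # → 4 (one new def + one body line)
```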

AST Parser

  • Input: Extracted code text with language detection
  • Output: Structured code representation (functions, classes, imports, patterns)
  • Technology: Tree-sitter for multi-language parsing
  • Supported languages: JavaScript, TypeScript, Python, Rust, Go, Java, C#, Ruby, PHP, and more via Tree-sitter grammars
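The structured representation looks roughly like this. VidNo uses Tree-sitter for multi-language support; the sketch below uses Python's stdlib ast module instead, purely to stay self-contained:

```python
import ast

def summarize(source):
    """Extract functions, classes, and imports from Python source.
    Stand-in for the Tree-sitter-based parser, illustration only."""
    out = {"functions": [], "classes": [], "imports": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            out["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            out["classes"].append(node.name)
        elif isinstance(node, ast.Import):
            out["imports"] += [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            out["imports"].append(node.module)
    return out

print(summarize("import os\nclass A:\n    def run(self):\n        pass\n"))
# → {'functions': ['run'], 'classes': ['A'], 'imports': ['os']}
```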

Context Assembler

  • Input: OCR results, diff analysis, AST data
  • Output: Structured JSON context document for each segment of the recording
  • Logic: Merges all analysis signals into a coherent timeline of what happened in the coding session
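The merge can be pictured as zipping the per-segment signals into one timeline document. Field names here are invented; the real context schema is internal to VidNo:

```python
import json

def assemble(ocr_segments, diffs, ast_summaries):
    """Merge per-segment OCR text, diff analysis, and AST summaries into
    one timeline JSON document (illustrative field names)."""
    timeline = [
        {
            "start": seg["start"], "end": seg["end"],
            "visible_text": seg["text"],
            "changes": diff, "structure": summary,
        }
        for seg, diff, summary in zip(ocr_segments, diffs, ast_summaries)
    ]
    return json.dumps({"segments": timeline}, indent=2)

doc = assemble(
    [{"start": 0, "end": 12, "text": "def f(): ..."}],
    [{"added": ["def f():"], "significance": 3}],
    [{"functions": ["f"]}],
)
print(doc)
```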

Claude API (external service)

  • Input: Structured context JSON
  • Output: Three scripts (full tutorial, quick recap, highlight reel)
  • What it receives: Code context, diff summaries, and structural information. It does NOT receive your voice data or raw video.
  • API key: You provide your own Anthropic API key
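The request that leaves your machine is text only. A sketch of how it might be assembled (the prompt wording is invented; the actual call would go through the Anthropic SDK using your own key):

```python
# Illustrative request assembly. Only the structured context string is
# included -- no audio, no video frames.

def build_request(context_json, style):
    system = (
        "You write YouTube narration scripts from structured coding-session "
        "context. Never invent code that is not in the context."
    )
    return {
        "system": system,
        "messages": [{
            "role": "user",
            "content": f"Write a {style} script for this session:\n{context_json}",
        }],
    }

req = build_request('{"segments": []}', "quick recap")
print(req["messages"][0]["content"].splitlines()[0])
# → Write a quick recap script for this session:
```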

MOSS TTS

  • Input: Generated script text + your voice model
  • Output: WAV audio files
  • Technology: Local neural TTS, fully offline after model training
  • GPU requirement: 4-8GB VRAM for inference
  • Details: See MOSS TTS explained

Smart Cut Engine

  • Input: Original video, OCR data, audio analysis
  • Output: Edit decision list (keep/cut/compress for each segment)
  • Technology: Custom scoring algorithm using screen activity, code significance, and audio classification
  • Details: See Smart Cut algorithm
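The scoring-to-decision step can be sketched like this. The weights and thresholds are illustrative assumptions, not the shipped algorithm:

```python
def decide(segments, keep_at=0.6, compress_at=0.3):
    """segments: dicts with activity / code-significance / speech scores
    in [0, 1]. Blends them into one score, then maps score ranges to
    keep / compress / cut actions (illustrative weights)."""
    edl = []
    for seg in segments:
        score = 0.4 * seg["activity"] + 0.4 * seg["code"] + 0.2 * seg["speech"]
        action = ("keep" if score >= keep_at
                  else "compress" if score >= compress_at
                  else "cut")
        edl.append({"start": seg["start"], "end": seg["end"], "action": action})
    return edl

edl = decide([
    {"start": 0,  "end": 10, "activity": 0.9, "code": 0.8, "speech": 0.7},
    {"start": 10, "end": 40, "activity": 0.5, "code": 0.4, "speech": 0.2},
    {"start": 40, "end": 60, "activity": 0.0, "code": 0.0, "speech": 0.0},
])
print([e["action"] for e in edl])  # → ['keep', 'compress', 'cut']
```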

FFmpeg Renderer

  • Input: Edit decision list, narration audio, original video
  • Output: Four MP4 files (full tutorial, recap, highlight reel, YouTube Short)
  • Technology: FFmpeg with optional NVENC GPU encoding
  • Details: See rendering pipeline
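Turning an edit decision list into an FFmpeg invocation might look like the sketch below, which keeps only "keep" segments via FFmpeg's select/aselect filters. The real renderer also handles compressed segments and narration muxing:

```python
def cut_command(edl, src, dst, use_nvenc=False):
    """Build (but do not run) an ffmpeg command that keeps only the
    'keep' segments of the EDL. Illustrative sketch."""
    keeps = [e for e in edl if e["action"] == "keep"]
    expr = "+".join(f"between(t,{e['start']},{e['end']})" for e in keeps)
    codec = "h264_nvenc" if use_nvenc else "libx264"  # optional NVENC path
    return [
        "ffmpeg", "-i", src,
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        "-c:v", codec, dst,
    ]

cmd = cut_command(
    [{"start": 0, "end": 10, "action": "keep"},
     {"start": 10, "end": 40, "action": "cut"}],
    "in.mp4", "out.mp4",
)
print(cmd[4])  # the video filter string
```

The setpts/asetpts steps regenerate timestamps so the surviving frames play back contiguously after the cuts.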

Thumbnail Generator

  • Input: Rendered video frames, script content, code context
  • Output: PNG thumbnail optimized for YouTube
  • Logic: Selects a visually compelling frame, applies code-focused overlays and readable text
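Frame selection reduces to scoring candidates and taking the best. The features and weights below are invented for illustration; the real selector uses richer signals:

```python
def pick_frame(frames):
    """frames: dicts with simple per-frame features in [0, 1].
    Returns the path of the highest-scoring candidate (toy heuristic)."""
    def score(f):
        return 0.5 * f["contrast"] + 0.3 * f["code_area"] + 0.2 * f["face_free"]
    return max(frames, key=score)["path"]

best = pick_frame([
    {"path": "f1.png", "contrast": 0.9, "code_area": 0.8, "face_free": 1.0},
    {"path": "f2.png", "contrast": 0.2, "code_area": 0.1, "face_free": 1.0},
])
print(best)  # → f1.png
```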

YouTube Uploader

  • Input: Rendered MP4 files, thumbnail, generated script metadata
  • Output: Published YouTube videos with full metadata
  • Technology: YouTube Data API v3
  • What it sends: Video files, thumbnail, title, description, tags, chapter timestamps, scheduling info
  • Authentication: OAuth2 token stored locally at ~/.vidno/youtube-token.json
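The metadata payload follows the YouTube Data API v3 videos.insert resource shape. A sketch of building that body (the resumable media upload and OAuth2 refresh handling are omitted; the category choice is an assumption):

```python
# Body shape follows the YouTube Data API v3 videos.insert resource.
# categoryId "28" is Science & Technology; chosen here for illustration.

def upload_body(title, description, tags, publish_at=None):
    body = {
        "snippet": {
            "title": title[:100],            # API limit: 100 characters
            "description": description[:5000],
            "tags": tags,
            "categoryId": "28",
        },
        # Scheduled videos must be private until publishAt (RFC 3339).
        "status": {"privacyStatus": "private" if publish_at else "public"},
    }
    if publish_at:
        body["status"]["publishAt"] = publish_at
    return body

body = upload_body("Building a CLI in Rust", "Auto-generated by VidNo", ["rust", "cli"])
print(body["status"]["privacyStatus"])  # → public
```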

Data Flow: No Cloud Dependencies

The entire pipeline runs locally, with two exceptions (script generation and the final upload):

  • Frame sampling: local
  • Region detection: local (GPU)
  • OCR: local (GPU)
  • Diff analysis: local (CPU)
  • AST parsing: local (CPU)
  • Script generation: Claude API (sends code context, receives text)
  • Voice synthesis: local (GPU)
  • Video editing: local (CPU)
  • Rendering: local (CPU/GPU)
  • Thumbnail generation: local (CPU)
  • YouTube upload: YouTube Data API (sends rendered video, thumbnail, and metadata)

Your source code, voice model, and raw recordings never leave your machine. The only data sent externally is the structured code context (not raw video or audio) sent to Claude for script generation, and the final rendered videos with metadata sent to YouTube for publishing.

System Requirements

  • OS: Linux (Ubuntu 22.04+ recommended), Windows with WSL2, macOS (CPU-only mode available but significantly slower)
  • GPU: NVIDIA with 8GB+ VRAM recommended (6GB minimum)
  • RAM: 16GB minimum, 32GB recommended for long recordings
  • Storage: 20GB for models and dependencies + space for recordings and output
  • CPU: 4+ cores for FFmpeg rendering

Installation

git clone https://github.com/vidno-ai/vidno
cd vidno
bash setup.sh  # installs dependencies, downloads models
bash train-voice.sh samples/  # train voice model (one-time)
bash make-video.sh recording.mp4  # process and upload to YouTube

Setup takes approximately 30 minutes (mostly downloading models), plus a one-time vidno youtube connect to authenticate with the YouTube API. After that, the pipeline handles everything from recording to published video. The only external dependencies at runtime are the Claude API for script generation and the YouTube API for uploading.

Self-hosting means you control every aspect of the pipeline. No vendor lock-in, no subscription for processing, no source code leaving your network. The only ongoing cost is Claude API usage for script generation, typically $0.10-0.50 per video depending on length.