Self-Hosting VidNo: The Complete Architecture Guide

VidNo runs entirely on your local machine. No cloud services process your video or audio. The only external calls are to the Claude API for script generation and the YouTube Data API for uploading the finished video. Here is the complete architecture: every component, how they connect, and what each one does.

Architecture Overview


Recording.mp4
    |
    v
[Frame Sampler] ──> Sampled frames (PNG)
    |
    v
[Region Detector] ──> Labeled screen regions
    |
    v
[OCR Engine] ──> Extracted text per region
    |                    |
    v                    v
[Git Diff Analyzer]  [AST Parser]
    |                    |
    v                    v
[Context Assembler] ──> Structured context JSON
    |
    v
[Claude API] ──> Generated script (3 versions)
    |
    v
[MOSS TTS] ──> Narration audio (WAV)
    |
    v
[Smart Cut Engine] ──> Edit decision list
    |
    v
[FFmpeg Renderer] ──> 4 output videos (MP4)
    |
    v
[Thumbnail Generator] ──> Thumbnail (PNG)
    |
    v
[YouTube Uploader] ──> Published on YouTube

Component Breakdown

Frame Sampler

  • Input: Raw MP4 recording
  • Output: PNG frames at adaptive intervals
  • Logic: Samples more frequently during active screen changes (typing, terminal output) and less during idle periods. Typical: 1-2 frames per second during activity, 1 frame per 5 seconds during idle.
  • Technology: FFmpeg for frame extraction, custom Python for adaptive timing
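The adaptive timing logic can be sketched as follows. This is an illustrative version, not VidNo's actual code: the `activity` signal, thresholds, and intervals are assumptions.

```python
# Hypothetical sketch of adaptive frame sampling. The activity scores,
# threshold, and intervals below are illustrative assumptions.

def sample_times(activity, active_interval=0.5, idle_interval=5.0, threshold=0.2):
    """Given per-second screen-activity scores in [0, 1], return timestamps
    to sample: every active_interval seconds while activity exceeds the
    threshold (typing, terminal output), every idle_interval otherwise."""
    times, t, duration = [], 0.0, float(len(activity))
    while t < duration:
        times.append(round(t, 2))
        busy = activity[int(t)] > threshold
        t += active_interval if busy else idle_interval
    return times

# Example: a 10-second clip, active in seconds 0-2, idle afterwards.
print(sample_times([0.9, 0.8, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
# → [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 8.0]
```

The dense timestamps during the active seconds and the single sample during the idle stretch mirror the "1-2 frames per second during activity, 1 frame per 5 seconds during idle" behavior described above.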

Region Detector

  • Input: Sampled frames
  • Output: Bounding boxes with labels (editor, terminal, browser, etc.)
  • Technology: YOLO-based detection model fine-tuned on developer screen layouts
  • GPU requirement: Runs on CUDA, ~1GB VRAM

OCR Engine

  • Input: Labeled screen regions
  • Output: Extracted text per region with metadata
  • Technology: Custom model based on PaddleOCR, fine-tuned for monospace fonts and dark themes
  • Details: See terminal detection deep dive
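OCR models emit word boxes, which must be merged back into readable lines per region. A simplified sketch of that merge step (the coordinate format and tolerance are assumptions):

```python
# Illustrative line grouping for OCR word boxes; the real engine's merge
# logic and metadata format are internal to VidNo.

def group_lines(words, y_tol=8):
    """words: (text, x, y) tuples with y the baseline in pixels.
    Groups words whose baselines fall within y_tol into one line,
    then orders each line left to right."""
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(y - lines[-1]["y"]) <= y_tol:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    return [" ".join(t for _, t in sorted(l["words"])) for l in lines]

print(group_lines([("world", 60, 12), ("hello", 10, 10), ("line2", 10, 40)]))
# → ['hello world', 'line2']
```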

Git Diff Analyzer

  • Input: OCR-extracted code at different timestamps
  • Output: Classified diffs with significance scores
  • Technology: libgit2 for diff generation, custom classifier for change categorization
  • Optional: If a git repository is detected, reads actual commits for higher accuracy
  • Details: See git diff to video script
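The OCR fallback path (no git repository detected) amounts to diffing text snapshots and scoring the change. A sketch using Python's stdlib difflib rather than libgit2, with an invented significance heuristic:

```python
import difflib

def classify_diff(before, after):
    """Diff two OCR snapshots of the same file and score significance.
    The weighting (new function defs count triple) is an illustrative
    heuristic, not VidNo's actual classifier."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm=""))
    added   = [l[1:] for l in diff if l.startswith("+") and not l.startswith("+++")]
    removed = [l[1:] for l in diff if l.startswith("-") and not l.startswith("---")]
    score = sum(3 if l.lstrip().startswith("def ") else 1 for l in added) + len(removed)
    return {"added": added, "removed": removed, "significance": score}

result = classify_diff("x = 1\n", "x = 1\ndef f():\n    return x\n")
print(result["significance"])  # → 4 (one new def + one body line)
```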

AST Parser

  • Input: Extracted code text with language detection
  • Output: Structured code representation (functions, classes, imports, patterns)
  • Technology: Tree-sitter for multi-language parsing
  • Supported languages: JavaScript, TypeScript, Python, Rust, Go, Java, C#, Ruby, PHP, and more via Tree-sitter grammars
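The structured representation looks roughly like this. VidNo uses Tree-sitter for multi-language support; the sketch below uses Python's stdlib ast module instead, purely to stay self-contained:

```python
import ast

def summarize(source):
    """Extract functions, classes, and imports from Python source.
    Stand-in for the Tree-sitter-based parser, illustration only."""
    out = {"functions": [], "classes": [], "imports": []}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            out["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            out["classes"].append(node.name)
        elif isinstance(node, ast.Import):
            out["imports"] += [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            out["imports"].append(node.module)
    return out

print(summarize("import os\nclass A:\n    def run(self):\n        pass\n"))
# → {'functions': ['run'], 'classes': ['A'], 'imports': ['os']}
```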

Context Assembler

  • Input: OCR results, diff analysis, AST data
  • Output: Structured JSON context document for each segment of the recording
  • Logic: Merges all analysis signals into a coherent timeline of what happened in the coding session
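The merge can be pictured as zipping the per-segment signals into one timeline document. Field names here are invented; the real context schema is internal to VidNo:

```python
import json

def assemble(ocr_segments, diffs, ast_summaries):
    """Merge per-segment OCR text, diff analysis, and AST summaries into
    one timeline JSON document (illustrative field names)."""
    timeline = [
        {
            "start": seg["start"], "end": seg["end"],
            "visible_text": seg["text"],
            "changes": diff, "structure": summary,
        }
        for seg, diff, summary in zip(ocr_segments, diffs, ast_summaries)
    ]
    return json.dumps({"segments": timeline}, indent=2)

doc = assemble(
    [{"start": 0, "end": 12, "text": "def f(): ..."}],
    [{"added": ["def f():"], "significance": 3}],
    [{"functions": ["f"]}],
)
print(doc)
```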

Claude API (external service)

  • Input: Structured context JSON
  • Output: Three scripts (full tutorial, quick recap, highlight reel)
  • What it receives: Code context, diff summaries, and structural information. It does NOT receive your voice data or raw video.
  • API key: You provide your own Anthropic API key
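The request that leaves your machine is text only. A sketch of how it might be assembled (the prompt wording is invented; the actual call would go through the Anthropic SDK using your own key):

```python
# Illustrative request assembly. Only the structured context string is
# included -- no audio, no video frames.

def build_request(context_json, style):
    system = (
        "You write YouTube narration scripts from structured coding-session "
        "context. Never invent code that is not in the context."
    )
    return {
        "system": system,
        "messages": [{
            "role": "user",
            "content": f"Write a {style} script for this session:\n{context_json}",
        }],
    }

req = build_request('{"segments": []}', "quick recap")
print(req["messages"][0]["content"].splitlines()[0])
# → Write a quick recap script for this session:
```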

MOSS TTS

  • Input: Generated script text + your voice model
  • Output: WAV audio files
  • Technology: Local neural TTS, fully offline after model training
  • GPU requirement: 4-8GB VRAM for inference
  • Details: See MOSS TTS explained

Smart Cut Engine

  • Input: Original video, OCR data, audio analysis
  • Output: Edit decision list (keep/cut/compress for each segment)
  • Technology: Custom scoring algorithm using screen activity, code significance, and audio classification
  • Details: See Smart Cut algorithm
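The scoring-to-decision step can be sketched like this. The weights and thresholds are illustrative assumptions, not the shipped algorithm:

```python
def decide(segments, keep_at=0.6, compress_at=0.3):
    """segments: dicts with activity / code-significance / speech scores
    in [0, 1]. Blends them into one score, then maps score ranges to
    keep / compress / cut actions (illustrative weights)."""
    edl = []
    for seg in segments:
        score = 0.4 * seg["activity"] + 0.4 * seg["code"] + 0.2 * seg["speech"]
        action = ("keep" if score >= keep_at
                  else "compress" if score >= compress_at
                  else "cut")
        edl.append({"start": seg["start"], "end": seg["end"], "action": action})
    return edl

edl = decide([
    {"start": 0,  "end": 10, "activity": 0.9, "code": 0.8, "speech": 0.7},
    {"start": 10, "end": 40, "activity": 0.5, "code": 0.4, "speech": 0.2},
    {"start": 40, "end": 60, "activity": 0.0, "code": 0.0, "speech": 0.0},
])
print([e["action"] for e in edl])  # → ['keep', 'compress', 'cut']
```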

FFmpeg Renderer

  • Input: Edit decision list, narration audio, original video
  • Output: Four MP4 files (full tutorial, recap, highlight reel, YouTube Short)
  • Technology: FFmpeg with optional NVENC GPU encoding
  • Details: See rendering pipeline
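Turning an edit decision list into an FFmpeg invocation might look like the sketch below, which keeps only "keep" segments via FFmpeg's select/aselect filters. The real renderer also handles compressed segments and narration muxing:

```python
def cut_command(edl, src, dst, use_nvenc=False):
    """Build (but do not run) an ffmpeg command that keeps only the
    'keep' segments of the EDL. Illustrative sketch."""
    keeps = [e for e in edl if e["action"] == "keep"]
    expr = "+".join(f"between(t,{e['start']},{e['end']})" for e in keeps)
    codec = "h264_nvenc" if use_nvenc else "libx264"  # optional NVENC path
    return [
        "ffmpeg", "-i", src,
        "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",
        "-af", f"aselect='{expr}',asetpts=N/SR/TB",
        "-c:v", codec, dst,
    ]

cmd = cut_command(
    [{"start": 0, "end": 10, "action": "keep"},
     {"start": 10, "end": 40, "action": "cut"}],
    "in.mp4", "out.mp4",
)
print(cmd[4])  # the video filter string
```

The setpts/asetpts steps regenerate timestamps so the surviving frames play back contiguously after the cuts.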

Thumbnail Generator

  • Input: Rendered video frames, script content, code context
  • Output: PNG thumbnail optimized for YouTube
  • Logic: Selects a visually compelling frame, applies code-focused overlays and readable text
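Frame selection reduces to scoring candidates and taking the best. The features and weights below are invented for illustration; the real selector uses richer signals:

```python
def pick_frame(frames):
    """frames: dicts with simple per-frame features in [0, 1].
    Returns the path of the highest-scoring candidate (toy heuristic)."""
    def score(f):
        return 0.5 * f["contrast"] + 0.3 * f["code_area"] + 0.2 * f["face_free"]
    return max(frames, key=score)["path"]

best = pick_frame([
    {"path": "f1.png", "contrast": 0.9, "code_area": 0.8, "face_free": 1.0},
    {"path": "f2.png", "contrast": 0.2, "code_area": 0.1, "face_free": 1.0},
])
print(best)  # → f1.png
```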

YouTube Uploader

  • Input: Rendered MP4 files, thumbnail, generated script metadata
  • Output: Published YouTube videos with full metadata
  • Technology: YouTube Data API v3
  • What it sends: Video files, thumbnail, title, description, tags, chapter timestamps, scheduling info
  • Authentication: OAuth2 token stored locally at ~/.vidno/youtube-token.json
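The metadata payload follows the YouTube Data API v3 videos.insert resource shape. A sketch of building that body (the resumable media upload and OAuth2 refresh handling are omitted; the category choice is an assumption):

```python
# Body shape follows the YouTube Data API v3 videos.insert resource.
# categoryId "28" is Science & Technology; chosen here for illustration.

def upload_body(title, description, tags, publish_at=None):
    body = {
        "snippet": {
            "title": title[:100],            # API limit: 100 characters
            "description": description[:5000],
            "tags": tags,
            "categoryId": "28",
        },
        # Scheduled videos must be private until publishAt (RFC 3339).
        "status": {"privacyStatus": "private" if publish_at else "public"},
    }
    if publish_at:
        body["status"]["publishAt"] = publish_at
    return body

body = upload_body("Building a CLI in Rust", "Auto-generated by VidNo", ["rust", "cli"])
print(body["status"]["privacyStatus"])  # → public
```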

Data Flow: No Cloud Dependencies

The entire pipeline runs locally, with two exceptions (script generation and the final upload):

  • Frame sampling: local
  • Region detection: local (GPU)
  • OCR: local (GPU)
  • Diff analysis: local (CPU)
  • AST parsing: local (CPU)
  • Script generation: Claude API (sends code context, receives text)
  • Voice synthesis: local (GPU)
  • Video editing: local (CPU)
  • Rendering: local (CPU/GPU)
  • Thumbnail generation: local (CPU)
  • YouTube upload: YouTube Data API (sends rendered video, thumbnail, and metadata)

Your source code, voice model, and raw recordings never leave your machine. The only data sent externally is the structured code context (not raw video or audio) sent to Claude for script generation, and the final rendered videos with metadata sent to YouTube for publishing.

System Requirements

  • OS: Linux (Ubuntu 22.04+ recommended), Windows with WSL2, macOS (CPU-only mode available but significantly slower)
  • GPU: NVIDIA with 8GB+ VRAM recommended (6GB minimum)
  • RAM: 16GB minimum, 32GB recommended for long recordings
  • Storage: 20GB for models and dependencies + space for recordings and output
  • CPU: 4+ cores for FFmpeg rendering

Installation

git clone https://github.com/vidno-ai/vidno
cd vidno
bash setup.sh  # installs dependencies, downloads models
bash train-voice.sh samples/  # train voice model (one-time)
bash make-video.sh recording.mp4  # process and upload to YouTube

Setup takes approximately 30 minutes (mostly downloading models), plus a one-time vidno youtube connect to authenticate with the YouTube API. After that, the pipeline handles everything from recording to published video. The only external dependencies at runtime are the Claude API for script generation and the YouTube API for uploading.

Self-hosting means you control every aspect of the pipeline. No vendor lock-in, no subscription for processing, no source code leaving your network. The only ongoing cost is Claude API usage for script generation, typically $0.10-0.50 per video depending on length.