This is the complete reference architecture for a YouTube automation stack. Every layer, every integration point, every data flow documented. Bookmark this and reference it when building or debugging your pipeline. No fluff -- just the technical spec.

Layer 0: Infrastructure

The foundation everything runs on. Get this wrong and everything above it is unreliable:

  • Processing machine: Linux (Ubuntu 22.04+ recommended), 8+ CPU cores, 16+ GB RAM, 500 GB SSD minimum
  • Node.js 20 LTS -- Runtime for pipeline orchestration and queue management
  • FFmpeg 6.x -- Video and audio processing engine, compiled with libx264 and libmp3lame support
  • PM2 -- Process manager that keeps background services running and auto-restarts on failure
  • SQLite -- Local database for queue state, metadata storage, and processing history
  • Tesseract 5.x -- OCR engine for frame text extraction
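A minimal PM2 process file for this foundation might look like the sketch below. The service names and script paths are illustrative, not part of any published spec; the point is that PM2 keeps the watcher and workers alive across crashes and reboots.

```javascript
// ecosystem.config.js -- PM2 process definitions (names and paths are
// our own; adjust to your pipeline layout)
module.exports = {
  apps: [
    {
      name: 'watcher',               // Layer 2 trigger: watches /recordings/raw/
      script: './services/watcher.js',
      autorestart: true,             // restart on crash
      max_restarts: 10,
    },
    {
      name: 'pipeline-worker',       // pulls jobs from the SQLite queue
      script: './services/worker.js',
      instances: 2,                  // parallel jobs, bounded by CPU/RAM
      autorestart: true,
    },
  ],
};
```

Start everything with `pm2 start ecosystem.config.js` and persist it with `pm2 save`.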

Layer 1: Capture

Tool: OBS Studio 30.x
Input: Your screen activity
Output: MP4 files saved to /recordings/raw/
Trigger: Manual recording (you press record and stop)
Config: 1080p, 30fps, CRF 18, MP4 container, x264 encoder

Layer 2: Analysis

Tool: Tesseract OCR + custom Node.js frame sampler
Input: MP4 from Layer 1
Output: JSON analysis document containing:
  - Frame-by-frame text content at 1-second intervals
  - Git diff data (if code repository changes detected)
  - Scene change timestamps based on visual diff thresholds
  - Active region coordinates showing where cursor and text changes occur
Trigger: File watcher detects new MP4 in input directory
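The sampler itself is a thin wrapper around two CLI tools. A sketch of the argument builders (function names are ours, not a published API): FFmpeg extracts one frame per second, then Tesseract OCRs each frame to a text file.

```javascript
// Build ffmpeg args that extract one frame per second of video as
// high-quality JPEGs suitable for OCR.
function frameExtractArgs(inputMp4, outDir) {
  return [
    '-i', inputMp4,
    '-vf', 'fps=1',                 // one frame per second
    '-q:v', '2',                    // high JPEG quality for OCR accuracy
    `${outDir}/frame_%05d.jpg`,
  ];
}

// Build tesseract args; it writes <outBase>.txt. Page segmentation
// mode 6 ("assume a uniform block of text") is our choice for screen
// recordings -- tune per content type.
function ocrArgs(framePath, outBase) {
  return [framePath, outBase, '--psm', '6'];
}
```

Spawn these with `child_process.execFile('ffmpeg', frameExtractArgs(...))` and collect the per-frame text into the analysis JSON.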

Layer 3: Scripting

Tool: Claude API (Claude 3.5 Sonnet or a later model)
Input: Analysis JSON from Layer 2
Output: Timestamped narration script containing:
  - One narration segment per detected scene/section
  - Technical terms verified against the on-screen context
  - Hook line optimized for viewer retention in first 5 seconds
  - Section markers for chapter timestamp generation
Trigger: Analysis complete signal from Layer 2
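The request shape follows the Anthropic Messages API (`POST /v1/messages`); the system prompt wording, `max_tokens` value, and model alias below are assumptions you'd tune, not part of this spec.

```javascript
// Assemble the Layer 3 request body from the Layer 2 analysis JSON.
function buildScriptRequest(analysisJson, model = 'claude-3-5-sonnet-latest') {
  return {
    model,
    max_tokens: 4096,
    system:
      'You write timestamped YouTube narration scripts. One segment per ' +
      'detected scene, open with a 5-second retention hook, and verify ' +
      'technical terms against the provided on-screen text.',
    messages: [
      {
        role: 'user',
        content: `Screen analysis JSON:\n${JSON.stringify(analysisJson)}`,
      },
    ],
  };
}
```

POST this body (with your API key header) and parse the returned script into timestamped segments for Layer 4.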

Layer 4: Voice

Tool: ElevenLabs API or local voice clone service
Input: Script text segments with timing metadata
Output: WAV audio files per segment + combined full track
  - Sample rate: 44.1kHz
  - Bit depth: 16-bit
  - Loudness normalized to -16 LUFS
  - Background music mixed with auto-ducking applied
Trigger: Script generation complete from Layer 3
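The -16 LUFS target above maps directly onto FFmpeg's `loudnorm` filter. A sketch of the normalization pass (the TP and LRA values are common defaults we chose, not mandated by the pipeline):

```javascript
// Build ffmpeg args that normalize a narration track to the target
// loudness and the 44.1 kHz / 16-bit output spec.
function loudnormArgs(inWav, outWav, lufs = -16) {
  return [
    '-i', inWav,
    '-af', `loudnorm=I=${lufs}:TP=-1.5:LRA=11`, // integrated loudness target
    '-ar', '44100',                             // 44.1 kHz sample rate
    '-sample_fmt', 's16',                       // 16-bit depth
    outWav,
  ];
}
```

For best accuracy, `loudnorm` also supports a two-pass mode (measure first, then apply the measured values).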

Layer 5: Assembly

Tool: FFmpeg 6.x scripted via Node.js child processes
Input: Original MP4 + voice audio track + script timing metadata
Output: Final production video file
  - Dead air trimmed (spans with no visual change removed)
  - Narration synced to corresponding visual segments
  - Text overlays at moments referenced in the script
  - Background music ducked under narration automatically
  - Encoded: H.264 video, AAC audio, 1080p, target 8Mbps
Trigger: Voice generation complete from Layer 4
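One way to do the mux-plus-ducking step is FFmpeg's `sidechaincompress` filter, which compresses the music whenever the voice track is loud. The threshold and ratio below are our starting values; tune them by ear.

```javascript
// Build ffmpeg args that combine the original video (input 0), the
// narration track (input 1), and background music (input 2), ducking
// the music under the voice.
function assemblyArgs(videoMp4, voiceWav, musicMp3, outMp4) {
  const filter =
    // music is the main input, voice is the sidechain that triggers it
    '[2:a][1:a]sidechaincompress=threshold=0.05:ratio=8[ducked];' +
    // then mix voice and ducked music into one output track
    '[1:a][ducked]amix=inputs=2[aout]';
  return [
    '-i', videoMp4, '-i', voiceWav, '-i', musicMp3,
    '-filter_complex', filter,
    '-map', '0:v', '-map', '[aout]',
    '-c:v', 'libx264', '-b:v', '8M',  // H.264 at the 8 Mbps target
    '-c:a', 'aac',
    outMp4,
  ];
}
```

Dead-air trimming and text overlays are separate filter passes (`trim`/`select` and `drawtext`), typically run before this final mux.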

Layer 6: Metadata

Tool: Claude API + keyword research database
Input: Script content + niche-specific keyword volume data
Output: Complete metadata bundle:
  - Title optimized for search CTR (under 60 characters)
  - Description with summary paragraph and timestamps
  - Tags: 15-20 relevant search terms
  - YouTube category ID for the content vertical
  - Thumbnail: 3 variants generated from key frames with text
Trigger: Runs in parallel with Layer 5 for efficiency
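Because a model generates this bundle, it pays to validate it before Layer 7 touches the upload API. A small guard sketch enforcing the constraints listed above (the bundle's field names are our assumption):

```javascript
// Validate a metadata bundle against the Layer 6 constraints:
// title under 60 characters, 15-20 tags, a category ID present.
function validateMetadata(bundle) {
  const errors = [];
  if (!bundle.title || bundle.title.length >= 60) {
    errors.push('title must be under 60 characters');
  }
  if (!Array.isArray(bundle.tags) || bundle.tags.length < 15 || bundle.tags.length > 20) {
    errors.push('expected 15-20 tags');
  }
  if (!bundle.categoryId) {
    errors.push('missing YouTube category ID');
  }
  return { ok: errors.length === 0, errors };
}
```

On failure, re-prompt Layer 6 with the error list rather than letting a bad title reach the upload step.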

Layer 7: Publishing

Tool: YouTube Data API v3 with OAuth 2.0 authentication
Input: Video file + metadata bundle from Layers 5 and 6
Output: Published or scheduled YouTube video
  - Resumable upload protocol for reliability
  - Custom thumbnail set after upload processing completes
  - Publish time scheduled per content calendar
  - Video added to topic-appropriate playlist
  - First comment pinned with supplementary links
Trigger: Both Assembly and Metadata layers complete
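Starting a resumable upload means one JSON POST that opens a session; the API responds with a `Location` header you then stream the file bytes to. The endpoint and `uploadType=resumable` parameter come from the YouTube Data API; the builder function itself is just our sketch of assembling the fetch options.

```javascript
// Build the request that opens a resumable upload session.
function buildSessionInit(metadata, accessToken, fileSizeBytes) {
  return {
    url: 'https://www.googleapis.com/upload/youtube/v3/videos' +
         '?uploadType=resumable&part=snippet,status',
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${accessToken}`,          // OAuth 2.0 token
        'Content-Type': 'application/json; charset=UTF-8',
        'X-Upload-Content-Type': 'video/mp4',
        'X-Upload-Content-Length': String(fileSizeBytes),
      },
      body: JSON.stringify({
        snippet: {
          title: metadata.title,
          description: metadata.description,
          tags: metadata.tags,
          categoryId: metadata.categoryId,
        },
        // upload private, then let publishAt handle the scheduled release
        status: { privacyStatus: 'private', publishAt: metadata.publishAt },
      }),
    },
  };
}
```

If the byte transfer drops, query the session URL with a `Content-Range: bytes */<size>` probe and resume from the last received byte.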

Layer 8: Analytics Feedback Loop

Tool: YouTube Analytics API + custom analysis dashboard
Input: Video performance data pulled 48 hours post-publish
Output: Performance report per video containing:
  - CTR compared to channel rolling average
  - Audience retention curve with drop-off analysis
  - Traffic source breakdown (search vs browse vs suggested)
  - Keyword ranking positions for target search terms
Trigger: Cron job running daily at 9 AM local time
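The CTR comparison is simple arithmetic once the Analytics API data is in hand. A sketch (the window size and return-field names are our choices):

```javascript
// Compare a video's CTR against the channel's rolling average over the
// most recent uploads.
function ctrVsRollingAverage(videoCtr, recentCtrs, window = 10) {
  const recent = recentCtrs.slice(-window);
  const avg = recent.reduce((sum, c) => sum + c, 0) / recent.length;
  return {
    channelAvg: avg,
    delta: videoCtr - avg,     // absolute difference
    relative: videoCtr / avg,  // >1 means above channel average
  };
}
```

Feed `relative` back into topic selection: titles and topics that consistently land above 1 are the ones worth doubling down on.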

The Orchestration Layer

VidNo serves as the orchestration layer connecting all eight layers into a cohesive system. It manages the processing queue, routes data between stages with correct formatting, handles errors with retry and fallback logic, and provides a single monitoring interface for the entire stack. Without an orchestrator, you maintain eight separate tools with custom glue code between each pair -- a maintenance burden that scales poorly as you add capabilities or change components.

Data Flow Summary

Screen Recording (manual)
  |--[OBS]--> MP4 file
  |--[Tesseract]--> Analysis JSON
  |--[Claude API]--> Narration Script
  |--[ElevenLabs]--> Audio WAV track
  |--[FFmpeg]--> Final MP4 video
  |--[Claude API]--> Metadata bundle
  |--[YT Data API]--> Published/Scheduled Video
  |--[YT Analytics API]--> Performance Data
  '--> Performance data feeds back into topic selection

Stop editing. Start shipping.

VidNo turns your coding sessions into YouTube videos — scripted, edited, thumbnailed, and uploaded. Shorts included. One command.

Try VidNo Free