This is the complete reference architecture for a YouTube automation stack. Every layer, every integration point, every data flow documented. Bookmark this and reference it when building or debugging your pipeline. No fluff -- just the technical spec.
Layer 0: Infrastructure
The foundation everything runs on. Get this wrong and everything above it is unreliable:
- Processing machine: Linux (Ubuntu 22.04+ recommended), 8+ CPU cores, 16+ GB RAM, 500 GB SSD minimum
- Node.js 20 LTS -- Runtime for pipeline orchestration and queue management
- FFmpeg 6.x -- Video and audio processing engine, compiled with libx264 and libmp3lame support
- PM2 -- Process manager that keeps background services running and auto-restarts on failure
- SQLite -- Local database for queue state, metadata storage, and processing history
- Tesseract 5.x -- OCR engine for frame text extraction
Layer 1: Capture
Tool: OBS Studio 30.x
Input: Your screen activity
Output: MP4 files saved to /recordings/raw/
Trigger: Manual recording (you press record and stop)
Config: 1080p, 30fps, CRF 18, MP4 container, x264 encoder
Layer 2: Analysis
Tool: Tesseract OCR + custom Node.js frame sampler
Input: MP4 from Layer 1
Output: JSON analysis document containing:
- OCR text content sampled at 1-second intervals
- Git diff data (if code repository changes detected)
- Scene change timestamps based on visual diff thresholds
- Active region coordinates showing where cursor and text changes occur
Trigger: File watcher detects new MP4 in input directory
Layer 3: Scripting
Tool: Claude API (claude-3-5-sonnet or a later model)
Input: Analysis JSON from Layer 2
Output: Timestamped narration script containing:
- One narration segment per detected scene/section
- Technical terms verified against the on-screen context
- Hook line optimized for viewer retention in the first 5 seconds
- Section markers for chapter timestamp generation
Trigger: Analysis complete signal from Layer 2
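The script generation call is a single request to the Anthropic Messages API. A sketch with Node's built-in fetch -- the prompt wording and response handling are illustrative, and `ANTHROPIC_API_KEY` is assumed to be set in the environment:

```javascript
// Layer 3 sketch: send the analysis JSON to the Anthropic Messages API
// and return the generated narration script. Prompt text is illustrative.
async function generateScript(analysis) {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-3-5-sonnet-latest",
      max_tokens: 4096,
      messages: [{ role: "user", content: buildPrompt(analysis) }],
    }),
  });
  if (!res.ok) throw new Error(`Anthropic API error: ${res.status}`);
  const data = await res.json();
  return data.content[0].text;  // the generated narration script
}

function buildPrompt(analysis) {
  // Keep the prompt deterministic: task description first, then the data.
  return [
    "Write a timestamped narration script for this screen recording.",
    "One segment per scene; open with a 5-second hook; mark chapter sections.",
    "Scene analysis JSON:",
    JSON.stringify(analysis),
  ].join("\n");
}
```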
Layer 4: Voice
Tool: ElevenLabs API or local voice clone service
Input: Script text segments with timing metadata
Output: WAV audio files per segment + combined full track
- Sample rate: 44.1 kHz
- Bit depth: 16-bit
- Loudness normalized to -16 LUFS
- Background music mixed with auto-ducking applied
Trigger: Script generation complete from Layer 3
Layer 5: Assembly
Tool: FFmpeg 6.x scripted via Node.js child processes
Input: Original MP4 + voice audio track + script timing metadata
Output: Final production video file
- Dead frames trimmed (frames with no visual change removed)
- Narration synced to corresponding visual segments
- Text overlays at moments referenced in the script
- Background music ducked under narration automatically
- Encoded: H.264 video, AAC audio, 1080p, target 8 Mbps
Trigger: Voice generation complete from Layer 4
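The ducking in this layer maps to FFmpeg's sidechaincompress filter: the narration acts as the sidechain key that pushes the music down. A sketch of the argument builder -- file inputs and compressor settings are illustrative, and the dead-frame trimming and overlay steps are omitted:

```javascript
// Layer 5 sketch: build the FFmpeg command that muxes video, narration,
// and background music, ducking the music under the voice. Compressor
// settings are illustrative starting points.
function assemblyArgs(videoIn, voiceIn, musicIn, outPath) {
  const filter = [
    // Narration is consumed twice (as the ducking key and in the final
    // mix), so split it first.
    "[1:a]asplit=2[duckkey][narration]",
    // Duck the music (main input) whenever the narration key is loud.
    "[2:a][duckkey]sidechaincompress=threshold=0.05:ratio=8:attack=20:release=300[ducked]",
    // Mix narration over the ducked music into one output track.
    "[narration][ducked]amix=inputs=2:duration=first[aout]",
  ].join(";");
  return [
    "-i", videoIn, "-i", voiceIn, "-i", musicIn,
    "-filter_complex", filter,
    "-map", "0:v", "-map", "[aout]",
    "-c:v", "libx264", "-b:v", "8M",  // H.264 at the 8 Mbps target
    "-c:a", "aac",
    outPath,
  ];
}
```

Run the result with `child_process.execFile("ffmpeg", assemblyArgs(...))` from the Node.js orchestrator.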
Layer 6: Metadata
Tool: Claude API + keyword research database
Input: Script content + niche-specific keyword volume data
Output: Complete metadata bundle:
- Title optimized for search CTR (under 60 characters)
- Description with summary paragraph and timestamps
- Tags: 15-20 relevant search terms
- YouTube category ID for the content vertical
- Thumbnail: 3 variants generated from key frames with text
Trigger: Runs in parallel with Layer 5 for efficiency
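Because this output comes from an LLM, validate the bundle against the limits above before it reaches the publisher. A sketch of that check -- the 500-character combined tag limit is YouTube's; the rest mirror this spec:

```javascript
// Layer 6 sketch: validate a generated metadata bundle before publishing.
// Returns a list of human-readable problems (empty means the bundle is OK).
function validateMetadata(meta) {
  const errors = [];
  if (meta.title.length > 60) errors.push("title exceeds 60 characters");
  if (meta.tags.length < 15 || meta.tags.length > 20)
    errors.push("expected 15-20 tags");
  if (meta.tags.join("").length > 500)
    errors.push("combined tags exceed YouTube's 500-character limit");
  if (!/\d{1,2}:\d{2}/.test(meta.description))
    errors.push("description is missing chapter timestamps");
  return errors;
}
```

Failing a bundle here sends it back through Layer 6 rather than letting a 70-character title reach the upload step.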
Layer 7: Publishing
Tool: YouTube Data API v3 with OAuth 2.0 authentication
Input: Video file + metadata bundle from Layers 5 and 6
Output: Published or scheduled YouTube video
- Resumable upload protocol for reliability
- Custom thumbnail set after upload processing completes
- Publish time scheduled per content calendar
- Video added to topic-appropriate playlist
- First comment pinned with supplementary links
Trigger: Both Assembly and Metadata layers complete
Layer 8: Analytics Feedback Loop
Tool: YouTube Analytics API + custom analysis dashboard
Input: Video performance data pulled 48 hours post-publish
Output: Performance report per video containing:
- CTR compared to channel rolling average
- Audience retention curve with drop-off analysis
- Traffic source breakdown (search vs browse vs suggested)
- Keyword ranking positions for target search terms
Trigger: Cron job running daily at 9 AM local time
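The CTR comparison in that report reduces to a small computation once the Analytics API data is in hand. A sketch of just that piece -- the API query itself and the report field names are illustrative:

```javascript
// Layer 8 sketch: compare a video's CTR against the channel's rolling
// average from prior videos. CTRs are fractions (0.05 = 5%).
function ctrReport(videoCtr, recentCtrs) {
  const avg = recentCtrs.reduce((a, b) => a + b, 0) / recentCtrs.length;
  const delta = videoCtr - avg;
  return {
    videoCtr,
    channelAvg: avg,
    delta,
    verdict: delta >= 0 ? "above channel average" : "below channel average",
  };
}
```

Feeding these deltas back into topic selection is what closes the loop: titles and topics that beat the rolling average get more weight in the next content calendar.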
The Orchestration Layer
VidNo serves as the orchestration layer connecting all eight layers into a cohesive system. It manages the processing queue, routes data between stages with correct formatting, handles errors with retry and fallback logic, and provides a single monitoring interface for the entire stack. Without an orchestrator, you maintain eight separate tools with custom glue code between each pair -- a maintenance burden that scales poorly as you add capabilities or change components.
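The core of that routing is a small state machine: most layers chain linearly, Scripting fans out to Voice and Metadata, and Publishing waits for both branches to finish. A minimal sketch of that logic (stage names are illustrative; this is not VidNo's internal code):

```javascript
// Orchestration sketch: given a completed stage, decide which stages run
// next. Fan-out after scripting, fan-in before publish.
function advance(job, completedStage) {
  const done = new Set([...job.done, completedStage]);
  let next;
  if (completedStage === "capture") next = ["analysis"];
  else if (completedStage === "analysis") next = ["scripting"];
  else if (completedStage === "scripting")
    next = ["voice", "metadata"];            // Layers 4 and 6 in parallel
  else if (completedStage === "voice") next = ["assembly"];
  else if (completedStage === "assembly" || completedStage === "metadata")
    // Publish only once BOTH the assembly and metadata branches are done.
    next = done.has("assembly") && done.has("metadata") ? ["publish"] : [];
  else if (completedStage === "publish") next = ["analytics"];
  else next = [];
  return { done: [...done], next };
}
```

In the real stack this state lives in the Layer 0 SQLite database, so a crashed worker can resume from the last completed stage instead of restarting the job.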
Data Flow Summary
Screen Recording (manual)
|--[OBS]--> MP4 file
|--[Tesseract]--> Analysis JSON
|--[Claude API]--> Narration Script
|--[ElevenLabs]--> Audio WAV track
|--[FFmpeg]--> Final MP4 video
|--[Claude API]--> Metadata bundle
|--[YT Data API]--> Published/Scheduled Video
|--[YT Analytics API]--> Performance Data
'--> Performance data feeds back into topic selection