Captions on YouTube Shorts are not optional. Over 70% of Shorts viewers watch with sound off -- swiping through the feed on a bus, in a waiting room, at their desk when they should be working. If your Short has no captions, it has no content for the majority of its viewers. But not all captions are equal. The difference between good captions and bad captions is the difference between a 60% swipe-away rate and a 30% swipe-away rate.
Word-by-Word vs. Sentence-by-Sentence
YouTube auto-generates captions, but they appear as full sentences at the bottom of the screen -- easy to ignore, and hard to read while also watching the video. The caption style that performs best on Shorts is word-by-word highlighting: each word appears or lights up as it is spoken, drawing the eye and creating rhythm.
Think of it like karaoke text. The current word is highlighted (usually in a contrasting color), while the surrounding words are visible but dimmer. This forces the viewer's eye to move with the narration, keeping attention locked on the content. It also compensates for synthesis artifacts in AI-generated speech by letting viewers read along.
Why Timing Accuracy Matters
A caption that appears 200 milliseconds late is annoying. A caption that appears 500 milliseconds late is unwatchable. For word-by-word captions, timing needs to be accurate to within 50 milliseconds per word. At normal speaking speed (150 words per minute), that means a new word highlight every 400 milliseconds, each placed within a 50ms window.
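The budget above works out as a couple of lines of arithmetic (150 wpm is the paragraph's estimate of normal narration speed, not a measured value):

```python
# Per-word timing budget at normal narration speed.
WPM = 150                    # words per minute, from the estimate above
ms_per_word = 60_000 / WPM   # one new word highlight every 400 ms
tolerance_ms = 50            # per-word alignment accuracy target

# The tolerance is only 12.5% of the inter-word interval,
# which is why small, compounding drift becomes visible so quickly.
drift_fraction = tolerance_ms / ms_per_word
```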
Achieving this requires forced alignment: taking the narration audio and the script text, and mapping each word to its exact position in the audio waveform. Off-the-shelf tools for this include:
- Whisper (OpenAI) -- Provides word-level timestamps with good accuracy. Free and runs locally.
- gentle -- Purpose-built forced aligner. More accurate than Whisper for timing but requires separate installation.
- Aeneas -- Python-based forced aligner. Reliable for pre-generated TTS audio where you have both the text and the audio.
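As a sketch of the Whisper route: the openai-whisper package exposes per-word timestamps via `word_timestamps=True` on `transcribe`. The `flatten_word_timestamps` helper and the audio filename here are my own illustration, not part of the library:

```python
def flatten_word_timestamps(result: dict) -> list[tuple[str, float, float]]:
    """Flatten Whisper's segment/word structure into (word, start, end) tuples,
    with start/end in seconds."""
    words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            words.append((w["word"].strip(), w["start"], w["end"]))
    return words

if __name__ == "__main__":
    # Requires the openai-whisper package and a local narration file.
    import whisper
    model = whisper.load_model("base")
    result = model.transcribe("narration.wav", word_timestamps=True)
    print(flatten_word_timestamps(result)[:5])
```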
VidNo uses Whisper for word-level timestamp extraction and then adjusts for synthesis artifacts (voice cloning sometimes introduces micro-pauses between words that natural speech does not have). The adjustment step is what separates "pretty close" timing from "perfectly synced" timing.
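VidNo's exact adjustment step is not described here, but a minimal version of the idea -- closing the sub-perceptual gaps that cloned voices insert between words, so each highlight hands off cleanly to the next -- might look like this heuristic:

```python
def close_micro_pauses(words, max_gap=0.08):
    """Snap tiny synthesis gaps shut.

    words: list of (text, start, end) tuples sorted by start time, in seconds.
    If the silence between a word and the next is shorter than max_gap,
    extend the earlier word's end to meet the next word's start.
    The 80 ms threshold is an assumed tuning value, not a measured one.
    """
    adjusted = []
    for i, (text, start, end) in enumerate(words):
        if i + 1 < len(words):
            next_start = words[i + 1][1]
            gap = next_start - end
            if 0 < gap < max_gap:
                end = next_start  # absorb the micro-pause
        adjusted.append((text, start, end))
    return adjusted
```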
Rendering Captions with FFmpeg
Once you have word-level timestamps, rendering the captions into the video is an FFmpeg operation. The approach uses the ASS subtitle format, which supports styled text with per-word timing:
; Simplified ASS subtitle entries for word-by-word highlighting
; (layer 1 renders above layer 0, so the highlight sits on top of the normal text)
Dialogue: 1,0:00:01.20,0:00:01.60,Highlight,,0,0,0,,{\c&H00FFFF&}Building
Dialogue: 0,0:00:01.20,0:00:01.60,Normal,,0,0,0,,Building
Dialogue: 1,0:00:01.60,0:00:02.10,Highlight,,0,0,0,,{\c&H00FFFF&}a
Dialogue: 0,0:00:01.60,0:00:02.10,Normal,,0,0,0,,a
; ... each word gets two entries: a highlighted copy and a normal copy
The ASS format supports font styling, positioning, color changes, and animation. Word-by-word highlighting is achieved by overlaying a highlighted version of each word on top of the normal text, timed to appear only when that word is being spoken.
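Generating those paired entries from word timestamps is plain string formatting. A sketch (the helper names are mine; the `&H00FFFF&` override is the yellow highlight from the example, and the highlighted copy goes on a higher layer so it renders above the normal copy):

```python
def ass_time(seconds: float) -> str:
    """Format seconds as ASS H:MM:SS.cc (centisecond precision)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360_000)
    m, rem = divmod(rem, 6_000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"

def dialogue_lines(words):
    """Emit two Dialogue entries per (text, start, end) word:
    a highlighted copy on layer 1 over a normal copy on layer 0."""
    lines = []
    for text, start, end in words:
        t0, t1 = ass_time(start), ass_time(end)
        lines.append(f"Dialogue: 1,{t0},{t1},Highlight,,0,0,0,,{{\\c&H00FFFF&}}{text}")
        lines.append(f"Dialogue: 0,{t0},{t1},Normal,,0,0,0,,{text}")
    return lines
```

Concatenated under an `[Events]` section with a matching style block, the output is a complete `.ass` file ready for FFmpeg to burn in.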
Caption Styles That Work for Tech Content
Developer content has unique caption requirements. Technical terms, function names, and code snippets appear in narration and need special handling:
Code terms in monospace
When the narration says "call the validateToken function," the word "validateToken" should appear in a monospace font to signal that it is code, not conversational English. This visual distinction helps viewers follow technical explanations.
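One way to automate the distinction is a small heuristic over each caption word: camelCase or snake_case tokens get an ASS `\fn` font override, everything else stays in the base style. The regexes are a sketch (a real pipeline could instead match against the script's known identifier list), and "JetBrains Mono" is an assumed font choice:

```python
import re

IDENTIFIER = re.compile(r"^[A-Za-z_]\w*$")

def looks_like_code(word: str) -> bool:
    """Heuristic: camelCase or snake_case words are treated as code.
    Assumes words arrive pre-tokenized, without trailing punctuation."""
    if re.search(r"[a-z][A-Z]", word):          # camelCase boundary
        return True
    return "_" in word and bool(IDENTIFIER.match(word))  # snake_case

def style_word(word: str) -> str:
    """Wrap code-like words in a monospace font override; \\r resets
    back to the line's base style afterwards."""
    if looks_like_code(word):
        return r"{\fnJetBrains Mono}" + word + r"{\r}"
    return word
```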
Larger font on mobile
Shorts display at full screen on mobile. Captions need to be large enough to read comfortably -- at least 48px on a 1080x1920 canvas. Tiny captions that work on desktop become unreadable on a phone held at arm's length.
Contrasting background
Code editors are dark. Terminal windows are dark. If your captions are white text on a dark Short, they disappear against dark backgrounds. Add a semi-transparent background stripe behind captions to ensure readability regardless of the underlying video content.
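In ASS this is a style-level setting rather than per-line markup: BorderStyle 3 draws an opaque box behind the text using BackColour, whose leading alpha byte controls transparency (&H80 is roughly 50%). A sketch of such a style block -- font, exact colours, and the trimmed field list are assumptions, not a complete style definition:

```
[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, BackColour, BorderStyle, Outline, Shadow, Alignment
Style: Normal,Arial,48,&H00FFFFFF,&H80000000,3,2,0,2
Style: Highlight,Arial,48,&H0000FFFF,&H80000000,3,2,0,2
```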
Automation at Scale
The word-by-word caption pipeline adds about 15 seconds of processing time per Short. For a batch of 30 Shorts, that is under 8 minutes. The pipeline extracts timing, generates the ASS subtitle file, and burns captions into the video in a single FFmpeg pass. No manual caption editing. No timing adjustments. Just accurate, styled, word-level captions on every Short, every time.
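The burn-in step of that pipeline can be driven by FFmpeg's `ass` video filter, which renders the styled subtitles onto the video stream in one pass while the audio is copied untouched. A minimal batch driver (directory layout, filenames, and the helper are illustrative):

```python
import pathlib
import subprocess

def ffmpeg_burn_cmd(video: str, subs: str, out: str) -> list[str]:
    """Build the single-pass burn-in command: the `ass` filter draws the
    styled captions onto the video; `-c:a copy` leaves the audio untouched."""
    return ["ffmpeg", "-y", "-i", video, "-vf", f"ass={subs}", "-c:a", "copy", out]

if __name__ == "__main__":
    # Assumes each shorts/NAME.mp4 has a matching shorts/NAME.ass subtitle file.
    for mp4 in sorted(pathlib.Path("shorts").glob("*.mp4")):
        out = str(mp4.with_name(mp4.stem + "_captioned.mp4"))
        subprocess.run(ffmpeg_burn_cmd(str(mp4), str(mp4.with_suffix(".ass")), out),
                       check=True)
```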