We Measured Caption Timing Accuracy Across Every Major Tool

Caption timing is the invisible quality metric that separates professional-looking captions from amateur ones. When captions appear 200ms late, viewers feel something is off even if they cannot articulate what is wrong. When they appear 400ms late, it is actively distracting and breaks the viewing experience. We tested seven caption generation tools on the same set of audio clips and measured the average offset between spoken words and displayed captions.

Test Methodology

We used 20 audio clips: 10 clean narration recordings (studio quality, single speaker, no background noise) and 10 challenging recordings (background noise, fast speech, multiple speakers, non-native accents). Each clip was 60 seconds long. We ran every tool on every clip and compared the generated word-level timestamps against manually annotated ground truth timestamps created by a professional transcriptionist.

The metric: mean absolute timing error in milliseconds per word. Lower is better. We also tracked maximum error (worst single word) and percentage of missed words.
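
The metric is straightforward to reproduce. The sketch below assumes both the tool output and the ground-truth annotations are lists of (word, start_seconds) tuples in spoken order; the greedy word matching and the one-second match window are illustrative simplifications, not the exact alignment procedure used in the test.

```python
from statistics import mean

def timing_errors(predicted, truth, match_window=1.0):
    """Score a tool's word-level timestamps against ground truth.

    predicted, truth: lists of (word, start_seconds) tuples in spoken order.
    match_window: how far (seconds) a predicted word may sit from the true
    start and still count as the same word; this window is an illustrative assumption.
    """
    errors_ms = []
    missed = 0
    search_from = 0
    for word, true_start in truth:
        hit = None
        # Greedy scan forward for the next predicted occurrence of this word
        # that falls inside the match window.
        for j in range(search_from, len(predicted)):
            pred_word, pred_start = predicted[j]
            if pred_word.lower() == word.lower() and abs(pred_start - true_start) <= match_window:
                hit = j
                break
        if hit is None:
            missed += 1  # word never showed up near its true position
            continue
        errors_ms.append(abs(predicted[hit][1] - true_start) * 1000.0)
        search_from = hit + 1
    return {
        "mean_error_ms": mean(errors_ms) if errors_ms else None,
        "max_error_ms": max(errors_ms) if errors_ms else None,
        "missed_pct": 100.0 * missed / len(truth) if truth else 0.0,
    }
```

For example, a word predicted at 0.52s against a true start of 0.50s contributes 20ms to the mean.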

Results: Clean Audio

| Tool | Mean Error (ms) | Max Error (ms) | Words Missed |
|------|-----------------|-----------------|--------------|
| Whisper large-v3 (local) | 38 | 180 | 0.2% |
| Whisper medium (local) | 52 | 240 | 0.5% |
| Google Cloud Speech v2 | 31 | 150 | 0.1% |
| AssemblyAI | 35 | 170 | 0.3% |
| Deepgram Nova-2 | 29 | 140 | 0.1% |
| YouTube Auto-captions | 85 | 450 | 1.8% |
| CapCut auto-captions | 62 | 310 | 0.9% |

Results: Challenging Audio

| Tool | Mean Error (ms) | Max Error (ms) | Words Missed |
|------|-----------------|-----------------|--------------|
| Whisper large-v3 (local) | 78 | 380 | 2.1% |
| Whisper medium (local) | 105 | 520 | 3.8% |
| Google Cloud Speech v2 | 64 | 290 | 1.4% |
| AssemblyAI | 71 | 340 | 2.0% |
| Deepgram Nova-2 | 58 | 260 | 1.2% |
| YouTube Auto-captions | 190 | 980 | 6.5% |
| CapCut auto-captions | 142 | 680 | 4.2% |

Key Findings

  • Deepgram Nova-2 had the best timing accuracy across both conditions, with a 29ms average on clean audio that is imperceptible to humans. It also had the fastest processing speed at roughly 50x real-time.
  • Whisper large-v3 was the best free, local option. Its 38ms average error on clean audio is below the human perception threshold, and running locally means no API costs and no data leaving your machine (a minimal usage sketch follows this list).
  • YouTube auto-captions were the worst by a large margin, with a worst-case error of nearly a full second on challenging audio. This alone is reason enough to generate your own captions rather than relying on YouTube's built-in system.
  • The perceptual threshold is around 80ms. Below that, most viewers cannot detect the offset between speech and caption. Above it, the mismatch becomes noticeable and distracting.
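
Getting word-level timestamps out of Whisper locally takes only a few lines. This is a minimal sketch assuming a recent release of the openai-whisper package (older versions lack the word_timestamps option) and an audio file named clip.wav:

```python
# Minimal sketch: word-level timestamps from Whisper running locally.
import whisper

model = whisper.load_model("large-v3")  # or "medium" for roughly 2x the speed
result = model.transcribe("clip.wav", word_timestamps=True)

words = []
for segment in result["segments"]:
    for w in segment.get("words", []):
        # Each entry carries the token text plus start/end offsets in seconds.
        words.append((w["word"].strip(), w["start"], w["end"]))

for word, start, end in words[:10]:
    print(f"{start:7.2f}  {end:7.2f}  {word}")
```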

What Causes Timing Errors

Most timing errors cluster around specific audio events rather than being distributed randomly (a rough analysis sketch follows this list):

  1. Word boundaries in fast speech. When words run together without clear pauses, the model struggles to determine where one ends and the next begins, often splitting the boundary 50-100ms from the actual transition.
  2. Plosive consonants. Hard B, P, and T sounds can confuse onset detection by 50-100ms because the burst of air creates an ambiguous audio event.
  3. Background music transitions. Sudden volume changes in background audio cause the model to misjudge speech timing, sometimes by 200ms or more during the transition.
  4. Sentence-initial words. The first word after a pause is often aligned 30-50ms late because the model waits for enough audio context before committing to a timestamp.
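
To check whether your own clips show the same clustering, a rough sketch like the one below buckets per-word errors by whether the word follows a pause (a proxy for sentence-initial position) or sits inside a fast stretch of speech. The 300ms pause threshold and 3.5 words-per-second cutoff are illustrative assumptions, not values from the test:

```python
from collections import defaultdict

def bucket_errors(truth_words, errors_ms, pause_s=0.3, fast_wps=3.5):
    """Group per-word timing errors by a likely cause.

    truth_words: ground-truth (word, start, end) tuples in spoken order.
    errors_ms: absolute timing error for each matching predicted word.
    pause_s / fast_wps: illustrative thresholds, not values from the test.
    """
    buckets = defaultdict(list)
    for i, ((word, start, end), err) in enumerate(zip(truth_words, errors_ms)):
        prev_end = truth_words[i - 1][2] if i > 0 else None
        if prev_end is None or start - prev_end >= pause_s:
            buckets["after_pause"].append(err)   # sentence-initial words
        elif end > start and 1.0 / (end - start) >= fast_wps:
            buckets["fast_speech"].append(err)   # words shorter than ~286 ms
        else:
            buckets["other"].append(err)
    return {cause: sum(v) / len(v) for cause, v in buckets.items() if v}
```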

Practical Recommendations

For a local pipeline like VidNo, Whisper large-v3 provides the best balance of cost (free), accuracy (38ms on clean audio), and privacy (nothing leaves your machine). If you need better accuracy on noisy audio and are willing to pay per minute, Deepgram's API is the current accuracy leader. For most developer content -- clean narration over a screen recording with minimal background noise -- Whisper medium is sufficient and runs about twice as fast as large-v3, making it the pragmatic choice for high-volume production.
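
Whichever model you pick, the word-level timestamps convert directly into caption files. Here is a minimal sketch of writing an SRT from (word, start, end) tuples; grouping a fixed number of words per cue is an illustrative simplification, since a production pipeline would also break cues on punctuation and long pauses:

```python
def to_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

def write_srt(words, path, words_per_cue=7):
    """Write (word, start, end) tuples to an SRT file, a fixed number of words per cue."""
    with open(path, "w", encoding="utf-8") as f:
        for cue_num, i in enumerate(range(0, len(words), words_per_cue), start=1):
            chunk = words[i:i + words_per_cue]
            start, end = chunk[0][1], chunk[-1][2]
            text = " ".join(w for w, _, _ in chunk)
            f.write(f"{cue_num}\n{to_timestamp(start)} --> {to_timestamp(end)}\n{text}\n\n")
```

This pairs naturally with the words list from the Whisper sketch above, e.g. write_srt(words, "clip.srt").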