We Measured Caption Timing Accuracy Across Every Major Tool

Caption timing is the invisible quality metric that separates professional-looking captions from amateur ones. When captions appear 200ms late, viewers feel something is off even if they cannot articulate what is wrong. When they appear 400ms late, it is actively distracting and breaks the viewing experience. We tested seven caption generation tools on the same set of audio clips and measured the average offset between spoken words and displayed captions.

Test Methodology

We used 20 audio clips: 10 clean narration recordings (studio quality, single speaker, no background noise) and 10 challenging recordings (background noise, fast speech, multiple speakers, non-native accents). Each clip was 60 seconds long. We ran every tool on every clip and compared the generated word-level timestamps against manually annotated ground truth timestamps created by a professional transcriptionist.

The metric: mean absolute timing error in milliseconds per word. Lower is better. We also tracked maximum error (worst single word) and percentage of missed words.
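
The metric is straightforward to reproduce. The sketch below assumes both the tool output and the ground-truth annotations are lists of (word, start_seconds) tuples in spoken order; the greedy word matching and the one-second match window are illustrative simplifications, not the exact alignment procedure used in the test.

```python
from statistics import mean

def timing_errors(predicted, truth, match_window=1.0):
    """Score a tool's word-level timestamps against ground truth.

    predicted, truth: lists of (word, start_seconds) tuples in spoken order.
    match_window: how far (seconds) a predicted word may sit from the true
    start and still count as the same word; this window is an illustrative assumption.
    """
    errors_ms = []
    missed = 0
    search_from = 0
    for word, true_start in truth:
        hit = None
        # Greedy scan forward for the next predicted occurrence of this word
        # that falls inside the match window.
        for j in range(search_from, len(predicted)):
            pred_word, pred_start = predicted[j]
            if pred_word.lower() == word.lower() and abs(pred_start - true_start) <= match_window:
                hit = j
                break
        if hit is None:
            missed += 1  # word never showed up near its true position
            continue
        errors_ms.append(abs(predicted[hit][1] - true_start) * 1000.0)
        search_from = hit + 1
    return {
        "mean_error_ms": mean(errors_ms) if errors_ms else None,
        "max_error_ms": max(errors_ms) if errors_ms else None,
        "missed_pct": 100.0 * missed / len(truth) if truth else 0.0,
    }
```

For example, a word predicted at 0.52s against a true start of 0.50s contributes 20ms to the mean.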

Results: Clean Audio

| Tool | Mean Error (ms) | Max Error (ms) | Words Missed |
|------|-----------------|-----------------|--------------|
| Whisper large-v3 (local) | 38 | 180 | 0.2% |
| Whisper medium (local) | 52 | 240 | 0.5% |
| Google Cloud Speech v2 | 31 | 150 | 0.1% |
| AssemblyAI | 35 | 170 | 0.3% |
| Deepgram Nova-2 | 29 | 140 | 0.1% |
| YouTube Auto-captions | 85 | 450 | 1.8% |
| CapCut auto-captions | 62 | 310 | 0.9% |

Results: Challenging Audio

| Tool | Mean Error (ms) | Max Error (ms) | Words Missed |
|------|-----------------|-----------------|--------------|
| Whisper large-v3 (local) | 78 | 380 | 2.1% |
| Whisper medium (local) | 105 | 520 | 3.8% |
| Google Cloud Speech v2 | 64 | 290 | 1.4% |
| AssemblyAI | 71 | 340 | 2.0% |
| Deepgram Nova-2 | 58 | 260 | 1.2% |
| YouTube Auto-captions | 190 | 980 | 6.5% |
| CapCut auto-captions | 142 | 680 | 4.2% |

Key Findings

  • Deepgram Nova-2 had the best timing accuracy across both conditions, with a 29ms average on clean audio that is imperceptible to humans. It also had the fastest processing speed at roughly 50x real-time.
  • Whisper large-v3 was the best free, local option. Its 38ms average error on clean audio is below the human perception threshold, and running locally means no API costs and no data leaving your machine (a minimal usage sketch follows this list).
  • YouTube auto-captions were the worst by a large margin, with a worst-case error of nearly a full second on challenging audio. This alone is reason enough to generate your own captions rather than relying on YouTube's built-in system.
  • The perceptual threshold is around 80ms. Below that, most viewers cannot detect the offset between speech and caption. Above it, the mismatch becomes noticeable and distracting.
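
Getting word-level timestamps out of Whisper locally takes only a few lines. This is a minimal sketch assuming a recent release of the openai-whisper package (older versions lack the word_timestamps option) and an audio file named clip.wav:

```python
# Minimal sketch: word-level timestamps from Whisper running locally.
import whisper

model = whisper.load_model("large-v3")  # or "medium" for roughly 2x the speed
result = model.transcribe("clip.wav", word_timestamps=True)

words = []
for segment in result["segments"]:
    for w in segment.get("words", []):
        # Each entry carries the token text plus start/end offsets in seconds.
        words.append((w["word"].strip(), w["start"], w["end"]))

for word, start, end in words[:10]:
    print(f"{start:7.2f}  {end:7.2f}  {word}")
```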

What Causes Timing Errors

Most timing errors cluster around specific audio events rather than being distributed randomly (a rough analysis sketch follows this list):

  1. Word boundaries in fast speech. When words run together without clear pauses, the model struggles to determine where one ends and the next begins, often splitting the boundary 50-100ms from the actual transition.
  2. Plosive consonants. Hard B, P, and T sounds can confuse onset detection by 50-100ms because the burst of air creates an ambiguous audio event.
  3. Background music transitions. Sudden volume changes in background audio cause the model to misjudge speech timing, sometimes by 200ms or more during the transition.
  4. Sentence-initial words. The first word after a pause is often aligned 30-50ms late because the model waits for enough audio context before committing to a timestamp.
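
To check whether your own clips show the same clustering, a rough sketch like the one below buckets per-word errors by whether the word follows a pause (a proxy for sentence-initial position) or sits inside a fast stretch of speech. The 300ms pause threshold and 3.5 words-per-second cutoff are illustrative assumptions, not values from the test:

```python
from collections import defaultdict

def bucket_errors(truth_words, errors_ms, pause_s=0.3, fast_wps=3.5):
    """Group per-word timing errors by a likely cause.

    truth_words: ground-truth (word, start, end) tuples in spoken order.
    errors_ms: absolute timing error for each matching predicted word.
    pause_s / fast_wps: illustrative thresholds, not values from the test.
    """
    buckets = defaultdict(list)
    for i, ((word, start, end), err) in enumerate(zip(truth_words, errors_ms)):
        prev_end = truth_words[i - 1][2] if i > 0 else None
        if prev_end is None or start - prev_end >= pause_s:
            buckets["after_pause"].append(err)   # sentence-initial words
        elif end > start and 1.0 / (end - start) >= fast_wps:
            buckets["fast_speech"].append(err)   # words shorter than ~286 ms
        else:
            buckets["other"].append(err)
    return {cause: sum(v) / len(v) for cause, v in buckets.items() if v}
```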

Practical Recommendations

For a local pipeline like VidNo, Whisper large-v3 provides the best balance of cost (free), accuracy (38ms on clean audio), and privacy (nothing leaves your machine). If you need better accuracy on noisy audio and are willing to pay per minute, Deepgram's API is the current accuracy leader. For most developer content -- clean narration over a screen recording with minimal background noise -- Whisper medium is sufficient and runs about twice as fast as large-v3, making it the pragmatic choice for high-volume production.
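
Whichever model you pick, the word-level timestamps convert directly into caption files. Here is a minimal sketch of writing an SRT from (word, start, end) tuples; grouping a fixed number of words per cue is an illustrative simplification, since a production pipeline would also break cues on punctuation and long pauses:

```python
def to_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

def write_srt(words, path, words_per_cue=7):
    """Write (word, start, end) tuples to an SRT file, a fixed number of words per cue."""
    with open(path, "w", encoding="utf-8") as f:
        for cue_num, i in enumerate(range(0, len(words), words_per_cue), start=1):
            chunk = words[i:i + words_per_cue]
            start, end = chunk[0][1], chunk[-1][2]
            text = " ".join(w for w, _, _ in chunk)
            f.write(f"{cue_num}\n{to_timestamp(start)} --> {to_timestamp(end)}\n{text}\n\n")
```

This pairs naturally with the words list from the Whisper sketch above, e.g. write_srt(words, "clip.srt").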