Setting Up Karaoke-Style Word Highlighting for Video Captions

The karaoke effect -- where each word changes color as it is spoken, sweeping across the text like a highlighting marker -- has become one of the most recognizable caption styles in short-form video. It keeps viewers reading in sync with the speaker and creates a satisfying visual rhythm that static captions cannot match. Here is how to implement it from scratch and how to automate it for production use.

Step 1: Get Word-Level Timestamps

Karaoke highlighting requires knowing exactly when each word starts and ends. Sentence-level timestamps are not sufficient -- you need per-word timing with sub-50ms accuracy for the effect to feel right. Whisper provides this when run with word-level timestamp output:

import whisper

model = whisper.load_model("medium")
result = model.transcribe("audio.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"word: start - end")

This gives you sub-second precision per word. The accuracy depends heavily on audio quality -- clean studio narration gives alignment within 30-40ms, while audio with background music or environmental noise can drift to 80-100ms. For karaoke-style captions, even 100ms offset is noticeable because the highlight sweep visually conflicts with the audio timing.

Step 2: Generate ASS Subtitle File

The ASS (Advanced SubStation Alpha) format supports karaoke timing natively through the \k tag family. The \kf tag creates a smooth fill effect that sweeps across each word progressively, while \k creates an instant color switch:

Stop editing. Start shipping.

VidNo turns your coding sessions into YouTube videos — scripted, edited, thumbnailed, and uploaded. Shorts included. One command.

Try VidNo Free
Dialogue: 0,0:00:04.32,0:00:05.10,Default,,0,0,0,karaoke,{\kf43}This {\kf38}is {\kf52}a {\kf67}highlighted {\kf44}sentence

The numbers after \kf represent duration in centiseconds (hundredths of a second). So \kf43 means the fill effect takes 430 milliseconds for that word. The fill transitions from the secondary color to the primary color over that duration, creating the sweeping highlight that defines the karaoke look.

Step 3: Style the Karaoke Effect

The style section of your ASS file controls the colors used in the karaoke effect. The key properties:

  • PrimaryColour: The color text transitions TO (the highlighted state) -- typically yellow, cyan, or your brand color
  • SecondaryColour: The color text starts as (the unhighlighted state) -- typically white or light gray
  • OutlineColour: The outline color, typically black for maximum readability over any background

A common and effective combination: white secondary (unhighlighted text), yellow primary (highlighted text), black outline at 3-4px. As the speaker progresses through the sentence, each word sweeps from white to yellow, creating a clear visual indicator of the reading position.

Step 4: Burn Into Video

With the ASS file generated, burning it into the video is a single FFmpeg command:

ffmpeg -i video.mp4 -vf "ass=karaoke.ass" -c:a copy output.mp4

Step 5: Fine-Tune the Display

Several adjustments make the difference between amateur and polished karaoke captions:

  1. Limit words per line. Show 4-6 words at a time maximum. More than that makes the highlight sweep too fast to follow comfortably.
  2. Add breathing room. Insert 50-100ms gaps between word transitions. Without gaps, the highlight feels like a strobe effect that is tiring to watch.
  3. Use contrasting colors. The highlight color must be obviously different from the unhighlighted color. White-to-yellow works perfectly. White-to-light-gray does not provide enough contrast to register as a highlight.
  4. Test on mobile screens. What looks smooth on a desktop monitor can look choppy on a phone at lower playback resolution. Render at 60fps if possible for smoother color transitions.
  5. Match the pace. Fast speakers need faster highlight sweeps (shorter kf values). Slow speakers need longer sweeps. The highlight should always track the natural speaking pace.

Automating the Full Chain

Doing this manually for every video is tedious and error-prone. VidNo automates the full chain: Whisper extracts word timestamps, a script generates the ASS file with karaoke tags calculated from those timestamps and your style config, and FFmpeg burns it in during the final render. You configure the colors and style once in your project file, and every video gets consistent karaoke captions without manual timing work. The entire process adds about 10-15 seconds to a typical render for a 60-second Shorts video on a machine with a decent GPU.