Converting SRT Files Into Styled Burned-In Video Captions

You have a folder of SRT files from old videos. Maybe a transcription service generated them, maybe YouTube's auto-captions exported them, maybe you wrote them by hand years ago. Now you want to upgrade those plain text subtitles into styled, burned-in captions with modern font, color, and animation treatment without re-transcribing everything from scratch. Here is the complete workflow.

Understanding the Conversion Chain

SRT files contain text and timestamps but no real styling information beyond a few inline tags (<i>, <b>) that many renderers ignore. To get from SRT to styled burned-in captions, the conversion chain has three stages:

SRT (text + timing only)
  -> ASS (text + timing + styling + animation)
    -> FFmpeg burn-in (video with styled captions baked into frames)

The middle step -- converting SRT to ASS with full styling -- is where the value gets added. SRT tells you what to say and when. ASS adds how it should look.

Step 1: Parse the SRT

SRT format is straightforward plain text with numbered entries:

1
00:00:01,200 --> 00:00:04,800
This is the first caption line

2
00:00:05,100 --> 00:00:08,300
And this is the second one

Most programming languages have mature SRT parsers available. In Python, pysrt handles it cleanly. In Node.js, subtitle or srt-parser-2 work well. The output is a structured array of entries with start time, end time, and text content.
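If you would rather avoid a dependency, the format is simple enough that a small stdlib parser covers well-formed files. This is a sketch, not a spec-complete parser, and the helper name parse_srt is made up for illustration:

```python
import re

# One SRT entry: sequence number, timing line, then text up to a blank line.
ENTRY = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"
    r"(.*?)(?:\n\n|\Z)",
    re.S,
)

def parse_srt(raw: str):
    """Return a list of (start, end, text) tuples from SRT source text."""
    return [(m.group(2), m.group(3), m.group(4).strip())
            for m in ENTRY.finditer(raw)]
```

Multi-line captions come back with their internal newlines intact, which matters later because ASS uses a different line-break convention.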

Step 2: Add Styling via ASS Conversion

The ASS file adds a style definition header and converts each SRT entry into a dialogue line with style references. A basic conversion script in Python demonstrates the process:

import pysrt

subs = pysrt.open('captions.srt')

ass_header = """[Script Info]
ScriptType: v4.00+
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, BorderStyle, Outline, Shadow, Alignment, MarginV
Style: Default,Poppins-Bold,52,&H00FFFFFF,&H00000000,1,4,1,2,120

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""

with open('styled_captions.ass', 'w') as f:
    f.write(ass_header)
    for sub in subs:
        # ASS timestamps are H:MM:SS.cc (centiseconds), so trim microseconds
        start = sub.start.to_time().strftime("%H:%M:%S.%f")[:-4]
        end = sub.end.to_time().strftime("%H:%M:%S.%f")[:-4]
        text = sub.text.replace('\n', '\\N')  # ASS uses \N for line breaks
        f.write(f"Dialogue: 0,{start},{end},Default,,0,0,0,,{text}\n")

The style definition gives you full control over font, size, color, outline, and positioning. Change the Fontname and PrimaryColour values to match your channel's branding. The Outline value controls outline thickness -- 4px is a solid default for readability.
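One gotcha worth a helper: ASS colour values are &HAABBGGRR, i.e. an alpha byte first (00 = opaque) and the RGB bytes reversed relative to web hex, so white is &H00FFFFFF but orange #FF8800 becomes &H000088FF. A small sketch (the name hex_to_ass is made up here):

```python
def hex_to_ass(hex_rgb: str, alpha: int = 0) -> str:
    """Convert a web '#RRGGBB' colour to ASS '&HAABBGGRR' (alpha 0 = opaque)."""
    r, g, b = (int(hex_rgb.lstrip('#')[i:i + 2], 16) for i in (0, 2, 4))
    return f"&H{alpha:02X}{b:02X}{g:02X}{r:02X}"
```

Run your brand colours through a converter like this once rather than byte-swapping by hand.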

Step 3: Burn In With FFmpeg

Once you have the ASS file, burning it into the video is a single FFmpeg command:

ffmpeg -i original_video.mp4 -vf "ass=styled_captions.ass" -c:v libx264 -crf 18 -c:a copy output_with_captions.mp4

Batch Processing Multiple Files

If you have dozens of SRT files to convert, wrap the conversion and burn-in steps in a shell loop that processes your entire library automatically. For each video, the script finds its companion SRT file, converts to ASS with your style applied, and burns the result into the video. A library of 50 old videos can be processed overnight without any manual intervention.

Upgrading to Word-Level Timing

Standard SRT files only have sentence-level timing, which limits you to static sentence captions without word-level highlighting or animation. If you want word-level effects from an existing SRT, you have two practical options:

  • Re-transcribe with Whisper. Run Whisper on the original audio to get word-level timestamps from scratch. You lose nothing from the old SRT because Whisper generates a complete new transcription with finer timing resolution.
  • Force-align the SRT text. Use a forced alignment tool like aeneas or Montreal Forced Aligner that takes the known text from your SRT and aligns it word-by-word against the audio waveform. This preserves your existing text (useful if you manually corrected the SRT) while adding word-level timing data.
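As a concrete sketch of what word-level timing buys you: ASS karaoke tags ({\k<duration>}, with durations in centiseconds) tell the renderer to highlight each word as it is spoken. The helper below (the name words_to_karaoke is made up, and it assumes (word, start_sec, end_sec) tuples from either option above) builds the text portion of a single Dialogue line:

```python
def words_to_karaoke(words):
    """Build the text of one ASS Dialogue line with {\\k} karaoke tags.

    `words` is a list of (word, start_sec, end_sec) tuples; each {\\k}
    duration is in centiseconds. Gaps between words become untexted
    {\\k} tags so the highlight stays in sync with the audio.
    """
    parts = []
    prev_end = words[0][1]
    for word, start, end in words:
        gap = round((start - prev_end) * 100)  # silence before this word
        if gap:
            parts.append(f"{{\\k{gap}}}")
        parts.append(f"{{\\k{round((end - start) * 100)}}}{word} ")
        prev_end = end
    return "".join(parts).rstrip()
```

The returned string slots into the Text field of a Dialogue line, and the Default style's colours control the pre- and post-highlight appearance.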

The first option is simpler and produces better results for most use cases. VidNo's pipeline always generates word-level timestamps from scratch via Whisper, so importing old SRT files is unnecessary for new productions. But if you have manually corrected SRT files where the text accuracy matters more than the timing granularity, forced alignment preserves your corrections while adding word-level timing.

Quality Checklist

  • Verify the font file exists on the rendering system -- FFmpeg falls back to a default font silently if the specified font is missing, producing ugly results with no error message; the ass filter's fontsdir option (ass=styled_captions.ass:fontsdir=./fonts) lets you ship the font alongside the project instead of relying on system installs
  • Preview the first 30 seconds before batch processing the full set to catch styling issues early
  • Check mobile readability -- render a sample and view it on your phone at actual size
  • Confirm audio sync -- SRT timing can drift if the video was re-encoded at a frame rate different from the one the captions were written against