Converting SRT Files Into Styled Burned-In Video Captions
You have a folder of SRT files from old videos. Maybe a transcription service generated them, maybe YouTube's auto-captions exported them, maybe you wrote them by hand years ago. Now you want to upgrade those plain text subtitles into styled, burned-in captions with modern font, color, and animation treatment without re-transcribing everything from scratch. Here is the complete workflow.
Understanding the Conversion Chain
SRT files contain text and timestamps but absolutely zero styling information. To get from SRT to styled burned-in captions, the conversion chain has three stages:
SRT (text + timing only)
→ ASS (text + timing + styling + animation)
→ FFmpeg burn-in (video with styled captions baked into frames)
The middle step -- converting SRT to ASS with full styling -- is where the value gets added. SRT tells you what to say and when. ASS adds how it should look.
Step 1: Parse the SRT
SRT format is straightforward plain text with numbered entries:
1
00:00:01,200 --> 00:00:04,800
This is the first caption line

2
00:00:05,100 --> 00:00:08,300
And this is the second one
Most programming languages have mature SRT parsers available. In Python, pysrt handles it cleanly. In Node.js, subtitle or srt-parser-2 work well. The output is a structured array of entries with start time, end time, and text content.
Step 2: Add Styling via ASS Conversion
The ASS file adds a style definition header and converts each SRT entry into a dialogue line with style references. A basic conversion script in Python demonstrates the process:
import pysrt

subs = pysrt.open('captions.srt')

ass_header = """[Script Info]
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, BorderStyle, Outline, Shadow, Alignment, MarginV
Style: Default,Poppins-Bold,52,&H00FFFFFF,&H00000000,1,4,1,2,120

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""

with open('styled_captions.ass', 'w') as f:
    f.write(ass_header)
    for sub in subs:
        # ASS timestamps use centisecond precision: H:MM:SS.cc
        start = sub.start.to_time().strftime("%H:%M:%S.%f")[:-4]
        end = sub.end.to_time().strftime("%H:%M:%S.%f")[:-4]
        # SRT line breaks become the \N escape in ASS
        text = sub.text.replace('\n', '\\N')
        f.write(f"Dialogue: 0,{start},{end},Default,,0,0,0,,{text}\n")
The style definition gives you full control over font, size, color, outline, and positioning. Change the Fontname and PrimaryColour values to match your channel's branding. The Outline value controls outline thickness -- 4px is a solid default for readability.
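One common stumbling block: ASS colour values are byte-reversed relative to CSS. PrimaryColour is &HAABBGGRR -- alpha first (where 00 means fully opaque), then blue, green, red. A small helper (the function name is my own) converts a familiar #RRGGBB value into that notation:

```python
def css_to_ass(css_hex: str, alpha: int = 0) -> str:
    """Convert a CSS-style #RRGGBB colour to ASS &HAABBGGRR notation.

    ASS stores the channels in reverse order (blue first), and its
    alpha byte is inverted relative to CSS: 00 is fully opaque.
    """
    hex_digits = css_hex.lstrip("#")
    r, g, b = (int(hex_digits[i:i + 2], 16) for i in (0, 2, 4))
    return f"&H{alpha:02X}{b:02X}{g:02X}{r:02X}"
```

So css_to_ass("#FFFFFF") gives the opaque white "&H00FFFFFF" used in the style line above, and a brand colour like #FF0000 becomes "&H000000FF" -- the red byte moves to the end.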
Step 3: Burn In With FFmpeg
Once you have the ASS file, burning it into the video is a single FFmpeg command:
ffmpeg -i original_video.mp4 -vf "ass=styled_captions.ass" -c:v libx264 -crf 18 -c:a copy output_with_captions.mp4
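If you drive FFmpeg from a script rather than the shell, be aware that characters like backslash, colon, and single quote are special inside a filtergraph argument and need escaping within the ass= value. A sketch of building the command (helper names are mine, not an FFmpeg API; it assumes ffmpeg is on PATH when you actually run it):

```python
def escape_filter_path(path: str) -> str:
    """Escape characters that are special inside an FFmpeg filter argument."""
    for ch in ("\\", ":", "'"):  # escape backslash first
        path = path.replace(ch, "\\" + ch)
    return path

def burn_in_command(video: str, ass_file: str, output: str) -> list[str]:
    """Build the argv list for burning an ASS file into a video."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-vf", f"ass={escape_filter_path(ass_file)}",
        "-c:v", "libx264", "-crf", "18",
        "-c:a", "copy",
        output,
    ]

# Execute with: subprocess.run(burn_in_command(...), check=True)
```

Passing the argv list directly to subprocess (no shell=True) avoids a second layer of shell quoting on top of the filter escaping.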
Batch Processing Multiple Files
If you have dozens of SRT files to convert, wrap the conversion and burn-in steps in a shell loop that processes your entire library automatically. For each video, the script finds its companion SRT file, converts to ASS with your style applied, and burns the result into the video. A library of 50 old videos can be processed overnight without any manual intervention.
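The pairing step of that loop can be sketched in Python (the .mp4 extension and the _captioned output suffix are my assumptions -- adjust to your library's naming; each returned job then runs the conversion script and the FFmpeg command from the earlier steps):

```python
from pathlib import Path

def find_jobs(folder: str) -> list[tuple[Path, Path, Path]]:
    """Pair each SRT file with its same-named video and an output path.

    Returns (srt, video, output) triples; SRT files without a
    matching video are skipped rather than failing the batch.
    """
    jobs = []
    for srt in sorted(Path(folder).glob("*.srt")):
        video = srt.with_suffix(".mp4")
        if not video.exists():
            continue  # orphan subtitle file, nothing to burn into
        output = video.with_name(video.stem + "_captioned.mp4")
        jobs.append((srt, video, output))
    return jobs
```

Iterate over find_jobs(...) and run the convert-then-burn steps for each triple; writing outputs to new filenames keeps the originals untouched if a job fails midway.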
Upgrading to Word-Level Timing
Standard SRT files only have sentence-level timing, which limits you to static sentence captions without word-level highlighting or animation. If you want word-level effects from an existing SRT, you have two practical options:
- Re-transcribe with Whisper. Run Whisper on the original audio to get word-level timestamps from scratch. You lose nothing from the old SRT because Whisper generates a complete new transcription with finer timing resolution.
- Force-align the SRT text. Use a forced alignment tool like aeneas or Montreal Forced Aligner that takes the known text from your SRT and aligns it word-by-word against the audio waveform. This preserves your existing text (useful if you manually corrected the SRT) while adding word-level timing data.
The first option is simpler and produces better results for most use cases. VidNo's pipeline always generates word-level timestamps from scratch via Whisper, so importing old SRT files is unnecessary for new productions. But if you have manually corrected SRT files where the text accuracy matters more than the timing granularity, forced alignment preserves your corrections while adding word-level timing.
Quality Checklist
- Verify the font file exists on the rendering system -- FFmpeg falls back to a default font silently if the specified font is missing, producing ugly results with no error message
- Preview the first 30 seconds before batch processing the full set to catch styling issues early
- Check mobile readability -- render a sample and view it on your phone at actual size
- Confirm audio sync -- SRT timing can drift if the video was re-encoded at a different frame rate at some point in its history