Burning Captions During the Render Pass Instead of After
Most caption workflows treat captioning as a post-processing step that happens after the video is already rendered. You render your video, save it to disk, then run it through a separate captioning tool which loads the file, adds captions, re-encodes the entire video, and saves a new file. This means your video gets compressed twice: once during the initial render and once during the caption burn-in pass. Double encoding means measurable quality loss and double the total render time.
The Single-Pass Approach
A smarter architecture generates captions and burns them in during the same FFmpeg render that produces the final video. Instead of the two-pass workflow:
render video -> save file -> load file -> transcribe -> burn captions -> save new file
You get a single-pass workflow:
transcribe audio -> generate ASS file -> render video with captions in one FFmpeg command
The transcription step still needs to happen before the render starts because you need the complete audio waveform to generate accurate word-level timestamps. But the actual caption burning happens during the only encoding pass, not as a separate re-encode step afterward. The video frames are written to the output file exactly once, with captions composited onto them during that single write.
Why Single-Pass Matters
- No quality loss from re-encoding. Each encoding pass through lossy codecs like H.264 introduces generational quality loss. The degradation is subtle but visible on text, sharp edges, and fine details -- exactly the kind of content in developer screen recordings where code text needs to remain crisp and readable.
- 40-50% faster total processing time. You skip an entire decode-encode cycle. On a 10-minute 1080p video, this saves 2-4 minutes depending on your hardware and codec settings. Over hundreds of videos, the savings compound significantly.
- Simpler pipeline architecture. One FFmpeg command instead of two. Fewer temporary files on disk, fewer failure points in the pipeline, less disk space used during processing.
- Smaller final file. Single-pass encoding at a given CRF value produces a smaller file than two passes at the same CRF because the encoder does not have to re-compress already-compressed artifacts.
Implementation
The key is separating the transcription timeline from the render timeline. Transcription needs the audio, which you can extract early in the pipeline before any video processing begins:
- Extract audio from raw recording:
ffmpeg -i recording.mp4 -vn -acodec pcm_s16le audio.wav
- Run Whisper on the extracted audio to get word-level timestamps (can run in parallel with other pre-render tasks)
- Generate ASS subtitle file from timestamps combined with your style configuration
- Render final video with captions in one pass:
ffmpeg -i recording.mp4 -vf "ass=captions.ass" -c:v libx264 -crf 18 output.mp4
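The ASS generation step can be sketched in a few lines of Python. This is an illustrative minimal version, not VidNo's actual implementation: the style block is stripped down to a few fields, and the caption grouping (one dialogue event per word) is an assumption; a real pipeline would group words into readable phrases.

```python
# Sketch: build a minimal ASS subtitle file from word-level timestamps.
# The style fields and one-event-per-word grouping are illustrative assumptions.

def ass_time(seconds: float) -> str:
    """Format seconds as ASS H:MM:SS.cc (centisecond precision)."""
    cs = round(seconds * 100)
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"

# Reduced header: the Format lines declare exactly the fields each
# Style/Dialogue line below supplies, which libass accepts.
ASS_HEADER = """[Script Info]
ScriptType: v4.00+
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, Alignment, MarginV
Style: Caption,Arial,48,&H00FFFFFF,2,60

[Events]
Format: Layer, Start, End, Style, Text
"""

def build_ass(words: list[tuple[str, float, float]]) -> str:
    """words: (text, start_sec, end_sec) triples, e.g. from Whisper output."""
    lines = [ASS_HEADER]
    for text, start, end in words:
        lines.append(
            f"Dialogue: 0,{ass_time(start)},{ass_time(end)},Caption,{text}\n"
        )
    return "".join(lines)
```

Writing the returned string to `captions.ass` before the render step is all the FFmpeg command needs.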
The critical detail: the ASS file must be fully generated before the FFmpeg render command starts. This is not truly "real-time" in the sense of live captioning during recording. It is real-time in the sense that captions are composited during the render rather than in a separate pass afterward.
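In a scripted pipeline, that ordering constraint is easy to enforce by building the render command only after the ASS file exists. A minimal sketch, mirroring the example command above (the `-c:a copy` flag is an addition here, assuming you want the audio passed through untouched rather than re-encoded):

```python
# Sketch: assemble the single-pass render command once captions.ass is written.
import subprocess

def single_pass_render_cmd(src: str, ass_path: str, dst: str,
                           crf: int = 18) -> list[str]:
    """Build the FFmpeg argv that burns captions during the only encode."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"ass={ass_path}",    # captions composited during this encode
        "-c:v", "libx264", "-crf", str(crf),
        "-c:a", "copy",              # audio passes through, no re-encode
        dst,
    ]

cmd = single_pass_render_cmd("recording.mp4", "captions.ass", "output.mp4")
# subprocess.run(cmd, check=True)  # invoke only after captions.ass exists on disk
```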
Parallel Transcription for Maximum Speed
For maximum pipeline throughput, you can transcribe the audio in parallel with other pre-render processing tasks like thumbnail generation, metadata creation, and script formatting. By the time the pipeline reaches the FFmpeg render step, the ASS file is already generated and waiting. The render step simply includes it as a video filter argument.
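The pre-render phase described above can be sketched with `concurrent.futures`. The task functions here (`transcribe_and_build_ass`, `make_thumbnail`) are hypothetical stand-ins for your own pipeline steps, not a real API:

```python
# Sketch of a parallel pre-render phase: transcription runs alongside other
# prep tasks, and the render step waits only for the results it needs.
from concurrent.futures import ThreadPoolExecutor

def transcribe_and_build_ass(audio_path: str) -> str:
    # Placeholder: run Whisper on audio_path, write captions.ass, return its path.
    return "captions.ass"

def make_thumbnail(video_path: str) -> str:
    # Placeholder: generate a thumbnail, return its path.
    return "thumb.png"

def run_prerender(video: str, audio: str) -> dict[str, str]:
    with ThreadPoolExecutor() as pool:
        ass_future = pool.submit(transcribe_and_build_ass, audio)
        thumb_future = pool.submit(make_thumbnail, video)
        # .result() blocks until each task finishes, so by the time this
        # returns, the ASS file exists and the render command can be built.
        return {"ass": ass_future.result(), "thumbnail": thumb_future.result()}
```

Threads suit this workload because the heavy lifting (Whisper inference, image encoding) happens outside the GIL in native code or subprocesses; a `ProcessPoolExecutor` is a drop-in alternative for pure-Python tasks.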
VidNo structures its pipeline exactly this way. Audio extraction, Whisper transcription, and ASS generation all happen in the pre-render phase. The FFmpeg render command includes the ASS subtitle filter, so captions are composited as part of the single encoding pass. There is no post-processing caption step and no unnecessary re-encoding of already-rendered video frames.
Limitations
This approach assumes you have the complete audio track available before rendering begins. For live streaming or real-time screen recording that goes directly to output, you cannot pre-transcribe because the audio does not exist in complete form yet. In those edge cases, a post-processing caption pass remains unavoidable. But for the vast majority of YouTube content -- pre-recorded, edited, and rendered from source files -- single-pass captioning is strictly superior in quality, speed, and simplicity.
Every re-encode is a quality tax on your video. If you can avoid paying it by restructuring your pipeline, you should.