Audio ducking is the technique where background music automatically drops in volume when narration starts and rises again when narration pauses or stops. It is standard practice in broadcast television, radio, and professional video production. In automated video pipelines where nobody is sitting at a mixing console, ducking needs to happen programmatically through audio processing filters.
How Ducking Works Technically
A ducking processor continuously analyzes the narration audio track to detect when speech is present (voice activity detection). When speech is detected -- meaning the narration signal exceeds a configured threshold -- the music track's volume is reduced by a specified amount, typically 10-20 dB. When speech stops and the narration signal drops below the threshold, the music volume fades back up to its normal level over a configurable release time.
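The detect-reduce-restore loop described above can be sketched as a per-sample gain computer. This is a simplified illustration, not production DSP: it assumes narration arrives as float samples in [-1, 1], uses a plain amplitude threshold in place of real voice activity detection, and uses linear attack/release ramps; the function name and defaults are illustrative.

```python
def ducking_gains(narration, sample_rate, threshold=0.02,
                  reduction_db=15.0, attack_ms=100.0, release_ms=500.0):
    """Per-sample music gain: 1.0 when narration is quiet, ducked when loud."""
    duck_gain = 10 ** (-reduction_db / 20)            # -15 dB -> ~0.178 linear
    attack_step = (1.0 - duck_gain) / (sample_rate * attack_ms / 1000.0)
    release_step = (1.0 - duck_gain) / (sample_rate * release_ms / 1000.0)
    gain, gains = 1.0, []
    for sample in narration:
        target = duck_gain if abs(sample) > threshold else 1.0
        if gain > target:    # narration active: ramp down over the attack time
            gain = max(target, gain - attack_step)
        else:                # narration quiet: ramp back over the release time
            gain = min(target, gain + release_step)
        gains.append(gain)
    return gains
```

Multiplying the music samples by these gains produces the ducked track; the four parameters in the next section map directly onto the function's arguments.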
Four parameters control ducking behavior and determine how natural it sounds:
- Threshold: The narration volume level that triggers the duck. Typically set between -30 and -40 dBFS. Too sensitive and breathing sounds trigger false ducks. Too insensitive and quiet speech passages play over full-volume music.
- Reduction: How much to lower the music volume when narration is active. 10-20 dB is the standard range. Less than 10 dB means music still competes with speech. More than 20 dB creates an unnatural silence-then-music pattern.
- Attack time: How quickly the music volume drops when narration starts. 50-200ms is typical. Faster attack ensures music gets out of the way immediately. Slower attack creates a smoother, more cinematic transition.
- Release time: How quickly the music volume returns when narration stops. 300-800ms is typical. This is the most critical parameter for natural-sounding results. Too fast and the music snaps back jarringly between sentences. Too slow and the music never fully returns during brief pauses.
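The threshold values above are quoted in dBFS, while FFmpeg's sidechaincompress filter (used below) takes its threshold as a linear amplitude where full scale is 1.0. Two small converters bridge the units; the function names are illustrative.

```python
import math

def dbfs_to_linear(dbfs):
    """Convert a dBFS level to linear amplitude (0 dBFS = 1.0)."""
    return 10 ** (dbfs / 20)

def linear_to_dbfs(amplitude):
    """Convert a linear amplitude back to dBFS."""
    return 20 * math.log10(amplitude)

# The -30 to -40 dBFS threshold range maps to roughly 0.032 down to 0.01
# linear; the 0.02 used in the FFmpeg command below sits near -34 dBFS.
```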
FFmpeg Sidechain Ducking Implementation
FFmpeg can perform audio ducking using the sidechaincompress filter, which uses one audio signal (narration) to control the dynamics processing of another signal (music):
ffmpeg -i narration.wav -i background_music.wav \
-filter_complex \
"[0:a]asplit=2[narr][sc]; \
[sc]aformat=channel_layouts=mono[scmono]; \
[1:a][scmono]sidechaincompress=\
threshold=0.02:\
ratio=8:\
attack=200:\
release=500:\
level_sc=0.8[ducked]; \
[narr][ducked]amix=inputs=2:duration=longest[out]" \
-map "[out]" output_mixed.wav
This command splits the narration into two copies: one for the final mix and one, downmixed to mono, as the sidechain analysis signal that drives the compressor. When the narration volume exceeds the threshold, the compressor reduces the music track's volume. When the narration is silent, the music plays at its configured baseline level.
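In a pipeline it is convenient to assemble the filter graph from dB-denominated settings rather than hand-editing the string. A minimal sketch, assuming the same graph layout as the command above (the function name and defaults are illustrative):

```python
def build_duck_filter(threshold_dbfs=-34.0, ratio=8,
                      attack_ms=200, release_ms=500, level_sc=0.8):
    """Assemble the filter_complex string, converting the dBFS threshold
    to the linear amplitude that sidechaincompress expects."""
    threshold = 10 ** (threshold_dbfs / 20)   # -34 dBFS -> ~0.02 linear
    return (
        "[0:a]asplit=2[narr][sc]; "
        "[sc]aformat=channel_layouts=mono[scmono]; "
        f"[1:a][scmono]sidechaincompress=threshold={threshold:.4f}:"
        f"ratio={ratio}:attack={attack_ms}:release={release_ms}:"
        f"level_sc={level_sc}[ducked]; "
        "[narr][ducked]amix=inputs=2:duration=longest[out]"
    )
```

The returned string is passed to FFmpeg as the -filter_complex argument.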
Getting the Parameters Right for Different Content Types
| Content Type | Music (no speech) | Music (during speech) | Attack | Release |
|---|---|---|---|---|
| Step-by-step tutorial | -20 dBFS | -35 dBFS | 100ms | 500ms |
| Documentary style | -18 dBFS | -30 dBFS | 200ms | 800ms |
| Fast-paced tech news | -22 dBFS | -38 dBFS | 50ms | 300ms |
| Ambient/relaxed tutorial | -16 dBFS | -28 dBFS | 300ms | 1000ms |
Faster attack times suit quick-talking content where you want the music to get out of the way immediately when narration begins. Slower attacks suit calm, documentary-style narration where the gradual fade feels more natural and cinematic.
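The table above can be carried in the pipeline as a preset dictionary. The preset names and field names here are assumptions for illustration; note that in every preset the implied ducking depth (baseline level minus during-speech level) lands inside the 10-20 dB range recommended earlier.

```python
# The parameter table expressed as presets (names and fields illustrative).
DUCKING_PRESETS = {
    "tutorial":    {"music_dbfs": -20, "ducked_dbfs": -35, "attack_ms": 100, "release_ms": 500},
    "documentary": {"music_dbfs": -18, "ducked_dbfs": -30, "attack_ms": 200, "release_ms": 800},
    "tech_news":   {"music_dbfs": -22, "ducked_dbfs": -38, "attack_ms": 50,  "release_ms": 300},
    "ambient":     {"music_dbfs": -16, "ducked_dbfs": -28, "attack_ms": 300, "release_ms": 1000},
}

def reduction_db(preset):
    """Ducking depth implied by a preset: baseline minus during-speech level."""
    return preset["music_dbfs"] - preset["ducked_dbfs"]
```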
Common Ducking Mistakes to Avoid
- Over-ducking: Reducing music to near-complete silence during speech sounds unnatural and makes the transitions between speech and silence jarring. The music should still be faintly audible during narration -- it provides continuity and warmth.
- Fast release on slow-paced content: Music volume snapping back to full level between sentences is distracting. Match the release time to your narration pacing. Slow speakers need slower release times.
- Not filtering breath sounds: If your TTS output includes audible breath sounds between sentences, the ducker may trigger on those breaths and keep the music ducked even during pauses. Apply a noise gate to narration audio before feeding it to the ducking processor.
- Skipping ducking entirely: Without ducking, music at a comfortable listening volume during pauses will compete with and mask narration during speech. Music quiet enough to never compete is too quiet during silent sections and adds nothing to the production.
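The breath-sound fix in the third point above can be sketched as a simple hold-based gate applied to the narration before it drives the ducker. This is an illustration of the idea, not tuned DSP: samples that stay below the threshold longer than the hold window are zeroed, while brief dips between words are left alone. Threshold and hold values here are assumptions.

```python
def noise_gate(samples, sample_rate, threshold=0.01, hold_ms=50.0):
    """Zero out sustained low-level passages (e.g. breaths) in narration."""
    hold_samples = int(sample_rate * hold_ms / 1000)
    out = list(samples)
    quiet_run = 0
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            quiet_run += 1
            if quiet_run > hold_samples:   # held quiet long enough: close gate
                out[i] = 0.0
        else:
            quiet_run = 0                  # speech present: keep gate open
    return out
```

FFmpeg users could instead insert its agate filter ahead of the sidechain split to the same effect.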
Integrating Ducking Into Your Production Pipeline
In an automated pipeline like VidNo, ducking runs as a standard step in the audio assembly stage. The pipeline has the narration track and the music track as separate audio files generated in previous stages. It applies the FFmpeg ducking filter with your configured parameters, producing a single mixed audio track that combines both elements with proper balance. This mixed track then gets combined with the video in the final assembly step.
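A minimal sketch of that assembly stage, assuming two FFmpeg invocations: one to mix narration and music with ducking, one to mux the mixed track with the video. The file names, the function name, and the step structure are illustrative, not part of any real VidNo API.

```python
# Filter graph from the sidechain-ducking section (fixed parameters for brevity).
DUCK_FILTER = (
    "[0:a]asplit=2[narr][sc]; "
    "[sc]aformat=channel_layouts=mono[scmono]; "
    "[1:a][scmono]sidechaincompress="
    "threshold=0.02:ratio=8:attack=200:release=500:level_sc=0.8[ducked]; "
    "[narr][ducked]amix=inputs=2:duration=longest[out]"
)

def assembly_commands(narration, music, video,
                      mixed="mixed.wav", final="final.mp4"):
    """Return the two FFmpeg invocations for the audio-assembly stage."""
    mix = ["ffmpeg", "-y", "-i", narration, "-i", music,
           "-filter_complex", DUCK_FILTER, "-map", "[out]", mixed]
    mux = ["ffmpeg", "-y", "-i", video, "-i", mixed,
           "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-c:a", "aac", final]
    return mix, mux
```

In the pipeline each command list would be executed with subprocess.run(cmd, check=True), so a failed mix aborts the stage before final assembly.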
No manual adjustment required. No mixing board. No audio engineering expertise needed beyond choosing your initial parameter settings. The result is professional audio balance on every video, consistently and reproducibly, without human intervention for each individual video.