Audio ducking is the technique where background music automatically drops in volume when narration starts and rises again when narration pauses or stops. It is standard practice in broadcast television, radio, and professional video production. In automated video pipelines where nobody is sitting at a mixing console, ducking needs to happen programmatically through audio processing filters.
How Ducking Works Technically
A ducking processor continuously analyzes the narration audio track to detect when speech is present (voice activity detection). When speech is detected -- meaning the narration signal exceeds a configured threshold -- the music track's volume is reduced by a specified amount, typically 10-20 dB. When speech stops and the narration signal drops below the threshold, the music volume fades back up to its normal level over a configurable release time.
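The detect-reduce-restore loop described above can be sketched as a per-sample gain computer. This is a simplified illustration, not production DSP: it assumes narration arrives as float samples in [-1, 1], uses a plain amplitude threshold in place of real voice activity detection, and uses linear attack/release ramps; the function name and defaults are illustrative.

```python
def ducking_gains(narration, sample_rate, threshold=0.02,
                  reduction_db=15.0, attack_ms=100.0, release_ms=500.0):
    """Per-sample music gain: 1.0 when narration is quiet, ducked when loud."""
    duck_gain = 10 ** (-reduction_db / 20)            # -15 dB -> ~0.178 linear
    attack_step = (1.0 - duck_gain) / (sample_rate * attack_ms / 1000.0)
    release_step = (1.0 - duck_gain) / (sample_rate * release_ms / 1000.0)
    gain, gains = 1.0, []
    for sample in narration:
        target = duck_gain if abs(sample) > threshold else 1.0
        if gain > target:    # narration active: ramp down over the attack time
            gain = max(target, gain - attack_step)
        else:                # narration quiet: ramp back over the release time
            gain = min(target, gain + release_step)
        gains.append(gain)
    return gains
```

Multiplying the music samples by these gains produces the ducked track; the four parameters in the next section map directly onto the function's arguments.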
Four parameters control ducking behavior and determine how natural it sounds:
- Threshold: The narration volume level that triggers the duck. Typically set between -30 and -40 dBFS. Too sensitive and breathing sounds trigger false ducks. Too insensitive and quiet speech passages play over full-volume music.
- Reduction: How much to lower the music volume when narration is active. 10-20 dB is the standard range. Less than 10 dB means music still competes with speech. More than 20 dB creates an unnatural silence-then-music pattern.
- Attack time: How quickly the music volume drops when narration starts. 50-200ms is typical. Faster attack ensures music gets out of the way immediately. Slower attack creates a smoother, more cinematic transition.
- Release time: How quickly the music volume returns when narration stops. 300-800ms is typical. This is the most critical parameter for natural-sounding results. Too fast and the music snaps back jarringly between sentences. Too slow and the music never fully returns during brief pauses.
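The threshold values above are quoted in dBFS, while FFmpeg's sidechaincompress filter (used below) takes its threshold as a linear amplitude where full scale is 1.0. Two small converters bridge the units; the function names are illustrative.

```python
import math

def dbfs_to_linear(dbfs):
    """Convert a dBFS level to linear amplitude (0 dBFS = 1.0)."""
    return 10 ** (dbfs / 20)

def linear_to_dbfs(amplitude):
    """Convert a linear amplitude back to dBFS."""
    return 20 * math.log10(amplitude)

# The -30 to -40 dBFS threshold range maps to roughly 0.032 down to 0.01
# linear; the 0.02 used in the FFmpeg command below sits near -34 dBFS.
```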
FFmpeg Sidechain Ducking Implementation
FFmpeg can perform audio ducking using the sidechaincompress filter, which uses one audio signal (narration) to control the dynamics processing of another signal (music):
ffmpeg -i narration.wav -i background_music.wav \
-filter_complex \
"[0:a]asplit=2[narr][sc]; \
[sc]aformat=channel_layouts=mono[scmono]; \
[1:a][scmono]sidechaincompress=\
threshold=0.02:\
ratio=8:\
attack=200:\
release=500:\
level_sc=0.8[ducked]; \
[narr][ducked]amix=inputs=2:duration=longest[out]" \
-map "[out]" output_mixed.wav
This command splits the narration into two copies: one for the final mix and one, downmixed to mono, as the sidechain analysis signal that drives the compressor. When the narration volume exceeds the threshold, the compressor reduces the music track's volume. When the narration is silent, the music plays at its configured baseline level.
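In a pipeline it is convenient to assemble the filter graph from dB-denominated settings rather than hand-editing the string. A minimal sketch, assuming the same graph layout as the command above (the function name and defaults are illustrative):

```python
def build_duck_filter(threshold_dbfs=-34.0, ratio=8,
                      attack_ms=200, release_ms=500, level_sc=0.8):
    """Assemble the filter_complex string, converting the dBFS threshold
    to the linear amplitude that sidechaincompress expects."""
    threshold = 10 ** (threshold_dbfs / 20)   # -34 dBFS -> ~0.02 linear
    return (
        "[0:a]asplit=2[narr][sc]; "
        "[sc]aformat=channel_layouts=mono[scmono]; "
        f"[1:a][scmono]sidechaincompress=threshold={threshold:.4f}:"
        f"ratio={ratio}:attack={attack_ms}:release={release_ms}:"
        f"level_sc={level_sc}[ducked]; "
        "[narr][ducked]amix=inputs=2:duration=longest[out]"
    )
```

The returned string is passed to FFmpeg as the -filter_complex argument.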
Getting the Parameters Right for Different Content Types
| Content Type | Music (no speech) | Music (during speech) | Attack | Release |
|---|---|---|---|---|
| Step-by-step tutorial | -20 dBFS | -35 dBFS | 100ms | 500ms |
| Documentary style | -18 dBFS | -30 dBFS | 200ms | 800ms |
| Fast-paced tech news | -22 dBFS | -38 dBFS | 50ms | 300ms |
| Ambient/relaxed tutorial | -16 dBFS | -28 dBFS | 300ms | 1000ms |
Faster attack times suit quick-talking content where you want the music to get out of the way immediately when narration begins. Slower attacks suit calm, documentary-style narration where the gradual fade feels more natural and cinematic.
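The table above can be carried in the pipeline as a preset dictionary. The preset names and field names here are assumptions for illustration; note that in every preset the implied ducking depth (baseline level minus during-speech level) lands inside the 10-20 dB range recommended earlier.

```python
# The parameter table expressed as presets (names and fields illustrative).
DUCKING_PRESETS = {
    "tutorial":    {"music_dbfs": -20, "ducked_dbfs": -35, "attack_ms": 100, "release_ms": 500},
    "documentary": {"music_dbfs": -18, "ducked_dbfs": -30, "attack_ms": 200, "release_ms": 800},
    "tech_news":   {"music_dbfs": -22, "ducked_dbfs": -38, "attack_ms": 50,  "release_ms": 300},
    "ambient":     {"music_dbfs": -16, "ducked_dbfs": -28, "attack_ms": 300, "release_ms": 1000},
}

def reduction_db(preset):
    """Ducking depth implied by a preset: baseline minus during-speech level."""
    return preset["music_dbfs"] - preset["ducked_dbfs"]
```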
Common Ducking Mistakes to Avoid
- Over-ducking: Reducing music to near-complete silence during speech sounds unnatural and makes the transitions between speech and silence jarring. The music should still be faintly audible during narration -- it provides continuity and warmth.
- Fast release on slow-paced content: Music volume snapping back to full level between sentences is distracting. Match the release time to your narration pacing. Slow speakers need slower release times.
- Not filtering breath sounds: If your TTS output includes audible breath sounds between sentences, the ducker may trigger on those breaths and keep the music ducked even during pauses. Apply a noise gate to narration audio before feeding it to the ducking processor.
- Skipping ducking entirely: Without ducking, music at a comfortable listening volume during pauses will compete with and mask narration during speech. Music quiet enough to never compete is too quiet during silent sections and adds nothing to the production.
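The breath-sound fix in the third point above can be sketched as a simple hold-based gate applied to the narration before it drives the ducker. This is an illustration of the idea, not tuned DSP: samples that stay below the threshold longer than the hold window are zeroed, while brief dips between words are left alone. Threshold and hold values here are assumptions.

```python
def noise_gate(samples, sample_rate, threshold=0.01, hold_ms=50.0):
    """Zero out sustained low-level passages (e.g. breaths) in narration."""
    hold_samples = int(sample_rate * hold_ms / 1000)
    out = list(samples)
    quiet_run = 0
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            quiet_run += 1
            if quiet_run > hold_samples:   # held quiet long enough: close gate
                out[i] = 0.0
        else:
            quiet_run = 0                  # speech present: keep gate open
    return out
```

FFmpeg users could instead insert its agate filter ahead of the sidechain split to the same effect.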
Integrating Ducking Into Your Production Pipeline
In an automated pipeline like VidNo, ducking runs as a standard step in the audio assembly stage. The pipeline has the narration track and the music track as separate audio files generated in previous stages. It applies the FFmpeg ducking filter with your configured parameters, producing a single mixed audio track that combines both elements with proper balance. This mixed track then gets combined with the video in the final assembly step.
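A minimal sketch of that assembly stage, assuming two FFmpeg invocations: one to mix narration and music with ducking, one to mux the mixed track with the video. The file names, the function name, and the step structure are illustrative, not part of any real VidNo API.

```python
# Filter graph from the sidechain-ducking section (fixed parameters for brevity).
DUCK_FILTER = (
    "[0:a]asplit=2[narr][sc]; "
    "[sc]aformat=channel_layouts=mono[scmono]; "
    "[1:a][scmono]sidechaincompress="
    "threshold=0.02:ratio=8:attack=200:release=500:level_sc=0.8[ducked]; "
    "[narr][ducked]amix=inputs=2:duration=longest[out]"
)

def assembly_commands(narration, music, video,
                      mixed="mixed.wav", final="final.mp4"):
    """Return the two FFmpeg invocations for the audio-assembly stage."""
    mix = ["ffmpeg", "-y", "-i", narration, "-i", music,
           "-filter_complex", DUCK_FILTER, "-map", "[out]", mixed]
    mux = ["ffmpeg", "-y", "-i", video, "-i", mixed,
           "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-c:a", "aac", final]
    return mix, mux
```

In the pipeline each command list would be executed with subprocess.run(cmd, check=True), so a failed mix aborts the stage before final assembly.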
No manual adjustment required. No mixing board. No audio engineering expertise needed beyond choosing your initial parameter settings. The result is professional audio balance on every video, consistently and reproducibly, without human intervention for each individual video.