Volume Ducking Without Manual Keyframes
Background music in tutorials serves one purpose: fill the silence without competing with narration. When the voiceover is active, the music should drop to near-inaudible levels. When the voiceover pauses, the music should gently rise to prevent dead silence. This behavior is called "ducking," and doing it manually means placing volume keyframes throughout your entire timeline.
For a 15-minute video with frequent narration changes, that is hundreds of keyframes. AI audio ducking automates this entirely.
How Automatic Ducking Works
The ducking algorithm is straightforward in concept:
- Detect speech segments in the voiceover track
- When speech is active, reduce music volume to a target level (typically -20 to -24 dB below narration)
- When speech stops, raise music volume to the "ambient" level (typically -12 to -16 dB below the narration's active level)
- Apply smooth fades (200-500ms attack, 500-1000ms release) to prevent jarring volume changes
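The steps above can be sketched in a few lines of Python. The frame size, dB targets, and the simple RMS gate are illustrative assumptions, not a production speech detector:

```python
import numpy as np

def duck_music(narration, music, sr, frame=1024,
               gate_db=-40.0, duck_db=-22.0, ambient_db=-14.0,
               attack_ms=300, release_ms=800):
    """Scale music frame-by-frame: low gain during speech,
    higher gain in pauses, smoothed by attack/release times."""
    n = min(len(narration), len(music))
    duck = 10 ** (duck_db / 20)        # target gain while speech is active
    ambient = 10 ** (ambient_db / 20)  # target gain during pauses
    gate = 10 ** (gate_db / 20)        # RMS level that counts as "speech"
    # One-pole smoothing coefficients derived from attack/release times
    a_atk = np.exp(-frame / (sr * attack_ms / 1000))
    a_rel = np.exp(-frame / (sr * release_ms / 1000))
    gain, out = ambient, np.copy(music[:n]).astype(float)
    for i in range(0, n, frame):
        seg = narration[i:i + frame]
        speaking = np.sqrt(np.mean(seg ** 2)) > gate
        target = duck if speaking else ambient
        coef = a_atk if target < gain else a_rel  # fall fast, recover slowly
        gain = coef * gain + (1 - coef) * target
        out[i:i + frame] *= gain
    return out
```

Because the gain moves toward its target exponentially rather than jumping, the music never snaps between levels, which is exactly what the attack/release fades accomplish.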
FFmpeg implements this with the sidechaincompress filter. The filter compresses its first input using its second input as the trigger, so the music must come first and the narration second; because each labeled stream can be consumed only once, the narration is split with asplit, and the ducked music is mixed back with the narration using amix (normalize=0, available since FFmpeg 4.4, stops amix from rescaling the inputs):

```sh
ffmpeg -i narration.wav -i music.mp3 -filter_complex "[0:a]asplit=2[nar][sc];[1:a]volume=0.15[music];[music][sc]sidechaincompress=threshold=0.02:ratio=8:attack=200:release=800[ducked];[nar][ducked]amix=inputs=2:duration=first:normalize=0[out]" -map "[out]" mixed.wav
```
The threshold sets the narration level, as a linear amplitude, above which the music is ducked; when the narration falls below it, the music rises back. The ratio controls how aggressively the music is reduced during speech, and the attack and release values (in milliseconds) control the fade speeds.
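Since sidechaincompress takes its threshold as a linear amplitude while mixing targets are usually discussed in dB, it helps to convert between the two. A quick sketch:

```python
import math

def db_to_linear(db):
    """Convert a dB value to the linear amplitude FFmpeg expects."""
    return 10 ** (db / 20)

def linear_to_db(linear):
    """Convert a linear amplitude back to dB."""
    return 20 * math.log10(linear)

# threshold=0.02 in the command above corresponds to roughly -34 dB
print(round(linear_to_db(0.02), 1))
```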
Choosing the Right Music
Not all background music works for developer tutorials. The music needs to:
- Be instrumentally simple -- no vocals, no complex melodies that compete for attention
- Have a consistent energy level -- no dramatic builds or drops that distract from the narration
- Loop cleanly -- tutorial videos vary in length, so the music needs to tile seamlessly
- Be royalty-free -- YouTube's Content ID system will claim or mute your video if the music is copyrighted
Music Categories That Work
| Category | Best For | Example Genres |
|---|---|---|
| Lo-fi ambient | Coding tutorials, long-form content | Chillhop, ambient electronica |
| Minimal electronic | Fast-paced demos, CLI tools | Downtempo, minimal techno |
| Acoustic minimal | Explanation-heavy content | Solo piano, acoustic guitar loops |
| Silence | Complex topics requiring full concentration | None -- sometimes no music is best |
The Volume Balance Formula
Getting the relative volumes right matters more than any other mixing decision. The standard formula for tutorial content:
- Narration: -14 LUFS (YouTube's target loudness)
- Background music during speech: -35 to -40 LUFS
- Background music during silence: -25 to -30 LUFS
- Sound effects: -28 to -32 LUFS
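Hitting these targets is simple arithmetic once each stem's loudness has been measured (with a tool such as FFmpeg's loudnorm filter): the required gain in dB is the target minus the measurement. A minimal sketch, where the measured values are made-up examples:

```python
def gain_to_target(measured_lufs, target_lufs):
    """dB of gain needed to move a stem from its measured loudness to the target."""
    return target_lufs - measured_lufs

def db_to_amplitude(db):
    """Linear amplitude multiplier corresponding to a dB gain."""
    return 10 ** (db / 20)

# Hypothetical measurements: (measured LUFS, target LUFS) per stem
stems = {
    "narration": (-19.0, -14.0),
    "music_during_speech": (-20.0, -38.0),
    "music_during_silence": (-20.0, -27.0),
}
for name, (measured, target) in stems.items():
    db = gain_to_target(measured, target)
    print(f"{name}: {db:+.1f} dB (x{db_to_amplitude(db):.3f})")
```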
If you can consciously hear the background music while the narrator is speaking, it is too loud. The music should register subconsciously -- filling silence without demanding attention.
Common Mistakes
The most frequent ducking mistakes in automated systems:
- Ducking too aggressively -- music disappears completely during speech, then appears suddenly during pauses, creating a "pumping" effect
- Ducking too slowly -- the first word of each narration segment competes with the music before the ducker reacts
- Not ducking during keyboard sounds -- if the recording includes typing sounds, the music should duck for those too, as they are part of the content audio
- Abrupt music endings -- the music should fade out gracefully at the end of the video, not stop abruptly with the last frame
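The last mistake, abrupt endings, is easy to avoid in a pipeline: apply a fade curve over the music's final seconds before mixing. A sketch using a cosine fade, with the fade length as an arbitrary choice:

```python
import numpy as np

def fade_out(music, sr, fade_seconds=3.0):
    """Apply a cosine fade over the last fade_seconds of the track."""
    out = np.copy(music).astype(float)
    n = min(len(out), int(sr * fade_seconds))
    if n == 0:
        return out
    # Cosine ramp from 1.0 down to 0.0 over the fade window
    ramp = np.cos(np.linspace(0, np.pi / 2, n))
    out[-n:] *= ramp
    return out
```

A cosine curve sounds smoother than a linear ramp because its slope eases in and out rather than cutting off at a constant rate.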
VidNo handles background music mixing as part of its audio pipeline. It detects speech segments from the generated voiceover, applies ducking with tuned attack/release curves, and normalizes the final mix to YouTube's -14 LUFS target. Music selection is configurable -- choose a genre category or provide your own royalty-free tracks.
Testing Your Mix
Before publishing, test your audio mix in the conditions your viewers will experience it. Play the video through phone speakers -- if the music overwhelms the narration on small speakers, it is too loud. Play it through noise-canceling headphones -- if the ducking transitions are audible as pumping artifacts, the attack and release curves need adjustment. Play it at 50% volume -- if the narration is still clear and the music is barely perceptible, your balance is correct.
The most revealing test is the "background listening" test. Play your video while doing something else. If the narration pulls your attention back at key moments and the music fills the gaps without demanding attention, the mix is working as intended. If you notice the music consciously at any point during narration, reduce its level by another 2-3 dB.
Automated mixing gets you 90% of the way to a professional audio balance. The remaining 10% comes from testing in real-world listening conditions and making small adjustments to your pipeline's volume parameters. Those adjustments carry forward to every future video, so the time investment pays off repeatedly.