Calling a TTS API is trivial. Integrating voice synthesis into a video pipeline that runs unattended, handles failures gracefully, and produces consistent output across hundreds of videos -- that is where the engineering happens.

Architecture Decisions That Matter Early

Before you write a single API call, settle these questions. Getting them wrong means rebuilding later, which is expensive once you already have hundreds of videos flowing through the pipeline.

  • Sync vs async generation: Do you block your pipeline waiting for audio, or queue synthesis jobs and poll for completion? Blocking is simpler but serializes your pipeline. Async lets you process other pipeline stages while waiting for audio.
  • Voice consistency: Are you using a fixed voice ID, or a cloned voice that might drift between API versions? Cloned voices produce better output but introduce a maintenance burden when providers update their models.
  • Segment granularity: Do you synthesize the entire script as one audio file, or break it into per-section chunks for easier timeline alignment? Chunked synthesis costs more API calls but gives you finer control over pacing and makes error recovery granular.
  • Fallback strategy: What happens when the API returns a 429 or the audio has artifacts? You need automated retry with exponential backoff and a quality validation step that can flag bad output.
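The decisions above can be captured as a single config object so they are settled explicitly rather than scattered through the code. This is a hypothetical shape -- the field names are illustrative, not from any specific library:

```typescript
// Illustrative config capturing the four up-front decisions.
interface VoicePipelineConfig {
  generation: "sync" | "async"; // block on synthesis, or queue-and-poll
  voiceId: string;              // fixed or cloned voice, referenced by ID
  modelVersion: string;         // pin the provider model version
  segmentMaxWords: number;      // chunk granularity (e.g. 200 words)
  maxRetries: number;           // retry budget per segment
  backoffBaseMs: number;        // exponential backoff base delay
}

const config: VoicePipelineConfig = {
  generation: "async",
  voiceId: "narrator-v2",
  modelVersion: "2024-06",
  segmentMaxWords: 200,
  maxRetries: 3,
  backoffBaseMs: 1000,
};
```

Keeping all of this in one typed structure means a later change -- swapping a deprecated voice, loosening the retry budget -- is a config edit, not a code change.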

A Practical Pipeline Architecture

Here is the flow that works reliably at scale. Each step is a discrete function that can be tested independently and replaced without affecting the rest of the pipeline:

Script (text)
  -> Segment splitter (by H2 headings or scene markers)
  -> SSML annotator (add pauses, emphasis)
  -> TTS API (per segment, with retry logic)
  -> Audio validator (check duration, detect silence gaps)
  -> Concatenator (FFmpeg concat demuxer)
  -> Loudness normalizer (-14 LUFS)
  -> Final audio track

If the TTS API fails on segment 7 of 12, you retry segment 7 only. If the audio validator detects a silence gap longer than 2 seconds mid-sentence, that segment gets regenerated. The rest of the segments remain cached and are not re-synthesized, saving both time and API credits.
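A minimal sketch of that per-segment orchestration, assuming a `synthesize` callback standing in for the TTS call and a `Map` playing the role of a persistent cache:

```typescript
type Segment = { index: number; text: string };

// Only uncached segments hit the API; a failed segment can be retried
// individually without touching the others.
async function runPipeline(
  segments: Segment[],
  synthesize: (s: Segment) => Promise<Buffer>,
  cache: Map<number, Buffer>,
): Promise<Buffer[]> {
  const out: Buffer[] = [];
  for (const seg of segments) {
    const hit = cache.get(seg.index);
    if (hit) {
      out.push(hit); // cache hit: no API call, no credits spent
      continue;
    }
    const audio = await synthesize(seg); // only this segment is (re)generated
    cache.set(seg.index, audio);
    out.push(audio);
  }
  return out;
}
```

Running the pipeline a second time against the same cache makes zero API calls, which is exactly the property you want when a downstream step (concatenation, normalization) fails and forces a re-run.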


The Segment Splitter

Never send your entire script as one API call. A 10-minute narration script sent as a single request creates a single point of failure and gives you zero control over pacing between sections. Split on logical boundaries -- section headers, scene transitions, or every 200 words, whichever comes first. Store each segment with metadata including its position in the final timeline, the expected voice parameters, and any SSML annotations specific to that segment.
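A sketch of that splitter, assuming a markdown script with `##` section headers and a 200-word cap per chunk (the metadata fields here are illustrative):

```typescript
type ScriptSegment = { position: number; text: string; wordCount: number };

// Split on H2 headings first, then cap each section at maxWords so no
// single API call carries too much of the script.
function splitScript(script: string, maxWords = 200): ScriptSegment[] {
  const sections = script.split(/^## /m).filter((s) => s.trim().length > 0);
  const segments: ScriptSegment[] = [];
  for (const section of sections) {
    const words = section.trim().split(/\s+/);
    for (let i = 0; i < words.length; i += maxWords) {
      segments.push({
        position: segments.length, // order in the final timeline
        text: words.slice(i, i + maxWords).join(" "),
        wordCount: Math.min(maxWords, words.length - i),
      });
    }
  }
  return segments;
}
```

A real splitter would also carry the voice parameters and per-segment SSML annotations mentioned above; the structure is the same, just with more fields on `ScriptSegment`.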

Retry Logic That Actually Works

async function synthesizeWithRetry(
  text: string,
  voiceId: string,
  maxRetries = 3
): Promise<Buffer> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const audio = await ttsApi.synthesize({ text, voiceId });
      if (validateAudio(audio)) return audio;
      console.warn(`Attempt ${attempt + 1}: audio validation failed, retrying`);
    } catch (err: any) {
      if (err.status === 429) {
        // Rate limited: back off exponentially (1s, 2s, 4s, ...)
        await sleep(Math.pow(2, attempt) * 1000);
        continue;
      }
      throw err; // non-retryable error: fail fast
    }
  }
  throw new Error(`Synthesis failed after ${maxRetries} attempts`);
}

The validation function checks for minimum audio duration (to catch truncated output), maximum silence duration within the clip (to catch skipped words), and expected duration range based on word count. At a normal narration pace of roughly 150 words per minute, a 100-word segment should produce about 40 seconds of audio -- if you get 10 seconds back, something went wrong.
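The duration-range part of that check is simple enough to sketch directly. The 150-words-per-minute figure and the ±50% tolerance are assumptions you would tune per voice:

```typescript
// Flag output whose measured duration falls outside the range implied by
// the segment's word count. Assumes ~150 wpm narration pace.
function durationInRange(
  wordCount: number,
  measuredSeconds: number,
  wordsPerMinute = 150,
  tolerance = 0.5, // accept +/-50% around the estimate
): boolean {
  const expected = (wordCount / wordsPerMinute) * 60;
  return (
    measuredSeconds >= expected * (1 - tolerance) &&
    measuredSeconds <= expected * (1 + tolerance)
  );
}
```

For 100 words the estimate is 40 seconds, so anything between 20 and 60 seconds passes and a 10-second clip is flagged for regeneration.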

Voice Cloning Considerations

If you are using a cloned voice -- which VidNo supports for maintaining channel identity -- store the voice model ID in your pipeline config rather than hardcoding it. Voice providers occasionally deprecate model versions, and you need to be able to swap without touching pipeline code. Version-pin your model selection so that provider updates do not silently change your output characteristics.
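One way to make deprecation survivable is an explicit fallback chain: prefer the cloned voice when the provider still offers it, otherwise degrade to a pinned stock voice instead of failing the whole pipeline. This is a sketch with illustrative names, not any provider's API:

```typescript
// Resolve the voice to use from config. `available` is the set of voice
// IDs the provider currently reports as valid (however you fetch that).
function resolveVoice(
  cfg: { clonedVoiceId?: string; stockVoiceId: string },
  available: Set<string>,
): string {
  if (cfg.clonedVoiceId && available.has(cfg.clonedVoiceId)) {
    return cfg.clonedVoiceId;
  }
  return cfg.stockVoiceId; // clone deprecated or unset: fall back
}
```

The fallback should also raise an alert somewhere visible -- silently switching voices mid-channel is exactly the inconsistency you are trying to avoid.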

Cost Control

Voice synthesis APIs bill per character. A 10-minute video script runs roughly 1,500 words or about 8,000 characters. At ElevenLabs pricing, that is roughly $0.24 per video at scale tier. Multiply by your publishing frequency to get your monthly voice budget. Caching helps significantly -- if you regenerate a video but the script has not changed, skip synthesis entirely and reuse the cached audio file. Implement a content hash on each segment so your pipeline can detect unchanged segments and avoid re-synthesizing them.
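The cache key should cover everything that affects the audio, not just the text -- otherwise a voice or model change serves stale clips from cache. A minimal sketch using Node's built-in crypto module:

```typescript
import { createHash } from "node:crypto";

// SHA-256 over text + voice + model version: any change to any of the
// three invalidates the cached segment.
function segmentCacheKey(
  text: string,
  voiceId: string,
  modelVersion: string,
): string {
  return createHash("sha256")
    .update(`${voiceId}\n${modelVersion}\n${text}`)
    .digest("hex");
}
```

Store synthesized audio under this key; before each synthesis call, check whether the key already exists and skip the API entirely on a hit.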

Audio-Video Synchronization

The synthesized audio dictates your video timeline. Generate audio first, measure its duration with FFmpeg's ffprobe, then build your video track to match. Going the other direction -- generating video first and fitting audio into it -- always produces awkward pacing where narration rushes through important sections and drags through simple ones. Audio-first production is a fundamental principle of narration-driven video.
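Measuring the duration is a one-liner with ffprobe. This sketch assumes ffprobe is on the PATH; the flags ask for the container duration with no wrapper text so the output parses directly:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// ffprobe prints just the duration in seconds, e.g. "12.345\n".
function parseFfprobeDuration(stdout: string): number {
  return parseFloat(stdout.trim());
}

async function audioDurationSeconds(path: string): Promise<number> {
  const { stdout } = await execFileAsync("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    path,
  ]);
  return parseFfprobeDuration(stdout);
}
```

Feed the measured durations back into your timeline builder so each video section is exactly as long as its narration.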