Definition
MOSS TTS is an open-source text-to-speech model that VidNo uses for local voice synthesis. Unlike cloud-based TTS services, which require uploading your script to external servers and charge per character, MOSS runs entirely on your local GPU: no data leaves your machine, and there is no per-use cost. The model is notable for its natural prosody, meaning the rhythm, stress, and intonation patterns that make speech sound human rather than robotic. It also handles technical vocabulary well, correctly pronouncing programming terms, framework names, and acronyms that trip up general-purpose TTS systems. MOSS supports voice cloning by fine-tuning on short audio samples of a target voice, so your videos can feature narration that sounds like you. VidNo integrates MOSS as the default TTS engine in its pipeline and automatically manages model loading, GPU memory allocation, and audio output formatting. The synthesized audio is rendered at broadcast quality and synced to the video timeline during the compositing stage.
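
To make the timeline-sync step more concrete, here is a minimal sketch of one way narration clips could be placed on a timeline once their durations are known. It is an illustration only and not VidNo's or MOSS's actual code: the names `Placement` and `place_clips`, the fixed pause between clips, and the assumption that clips are laid end to end are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one narration clip sits on the timeline (hypothetical type)."""
    index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def place_clips(durations, gap=0.25):
    """Lay clips end to end, separated by a fixed pause of `gap` seconds.

    `durations` holds the length of each synthesized clip in seconds.
    This is an assumed scheduling rule, not the one VidNo uses.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    placements = []
    cursor = 0.0
    for index, duration in enumerate(durations):
        if duration < 0:
            raise ValueError("clip durations must be non-negative")
        placements.append(Placement(index, cursor, cursor + duration))
        cursor += duration + gap
    return placements


if __name__ == "__main__":
    for p in place_clips([2.0, 3.5, 1.25], gap=0.5):
        print(f"clip {p.index}: {p.start:.2f}s to {p.end:.2f}s")
```

A real compositing stage would more likely align each clip to scene boundaries or on-screen events than use a fixed pause; the sketch only shows the bookkeeping of turning clip durations into start and end times.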