Clone Any Voice and Generate Natural Speech with VibeVoice in ComfyUI

You have a voice sample. You have a script. You want the audio to sound like that person actually read it: natural pacing, real delivery, not robotic TTS.

Voice sample in. Script in. Natural speech out.

Run it now on Floyo!

Need multiple speakers in one conversation? Use the multi-speaker version.

Why VibeVoice

Traditional TTS reads text mechanically. VibeVoice uses an LLM to understand context before generating audio, so it handles tone shifts, natural pauses, emotional delivery, and pacing the way a real person would.

Upload a sample. Type your script. Hit run. The output is a clean audio file ready to use as voiceover, narration, or dialogue.

LLM-powered context understanding before audio generation
handles long-form scripts up to ~90 minutes of audio
natural punctuation-driven pacing: periods, commas, and line breaks all affect delivery
load scripts from a .txt file for production workflows
pairs directly with lip sync workflows like MultiTalk

Key Inputs

Voice Sample

Upload an audio clip of the voice you want to clone. MP3 or WAV both work. You can also extract audio from a video file.

20+ seconds gives the best voice match
clean recordings produce the most accurate clones
background music, echo, or noise bleeds into the cloned voice

Shorter clips (5–10 seconds) work, but the voice match may drift, especially on longer output scripts.
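
If you want to check a clip's length before uploading, a minimal sketch using Python's standard wave module could look like the following. It only reads uncompressed WAV files (not MP3), and the function names are illustrative, not part of the workflow.

```python
import wave

# From the guidance above: 20+ seconds gives the best voice match.
MIN_RECOMMENDED_SECONDS = 20.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a PCM WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as clip:
        return clip.getnframes() / clip.getframerate()


def meets_recommended_length(seconds: float) -> bool:
    """True when the clip reaches the recommended 20-second minimum."""
    return seconds >= MIN_RECOMMENDED_SECONDS
```

A clip that fails the check can still be used. It just falls into the shorter-clip case described above, where the match may drift.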

Text

Type or paste your script. VibeVoice reads punctuation naturally:

periods create full stops
commas create brief pauses
question marks shift intonation upward
line breaks create longer pauses between sections

For longer scripts, enable the LoadTextFromFile node (bypassed by default) and point it to a .txt file in your ComfyUI directory.
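
For production workflows you may want to prepare the .txt file programmatically. The sketch below is one way to do that, assuming the node reads plain UTF-8 text. The save_script helper and the file name in the usage line are hypothetical, and the exact folder the node reads from depends on your setup.

```python
from pathlib import Path


def save_script(text: str, path: Path) -> int:
    """Tidy a script and write it as UTF-8 text. Returns the word count.

    Line breaks are kept, because they create longer pauses in delivery.
    Runs of spaces inside a line are collapsed to a single space.
    """
    lines = [" ".join(line.split()) for line in text.strip().splitlines()]
    cleaned = "\n".join(lines) + "\n"
    path.write_text(cleaned, encoding="utf-8")
    return len(cleaned.split())
```

Usage might look like save_script(my_script, Path("narration.txt")), with the path adjusted to wherever your ComfyUI installation expects input files.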

Model

VibeVoice-Large (18.7GB): default and recommended. Best quality, most natural delivery on complex scripts
VibeVoice-Large 8-bit (11.6GB): close to full quality, lower VRAM
VibeVoice-Large 4-bit (6.6GB): fits on less VRAM, slight quality drop
VibeVoice-1.5B (5.4GB): fastest, noticeably lower quality for complex delivery

Use Large unless VRAM is a constraint.
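
As a rough illustration of that rule, the sketch below walks the list in quality order and returns the first model that fits a VRAM budget. It treats the listed file size plus a fixed headroom as the memory needed, which is only an assumption: real usage depends on script length, settings, and your GPU. The 2 GB headroom value and the function name are hypothetical.

```python
# Models in the order listed above (best quality first), with their listed sizes in GB.
MODELS = [
    ("VibeVoice-Large", 18.7),
    ("VibeVoice-Large 8-bit", 11.6),
    ("VibeVoice-Large 4-bit", 6.6),
    ("VibeVoice-1.5B", 5.4),
]


def pick_model(vram_gb: float, headroom_gb: float = 2.0):
    """Return the highest-quality model whose size plus headroom fits, or None.

    Assumption: listed size + headroom approximates the VRAM needed.
    """
    for name, size_gb in MODELS:
        if size_gb + headroom_gb <= vram_gb:
            return name
    return None
```

Under these assumptions, a 24 GB card gets VibeVoice-Large and a 16 GB card gets the 8-bit variant.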

Key Settings

Voice Speed Factor: default 1.0. Keep between 0.95 and 1.05. Going further distorts output.
Diffusion Steps: default 20. Drop to 10–15 for quick previews. Raise to 25–30 for final production.
CFG Scale: default 1.3. Higher values stick closer to your text. Lower values let the voice's natural rhythm lead.
Temperature: default 0.95. Higher values add expressive variation. Lower values make output more consistent and predictable.
Max Words Per Chunk: default 250. The model chunks and stitches automatically for long scripts.
Seed: fixed at 42 by default for reproducible output. Switch to randomize for variation between runs.
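
If you track settings outside ComfyUI, for example in a batch script, the defaults and the preview/final split above can be written down as presets. This is a sketch only: the snake_case key names are made up for illustration and are not the node's actual parameter names, and the step counts 12 and 28 are simply picked from inside the suggested ranges.

```python
# Defaults as listed above. Key names are illustrative, not the node's real field names.
DEFAULTS = {
    "voice_speed_factor": 1.0,
    "diffusion_steps": 20,
    "cfg_scale": 1.3,
    "temperature": 0.95,
    "max_words_per_chunk": 250,
    "seed": 42,
}


def make_settings(**overrides):
    """Return the defaults with overrides applied, rejecting out-of-range speed."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    # The recommended speed range is 0.95 to 1.05; values outside it distort output.
    if not 0.95 <= settings["voice_speed_factor"] <= 1.05:
        raise ValueError("voice_speed_factor should stay between 0.95 and 1.05")
    return settings


PREVIEW = make_settings(diffusion_steps=12)  # quick previews: 10 to 15 steps
FINAL = make_settings(diffusion_steps=28)    # final production: 25 to 30 steps
```

Keeping the seed fixed in both presets means a preview and its final render differ only in step count.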

What This Is Great For

AI video voiceover: generate narration or dialogue and layer it onto AI-generated video clips. Pair with Wan or LTX video workflows, then combine in post or feed into a lip sync workflow.

Podcast and content production: write a script, clone a voice, generate the audio. For multi-speaker shows, run the workflow once per voice and combine the outputs, or use the multi-speaker workflow directly.

Audiobook and e-learning narration: long-form text works well. VibeVoice processes in chunks and keeps delivery consistent across extended scripts. Strong for course narration, guided content, and documentation.
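
To get a feel for how a long script might be divided under the Max Words Per Chunk setting, here is a minimal sketch of one common approach: greedy packing of whole sentences up to a word limit. This is an assumption for illustration, not the node's actual chunking algorithm, and the function name is hypothetical.

```python
import re


def chunk_script(text: str, max_words: int = 250) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    Note: this sketch flattens line breaks into spaces, which the real
    workflow may treat differently.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0

    def flush():
        nonlocal current, count
        if current:
            chunks.append(" ".join(current))
        current, count = [], 0

    for sentence in sentences:
        words = sentence.split()
        # A single sentence longer than the limit is hard-split by word count.
        while len(words) > max_words:
            flush()
            chunks.append(" ".join(words[:max_words]))
            words = words[max_words:]
        if count + len(words) > max_words:
            flush()
        current.extend(words)
        count += len(words)
    flush()
    return chunks
```

Splitting at sentence boundaries keeps each chunk's pacing intact, which is one plausible reason delivery can stay consistent across stitched segments.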

Character voices for games and animation: clone a voice sample, generate all lines from a script, iterate on delivery by adjusting temperature. Faster than booking studio time for every revision.

Lip sync pipelines: generate speech here, then feed into MultiTalk, FantasyTalking, or InfiniteTalk with a portrait image for matching mouth movement and facial expressions.

What to Watch Out For

Short or noisy voice samples produce less accurate clones. 20+ seconds of clean audio is the most important input quality factor.

English works best. Other languages may produce less natural results depending on the model version.

Singing is not supported. This workflow is for speech generation only.

Speed factor adjustments beyond 0.95–1.05 distort the output. Keep adjustments small.

For multiple speakers in one conversation, run the workflow once per speaker and edit the outputs together, or use the multi-speaker workflow.
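
If you go the one-run-per-speaker route, the per-speaker clips can be joined in any audio editor. As a minimal sketch, assuming the clips are exported as uncompressed WAV files with the same sample rate, bit depth, and channel count, Python's standard wave module can concatenate them in order. The function name is illustrative.

```python
import wave


def concat_wavs(inputs, output) -> None:
    """Concatenate WAV clips (paths or binary file objects) into one WAV, in order.

    All clips must share channel count, sample width, and sample rate.
    """
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input clip")
    expected = None
    with wave.open(output, "wb") as out:
        for source in inputs:
            with wave.open(source, "rb") as clip:
                fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
                if expected is None:
                    # The first clip defines the output format.
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError("clips must share the same audio format")
                out.writeframes(clip.readframes(clip.getnframes()))
```

This joins the clips back to back with no gap or crossfade, so any pauses between speakers need to be part of the clips themselves or added in an editor.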