Updated: May 9, 2026
This ComfyUI workflow is designed for Qwen3-TTS voice generation, preset speaker testing, custom voice creation, and reference-based voice cloning. It combines three practical audio generation routes in one workflow: Voice Clone, Custom Voice, and Design Voice. Instead of only offering a single text-to-speech path, this workflow gives creators several ways to generate voices depending on whether they want to clone an existing reference voice, use a preset speaker, or design a new voice through descriptive character instructions.
The workflow is built around Qwen3-TTS ComfyUI nodes and audio preprocessing tools. It includes FB_Qwen3TTSVoiceClone for reference-based voice cloning, FB_Qwen3TTSCustomVoice for preset speaker generation, FB_Qwen3TTSVoiceDesign for instruction-based custom voice design, MelBandRoFormer for vocal extraction, Whisper Large V3 for automatic transcription, LoadAudio for importing reference audio, PreviewAudio for quick listening, and SaveAudio for exporting final results.
The first route is Voice Clone. This route is useful when you already have a reference audio sample and want Qwen3-TTS to generate new speech in a similar vocal style. The workflow loads the reference audio, separates the vocal track with MelBandRoFormer, transcribes the reference voice with Whisper, and then passes the cleaned reference audio and reference text into the Qwen3-TTS voice clone node. This makes the workflow suitable for voice imitation tests, narration style transfer, character voice reuse, AI dubbing, and digital human voice production.
MelBandRoFormer is important in the cloning route because many reference audio samples are not perfectly clean. They may contain background music, room noise, ambience, or mixed sound effects. By extracting the vocal part before cloning, the workflow gives Qwen3-TTS a cleaner voice reference. This can improve speaker consistency, reduce unwanted background artifacts, and make the generated voice more stable.
Whisper transcription is also important. Voice cloning works better when the reference audio and reference transcript match. The Apply Whisper node automatically transcribes the extracted vocal audio, so users do not always need to manually type the reference text. This is especially useful for longer reference clips or audio samples taken from existing videos. However, for production results, it is still recommended to check the transcript and correct errors before final generation.
The second route is Custom Voice. This route uses a preset speaker option inside the Qwen3-TTS custom voice node. In the included workflow, the speaker setting uses a preset voice such as Eric, with Chinese language mode, bf16 precision, and a text prompt for direct speech generation. This route is useful when users do not want to upload a reference voice and only need a fast, stable TTS result from a predefined speaker style.
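The Custom Voice settings described above can be captured as a small configuration sketch. The field names below are illustrative assumptions, not the actual input names of the FB_Qwen3TTSCustomVoice node; check the node in ComfyUI for the real parameter list.

```python
# Hypothetical settings sketch for the Custom Voice route.
# Field names are illustrative; the real FB_Qwen3TTSCustomVoice
# node inputs may differ.
custom_voice_settings = {
    "speaker": "Eric",       # preset speaker used in the included workflow
    "language": "Chinese",   # language mode from the included graph
    "precision": "bf16",     # inference precision
    "seed": 42,              # fix the seed for repeatable delivery
    "text": "你好,欢迎使用 Qwen3-TTS。",  # target speech text
}

def validate_settings(settings: dict) -> bool:
    """Basic sanity check before queueing a generation."""
    required = {"speaker", "language", "precision", "seed", "text"}
    return required <= settings.keys() and bool(settings["text"].strip())
```

A quick validation step like this can catch an empty prompt or a missing field before a generation run is queued.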
The Custom Voice route is suitable for quick testing, short-form narration, rough audio drafts, social media voiceover, product explanation voice, tutorial speech, and stylized dialogue generation. Because it does not depend on an external reference audio file, it is easier to use for fast online generation. Users can change the text, speaker, language, seed, and instruction settings to test different vocal outputs.
The third route is Design Voice. This route is instruction-based. Instead of cloning an existing voice or selecting a fixed preset speaker, the user writes a voice profile describing the character identity, vocal texture, age, emotional state, personality, background, and speaking style. The FB_Qwen3TTSVoiceDesign node then generates speech according to this voice design prompt.
This is especially useful for original character creation. For example, the workflow can describe a senior strategic scientist with a deep, powerful, steady middle-aged male voice, strong emotional weight, national responsibility, and a serious speaking style. The TTS model then attempts to generate speech that matches the written character profile. This makes the Design Voice route valuable for AI short dramas, game NPC voice design, fictional character narration, audiobook prototypes, animation dubbing, and concept voice testing.
A key benefit of this workflow is that it separates three common voice-generation needs. If you want to copy the tone of a reference sample, use Voice Clone. If you want a fast preset speaker, use Custom Voice. If you want to invent a new character voice from text instructions, use Design Voice. This makes the workflow more flexible than a single TTS graph.
The workflow also includes PreviewAudio nodes so users can immediately listen to generated results inside ComfyUI. This is important for audio workflows because quality cannot be judged visually. Users need to hear pronunciation, rhythm, emotional tone, voice texture, pacing, and stability. PreviewAudio makes rapid iteration easier before saving the final audio.
The SaveAudio node is used to export audio output for later use in video editing, digital human workflows, lip-sync workflows, podcast production, narration assets, or AI short film production. Generated voice can be used as a standalone audio track, or combined with video-generation workflows such as talking head animation, character dialogue, or image-to-video scenes.
Main features:
- Qwen3-TTS audio generation workflow
- Three routes in one graph: Voice Clone, Custom Voice, and Design Voice
- Reference-based voice cloning
- Preset speaker text-to-speech generation
- Instruction-based custom voice design
- MelBandRoFormer vocal extraction
- Whisper Large V3 automatic transcription
- LoadAudio reference voice input
- PreviewAudio for quick listening
- SaveAudio for final export
- Supports Chinese voice generation
- Supports seed control and model choice settings
- Suitable for AI video voiceover, dubbing, narration, and character voice design
Recommended use cases:
Voice cloning tests, preset speaker TTS generation, custom character voice design, AI short video narration, digital human voiceover, game NPC voice prototyping, audiobook sample generation, AI drama dialogue, product explanation audio, Bilibili voiceover production, YouTube narration, character concept testing, lip-sync audio preparation, voice style comparison, and ComfyUI audio workflow research.
Suggested workflow:
Start with the Voice Clone route if you have a reference audio sample. Upload a clear reference voice through LoadAudio. Use a clip with clean speech, stable volume, and minimal background noise. If the source contains background music or ambience, MelBandRoFormer can help isolate the vocal part, but a clean source is still better.
After vocal separation, check the Whisper transcription. If the transcript is wrong, correct it before running the final voice clone generation. The reference text should match the reference audio as closely as possible. This improves stability and helps the model understand the reference speaker.
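One way to check the Whisper transcript against your own corrected text is a simple similarity ratio. This is a minimal sketch using Python's standard-library difflib; the threshold value is an assumption you should tune for your material.

```python
import difflib

def transcript_similarity(whisper_text: str, corrected_text: str) -> float:
    """Return a 0..1 similarity ratio between the automatic
    transcript and the human-corrected reference text."""
    a = whisper_text.strip().lower()
    b = corrected_text.strip().lower()
    return difflib.SequenceMatcher(None, a, b).ratio()

def needs_review(whisper_text: str, corrected_text: str,
                 threshold: float = 0.9) -> bool:
    """Flag the clip for manual review when the automatic
    transcript drifts too far from the expected text."""
    return transcript_similarity(whisper_text, corrected_text) < threshold
```

A low ratio usually means the reference audio is noisy, the clip is too long, or the transcript needs manual correction before cloning.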
Use shorter target text for early tests. One or two sentences are enough to check whether the cloned voice sounds stable. After you find a good seed and reference setup, you can test longer text. Long paragraphs may need to be split into smaller segments for better rhythm and fewer pronunciation problems.
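Splitting long paragraphs into segments can be done with a small helper like the sketch below. The punctuation set and character limit are assumptions; adjust them for your language and pacing.

```python
import re

def split_for_tts(text: str, max_chars: int = 120) -> list[str]:
    """Split a long paragraph into sentence-sized segments so each
    TTS call stays short and rhythmically stable."""
    # Split after sentence-ending punctuation (Western and CJK).
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?。!?])\s*", text)
                 if s.strip()]
    segments, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip() if current else sentence
    if current:
        segments.append(current)
    return segments
```

Each segment can then be generated separately and concatenated in post, which tends to give steadier rhythm than one very long prompt.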
Use the Custom Voice route when you want a quick preset speaker result. Select a speaker, enter the target text, choose the language, and generate. This is useful for fast preview, simple narration, rough production drafts, and situations where voice identity does not need to match a specific person.
Use the Design Voice route when you want to create an original voice. Write a detailed voice profile. Include role identity, age, gender presentation, vocal tone, speaking energy, personality, emotional state, and scene context. For example, a deep and authoritative scientist voice, a calm documentary narrator, a warm product presenter, or a sharp comedic character voice.
For better Design Voice results, make the instruction specific but not overloaded. Describe the voice clearly, but avoid too many conflicting traits. If you ask for a voice that is both old and young, calm and explosive, soft and extremely powerful, the output may become unstable. A focused character profile usually gives better results.
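A focused voice profile can be assembled from a few non-conflicting traits. The helper and wording below are illustrative, not part of the workflow; they simply show how to keep the instruction specific without overloading it.

```python
def build_voice_profile(role: str, age: str, tone: str,
                        emotion: str, style: str) -> str:
    """Assemble a focused Design Voice instruction from a few
    deliberately non-conflicting traits."""
    return (
        f"A {age} {role} with a {tone} voice. "
        f"Emotional state: {emotion}. Speaking style: {style}."
    )

profile = build_voice_profile(
    role="senior strategic scientist",
    age="middle-aged male",
    tone="deep, powerful, steady",
    emotion="strong emotional weight and a sense of national responsibility",
    style="serious and measured",
)
```

Keeping each slot to one coherent idea makes it easier to spot (and remove) contradictory traits before generation.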
Use seed control for repeatability. If you get a good voice style, keep the seed fixed. If the voice feels unnatural, randomize the seed and test again. The same text and instruction can produce different delivery styles depending on the seed.
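The value of fixing a seed can be illustrated with a tiny stand-in for a seeded generation call: the same seed always reproduces the same draw, while a different seed gives a different result. This is a conceptual sketch, not the TTS model's internal sampling.

```python
import random

def generate_with_seed(seed: int, n: int = 4) -> list[float]:
    """Stand-in for a seeded generation call: identical seeds
    yield identical pseudo-random draws, so a good result can
    be reproduced exactly."""
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(n)]
```

In practice, this is why locking the seed after a good take lets you regenerate the same delivery later, while randomizing it explores new ones.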
Preview the generated audio before exporting. Listen for pronunciation, pacing, emotional expression, noise, clipping, and whether the voice matches the intended role. If the output is too flat, strengthen the emotion in the instruction. If it is too exaggerated, simplify the instruction and reduce dramatic wording.
For production use, you can export the audio and then apply light post-processing such as loudness normalization, EQ, compression, or noise reduction. This can help the generated voice sit better inside videos, podcasts, short dramas, and lip-sync scenes.
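As a minimal example of such post-processing, peak normalization can be sketched in a few lines of pure Python; the 0.9 target peak is an assumption, and real pipelines would typically use loudness (LUFS) normalization in a DAW or with an audio library instead.

```python
def peak_normalize(samples: list[float], target_peak: float = 0.9) -> list[float]:
    """Scale audio samples so the loudest sample hits target_peak.
    A simple stand-in for loudness post-processing before export."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silent clip: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Normalizing generated clips to a consistent level helps them sit evenly against other tracks in a video or podcast mix.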
This workflow is designed for creators who need flexible Qwen3-TTS voice generation inside ComfyUI. It is not only a simple TTS template; it is a practical audio production workflow for cloning voices, testing preset speakers, and designing original character voices. It is useful for AI creators, video producers, game developers, digital human creators, audiobook testers, and anyone building voice assets for AIGC production.
Responsible use note: only clone voices that you own, have permission to use, or are legally allowed to reproduce. Do not use this workflow to impersonate real people without consent, mislead audiences, commit fraud, or create deceptive audio. For public-facing content, it is recommended to disclose when a voice is AI-generated.
🎥 YouTube Video Tutorial
Want to know what this workflow actually does and how to start fast?
This video explains what the tool is, how to launch the workflow instantly, and shares my core design logic — no local setup, no complicated environment.
Everything starts directly on RunningHub, so you can experience it in action first.
👉 YouTube Tutorial: https://youtu.be/iHM2VOtUAZ0
Before you begin, I recommend watching the video thoroughly — getting the full context helps you understand the tool faster and avoid common detours.
⚙️ RunningHub Workflow
Try the workflow online right now — no installation required.
👉 Workflow: https://www.runninghub.ai/post/2015365617691402241?inviteCode=rh-v1111
If the results meet your expectations, you can later deploy it locally for customization.
🎁 Fan Benefits: Register to get 1,000 points, plus 100 points per daily login, and enjoy RTX 4090-class performance with 48 GB of VRAM!
📺 Bilibili Updates (Mainland China & Asia-Pacific)
If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video: https://www.bilibili.com/video/BV132zxBsEAX/
☕ Support Me on Ko-fi
If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi: https://ko-fi.com/aiksk
💼 Business Contact
For collaboration or inquiries, please contact aiksk95 on WeChat.
I will keep updating model resources on Quark Netdisk:
👉 https://pan.quark.cn/s/20c6f6f8d87b
These resources are mainly for local users, to support creation and learning.

