Two-Person InfiniteTalk Native Loop Long-Duration Workflow

Updated: May 9, 2026


Type: Workflows
Reviews: 27
Published: May 9, 2026
Base Model: LTXV2
Hash (AutoV2): 2D8A115653

Creator: AIKSK

This ComfyUI workflow is designed for two-person InfiniteTalk native looping, dual-speaker talking video generation, and long-duration audio-driven character interaction. The main goal of this workflow is to generate a two-character dialogue video from a start frame and audio input, then continue the output through a native loop structure so creators can extend the video duration more naturally across multiple segments.

Unlike a simple single-person talking-head workflow, this graph is built for two speakers. It uses two speaker regions, two audio encoder outputs, character masks, InfiniteTalk multi-speaker model patching, previous-frame continuation, and repeated video generation stages. This makes it suitable for AI dialogue scenes, two-person digital human videos, interview-style content, virtual host conversations, short drama dialogue, character interaction videos, product explanation conversations, and long-form AI video narration.

The workflow is built around the Wan 2.1 InfiniteTalk multi-speaker pipeline. It uses a Wan video model route, UMT5 text encoder, Wan VAE, wav2vec2 audio encoder, InfiniteTalk multi-speaker model patch, start image input, two character masks, two audio encoder outputs, sampler control, continuation frames, and CreateVideo / SaveVideo output nodes. The central generation module is WanInfiniteTalkToVideo, which receives the model, InfiniteTalk model patch, positive and negative conditioning, VAE, audio features, start image, previous frames, speaker masks, width, height, video length, motion frame count, and audio scale.
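To make the wiring easier to picture, here is a minimal sketch, written as plain Python data rather than the shipped graph, of what one generation request to WanInfiniteTalkToVideo carries. The field names follow the description above; the exact widget names and all of the example values are assumptions, so check them against the actual node in ComfyUI.

```python
# Sketch only: field names mirror the description above; exact ComfyUI widget
# names and the example values are assumptions, not the shipped defaults.
segment_request = {
    "node": "WanInfiniteTalkToVideo",
    "inputs": {
        "model": "wan2.1_video_model",            # Wan video model route
        "infinitetalk_patch": "multi_speaker",    # InfiniteTalk multi-speaker model patch
        "positive": "two people speaking naturally, stable camera",
        "negative": "distorted mouth, flickering face",
        "vae": "wan_vae",
        "audio_features": "wav2vec2_features",    # from the audio encoder branch
        "start_image": "start_frame.png",
        "previous_frames": None,                  # empty for the first segment
        "speaker_masks": ["mask_1.png", "mask_2.png"],
        "width": 832,                             # assumption: pick what your GPU handles
        "height": 480,
        "video_length": 81,                       # frames per segment (assumption)
        "motion_frame_count": 9,                  # assumption
        "audio_scale": 1.0,                       # assumption
    },
}
```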

The key feature of this workflow is dual-speaker native continuation. In many AI video workflows, a two-person scene is difficult to maintain because the model may not know which person should speak, which mouth should move, or how to keep both characters stable. This workflow solves that problem by using speaker-specific masks and audio encoder outputs. Character 1 and Character 2 can each have their own mask region, allowing the model to understand where each speaking area is located.

The workflow includes instructions for drawing masks in the ComfyUI MaskEditor. Users upload the start frame, open the image in MaskEditor, then draw the mask for Character 1. The same process is repeated for Character 2. These masks are important because they define the active speaker regions. Without clear masks, the model may move the wrong face, animate both characters at the wrong time, or create unstable mouth movement.

The model patch used in this workflow is the InfiniteTalk multi-speaker patch. This is different from a single-speaker setup. The multi-speaker route is designed for dialogue scenarios where more than one character needs audio-driven motion. The workflow uses the two_speakers mode inside WanInfiniteTalkToVideo, with inputs for audio_encoder_output_1, audio_encoder_output_2, mask_1, and mask_2. This allows the graph to handle two separate speaker controls inside the same video scene.
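As a small illustration of that routing, the speaker-specific inputs can be thought of as the mapping below. The key names come from the paragraph above; the values are placeholders and may be spelled differently on the actual node.

```python
# Placeholder values; key names are taken from the description of the node's
# two_speakers mode and may differ slightly in the actual graph.
two_speaker_routing = {
    "mode": "two_speakers",
    "audio_encoder_output_1": "character_1_speech_features",  # drives Character 1's mouth
    "mask_1": "character_1_face_mask",
    "audio_encoder_output_2": "character_2_speech_features",  # drives Character 2's mouth
    "mask_2": "character_2_face_mask",
}
```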

The audio encoder is loaded through AudioEncoderLoader, using wav2vec2-chinese-base_fp16. The audio encoder extracts speech features from the input audio. These features are then used to drive the character mouth movement and facial performance. For a two-person dialogue scene, clean audio is especially important. If the audio contains overlapping speakers, heavy background music, echo, or unclear speech, the model may have a harder time deciding which character should respond.
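If you want a quick local check that a dialogue track is in a sensible shape before encoding, a small standard-library script like the one below can help. The 16 kHz mono expectation is an assumption based on typical wav2vec2 models, not a documented requirement of this workflow, and the script is not part of the graph itself.

```python
# Quick sanity check on a dialogue track before feeding it to the audio encoder.
# Assumption: wav2vec2-style encoders generally expect 16 kHz mono input; confirm
# against the loader's own requirements before relying on this.
import wave

def check_wav(path: str) -> None:
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    print(f"{path}: {channels} channel(s), {rate} Hz, {seconds:.1f} s")
    if channels != 1 or rate != 16000:
        print("  consider converting to 16 kHz mono before encoding")

check_wav("dialogue_speaker1.wav")
check_wav("dialogue_speaker2.wav")
```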

The workflow can be used for two-person dialogue, but users should prepare audio carefully. A clean dialogue track, segmented speaker audio, or clearly separated voice arrangement will usually produce better results than a noisy mixed track. If the workflow is set up with two separate audio encoder outputs, each speaker’s audio should correspond to the correct character mask. This helps the model connect Character 1’s voice to Character 1’s face, and Character 2’s voice to Character 2’s face.

The start image defines the entire visual scene. It should contain two visible characters with clear faces, readable mouth areas, and stable composition. A two-person front-facing or semi-front-facing scene usually works better than extreme angles. Both faces should be large enough for the model to animate. If one person is too small, blocked, blurred, or heavily turned away from the camera, the lip movement may become weak or unstable.

The text prompt controls the overall visual behavior. For this type of workflow, the prompt should usually be stable and conservative. It should describe two people speaking naturally, subtle head movement, stable camera, clear facial expressions, consistent lighting, and natural dialogue performance. Avoid prompts that introduce excessive body motion, camera shake, complex action, or identity-changing style shifts, because those can reduce continuity in long videos.

The negative prompt should suppress common dialogue-video problems. Useful negative terms include wrong speaker movement, both mouths moving at the same time, distorted mouth, bad lip sync, flickering face, unstable eyes, broken teeth, duplicated mouth, deformed jaw, face drift, identity change, exaggerated head motion, camera shake, blurry face, and unnatural expression. For two-person videos, it is also useful to suppress speaker confusion and unwanted background motion.
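As a concrete starting point, here is one way the positive and negative prompts could be written, assembled from the wording suggested above; treat it as an example to adapt, not a required prompt.

```python
# Example prompt pair assembled from the guidance above; adapt to your own scene.
positive_prompt = (
    "two people speaking naturally, subtle head movement, stable camera, "
    "clear facial expressions, consistent lighting, natural dialogue performance"
)
negative_prompt = (
    "wrong speaker movement, both mouths moving at the same time, distorted mouth, "
    "bad lip sync, flickering face, unstable eyes, broken teeth, duplicated mouth, "
    "deformed jaw, face drift, identity change, exaggerated head motion, camera shake, "
    "blurry face, unnatural expression, speaker confusion, unwanted background motion"
)
```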

The workflow uses SamplerCustomAdvanced, CFGGuider, BasicScheduler, KSamplerSelect, and RandomNoise for the generation process. This gives users control over seed variation, guidance strength, scheduler behavior, and sampling. The included setup uses a compact sampling structure, which is practical for iterative testing. Two-person dialogue generation often requires several tests to find the best mask, audio, and seed combination.
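For readers who want to see where the main sampling knobs live, the stack can be summarized as plain data like this. The node and widget names are the standard ComfyUI ones mentioned above, while the example values are assumptions to tune during testing rather than the workflow's shipped defaults.

```python
# Example values only; the workflow ships its own defaults, so adjust during testing.
sampling_stack = {
    "RandomNoise": {"noise_seed": 123456},                                   # vary between tests
    "KSamplerSelect": {"sampler_name": "euler"},                             # assumption
    "BasicScheduler": {"scheduler": "simple", "steps": 20, "denoise": 1.0},  # assumption
    "CFGGuider": {"cfg": 5.0},                                               # guidance strength (assumption)
}
```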

Native looping is the second key feature. Instead of generating one isolated short clip, this workflow supports continuation through previous frames. The previous_frames input allows a later generation stage to continue from the visual state of an earlier segment. This is important for long-duration output because it helps reduce hard resets between clips. The workflow can generate one segment, use its ending or previous frames as continuity context, and then generate the next segment.

This is why the workflow is useful for “infinite duration” production. It does not mean a single generation can run forever. In practical production, the user splits a long dialogue or narration into segments, generates each section, then uses previous-frame continuation to keep the visual flow connected. This gives creators a repeatable method for building longer AI talking videos while maintaining stronger continuity than manual looping.
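The segment-by-segment pattern can be sketched in a few lines of Python. generate_segment() here is a hypothetical stand-in for one run of the generation stage plus decoding; in the real workflow this happens inside ComfyUI, not in a script.

```python
MOTION_FRAME_COUNT = 9  # assumption: how many tail frames to carry into the next segment

def generate_segment(start_image, previous_frames, audio_chunk):
    # Hypothetical stand-in for one WanInfiniteTalkToVideo run followed by VAEDecode.
    # Returns placeholder frame labels just to illustrate the data flow.
    return [f"{audio_chunk}:frame_{i}" for i in range(25)]

audio_chunks = ["segment_01.wav", "segment_02.wav", "segment_03.wav"]  # pre-split dialogue
all_frames, previous_frames = [], None
for chunk in audio_chunks:
    frames = generate_segment("start_frame.png", previous_frames, chunk)
    all_frames.extend(frames)
    previous_frames = frames[-MOTION_FRAME_COUNT:]  # continuation context for the next segment
```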

The workflow includes two generation stages and video creation nodes. One stage can create the first dialogue segment from the start frame, masks, and audio features. Another stage can continue from previous frames. The outputs are decoded through VAEDecode and converted into video through CreateVideo or SaveVideo. The final output can include audio, making it ready for preview or publishing.

The output fps is set to 25 in the CreateVideo nodes. Keeping the same fps across all generated segments is important. If the fps changes between segments, the audio may drift, the mouth timing may feel wrong, or the transition between clips may become unstable. For long-form generation, users should keep fps, resolution, prompt style, and mask setup consistent across all segments.
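A small self-check kept outside the graph can make the consistency rule harder to forget. Only the 25 fps figure comes from the workflow itself; the resolution values here are placeholders.

```python
# The 25 fps value comes from the CreateVideo nodes; width/height are placeholders.
FPS = 25
WIDTH, HEIGHT = 832, 480

def check_segment(settings: dict) -> None:
    # Fail fast if a later segment drifts from the first segment's output settings.
    assert settings["fps"] == FPS, "fps changed between segments; audio timing will drift"
    assert (settings["width"], settings["height"]) == (WIDTH, HEIGHT), "resolution changed"

check_segment({"fps": 25, "width": 832, "height": 480})
```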

The workflow also includes model notes for local users, covering Wan diffusion model files, UMT5 text encoder, InfiniteTalk single and multi model patches, wav2vec2 audio encoder, Wan VAE, and LightX2V LoRA. This makes the graph more useful for both RunningHub online users and local ComfyUI users who want to deploy the workflow themselves.

Main features:

- Two-person InfiniteTalk native loop workflow

- Dual-speaker audio-driven video generation

- WanInfiniteTalkToVideo two_speakers mode

- InfiniteTalk multi-speaker model patch support

- Two character mask control

- Two audio encoder output support

- Start image to two-person talking video

- Previous-frame continuation for longer video generation

- Native loop / continuation structure

- wav2vec2 Chinese audio encoder support

- UMT5 Wan text encoder support

- Wan VAE decoding support

- SamplerCustomAdvanced generation pipeline

- CFGGuider and BasicScheduler control

- CreateVideo and SaveVideo output

- Suitable for interviews, dialogue scenes, short dramas, and virtual host conversations

Recommended use cases:

Two-person digital human video, AI interview scene, dual-host talking video, virtual host conversation, short drama dialogue, character interaction video, AI podcast-style video, product explanation conversation, two-character animation dubbing, social media dialogue clip, Bilibili creator content, YouTube talking video, long-form AI narration, InfiniteTalk loop testing, RunningHub workflow publishing, and Civitai video workflow demonstration.

Suggested workflow:

Start by preparing a clean two-person start image. Both characters should be clearly visible, with readable faces and mouth areas. Avoid images where one face is too small, blocked by objects, heavily blurred, or turned too far away from the camera. The more stable the start image, the easier it is for the workflow to maintain identity across long outputs.

Next, draw masks for both characters. Open the start image in MaskEditor and draw the mask for Character 1. Then repeat the process for Character 2. The masks should cover the speaker regions clearly. In most cases, include the face and upper head area, and optionally part of the neck or upper body if subtle motion is needed. Do not make masks too large unless you want more body movement.

Prepare the audio carefully. For best results, use clean dialogue audio with clear speech. If the workflow is using two separate audio encoder outputs, make sure each audio input corresponds to the correct character. Character 1’s audio should drive Character 1’s mask. Character 2’s audio should drive Character 2’s mask. If the audio assignment is wrong, the wrong person may speak.

Write a stable prompt. The prompt should describe two people speaking naturally, facing the camera or facing each other, with subtle head movement, stable lighting, consistent background, and natural mouth motion. For long videos, avoid adding too many changing scene details, because every extra style change can increase drift.

Use a clear negative prompt. Suppress wrong speaker movement, both mouths moving together, distorted mouths, flickering, identity drift, face deformation, unstable eyes, bad teeth, excessive head motion, camera shake, and background distortion. These are common problems in two-person talking video workflows.

Generate a short first segment. Do not start with a long video immediately. Test a short clip first to confirm that the correct character moves with the correct audio, both faces remain stable, and the masks are working. If the wrong character moves, check the mask and audio connections. If both characters move too much, refine the masks and reduce unnecessary motion language in the prompt.

Use the native loop continuation section for longer output. After generating the first segment, use previous frames from the end of that segment as context for the next segment. This helps the next clip continue from the earlier visual state instead of restarting from the original still image. This is the core method for building long-duration output.

Split long dialogue into smaller audio sections. Generating shorter connected segments is usually more stable than trying to generate a very long clip in one pass. If one segment fails, only that segment needs to be regenerated. This is also better for controlling lip sync, face stability, and transition quality.
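If you prefer to pre-split the dialogue outside ComfyUI, a short script along these lines works. It assumes pydub (and ffmpeg) are installed, and the 15-second chunk length is only an example.

```python
# Splits a long dialogue track into fixed-length chunks for per-segment generation.
# Assumes pydub + ffmpeg are available; the chunk length is an example value.
from pydub import AudioSegment

dialogue = AudioSegment.from_file("full_dialogue.wav")
chunk_ms = 15_000  # 15 seconds per segment
for i, start in enumerate(range(0, len(dialogue), chunk_ms)):
    dialogue[start:start + chunk_ms].export(f"segment_{i + 1:02d}.wav", format="wav")
```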

Keep settings consistent across segments. Use the same resolution, fps, prompt style, masks, and similar sampler settings. If these change too much, the final long video may show visible differences between segments. Consistency is more important than aggressive motion when building long-form dialogue videos.

Check transition points carefully. Look at eye direction, mouth shape, face identity, head position, lighting, and background. If the transition feels abrupt, use a better previous-frame selection or shorten the segment length. A good continuation should feel like the same scene is continuing, not restarting.

For interview-style scenes, keep the characters stable and avoid excessive camera movement. For short drama scenes, allow more expression but keep masks accurate. For virtual host dialogue, use clean lighting and clear frontal composition. For product conversation videos, keep the background simple and focus on speech clarity.

This workflow is designed for creators who need a practical two-person InfiniteTalk pipeline inside ComfyUI. It combines dual-character masking, audio feature extraction, InfiniteTalk multi-speaker generation, previous-frame continuation, sampler control, VAE decoding, and final video output into one workflow. It is especially useful for building longer AI dialogue videos without manually reconstructing every segment from scratch.

🎥 YouTube Video Tutorial

Want to know what this workflow actually does and how to start fast?

This video explains what the tool is, shows how to launch the workflow instantly, and walks through my core design logic. No local setup, no complicated environment.

Everything starts directly on RunningHub, so you can experience it in action first.

👉 YouTube Tutorial: https://youtu.be/OjsHOyPtF0s

Before you begin, I recommend watching the video thoroughly — getting the full context helps you understand the tool faster and avoid common detours.

⚙️ RunningHub Workflow

Try the workflow online right now — no installation required.

👉 Workflow: https://www.runninghub.ai/post/2018267797146570753/?inviteCode=rh-v1111

If the results meet your expectations, you can later deploy it locally for customization.

🎁 Fan Benefits: Register to get 1,000 points plus 100 points per daily login, and enjoy RTX 4090 performance with 48 GB of GPU memory!

📺 Bilibili Updates (Mainland China & Asia-Pacific)

If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.

📺 Bilibili Video: https://www.bilibili.com/video/BV1mwFLzTELL/

☕ Support Me on Ko-fi

If you find my content helpful and want to support future creations, you can buy me a coffee ☕.

Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.

👉 Ko-fi: https://ko-fi.com/aiksk

💼 Business Contact

For collaboration or inquiries, please contact aiksk95 on WeChat.


📦 Model Resources (Quark Drive)

I keep updating model resources on Quark Drive:

👉 https://pan.quark.cn/s/20c6f6f8d87b

These resources are mainly intended for local users, to support creation and learning.