
Wan 2.2 Video + Voice + Motion Control All-In-One workflow optimized for RTX 3060 12 GB VRAM GPU

Rating: 5 (76 reviews)
Author: arkinson


Updated: Jan 26, 2026

tool

video 12 gb vram wan2.2 audio rtx 3060


wan22 Video Audio v2.0.json.zip

26.87 KB

Verified: 6 months ago


Details

Type: Other
Stats: 848
Reviews: Positive (46)
Published: Jan 23, 2026
Base Model: Other
Hash (AutoV2): BE763F19DE
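To check a download against the AutoV2 hash listed above, a minimal sketch follows. It assumes that AutoV2 is the first 10 hexadecimal digits of the file's SHA-256 digest, which is how this hash type is commonly described, and that the listed value refers to the downloaded archive itself; neither point is confirmed by this page. The function names are hypothetical.

```python
import hashlib


def autov2_hash(path, chunk_size=1 << 20):
    """Return the first 10 hex digits (uppercase) of the file's SHA-256.

    Assumption: this truncation is what the listing calls "AutoV2".
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:10].upper()


def matches_listing(path, expected="BE763F19DE"):
    """Compare a local file against the hash shown on the page."""
    return autov2_hash(path) == expected.upper()
```

If `matches_listing` returns `False` for the archive, it may be worth trying the extracted JSON file before assuming the download is corrupt, since it is not stated which file the hash covers.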

About this version


arkinson

Joined Apr 7, 2023

[Edit 23.01.2026: please use the latest version, v2.0 (see the version description).

Workaround for a small issue in the audio part of v2.0: go to the bottom right of the ‘01 Audio...’ group, move the ‘Any Switch’ node out of the subgroup ‘01.1.3’ into a free area of the ‘01’ group, and make sure the node is not bypassed.

I will fix this in the next version.]

Special thanks to:

@boinobin730 for a lot of testing, sharing knowledge and pushing this project 🙂

@SeoulSeeker for sharing his knowledge and giving the first crucial hints.

Features:

This workflow uses InfiniteTalk to generate videos of a talking or singing person or object. The resulting video is controlled by a start image, an audio source (speech, voice or song) and a control video that guides the general movement. I designed it as an all-in-one workflow. You just need a start image and, optionally, an audio and/or video source.

- Works perfectly on an RTX 3060 with 12 GB VRAM and 32 GB RAM plus a large swap file (at least 64 to 128 GB).

- Easy installation (all necessary models linked).

- Easy to use via switch options.

- High-quality outputs.

The workflow includes 4 simple steps:

1. Audio generation or loading,

2. Video generation or loading for DWPose motion control,

3. InfiniteTalk: generates the final LQ video output (guided by DWPose and synchronised to the audio),

4. Upscaling and frame rate multiplication for smooth HQ outputs.
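The four steps and their "generate or load" switches can be sketched as a small planning function. This is only an illustration of the control flow described above, not code from the workflow: `WorkflowInputs`, `plan_stages` and the field names are hypothetical, and the real switching happens through nodes inside ComfyUI.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkflowInputs:
    """Hypothetical container for the inputs the description mentions."""
    start_image: str
    audio_file: Optional[str] = None     # None -> step 1 generates the audio
    control_video: Optional[str] = None  # None -> step 2 generates the video


def plan_stages(inputs: WorkflowInputs) -> List[str]:
    """Return the four steps in order, resolving each load/generate switch."""
    if not inputs.start_image:
        raise ValueError("a start image is required")
    audio = "load audio" if inputs.audio_file else "generate audio"
    video = "load control video" if inputs.control_video else "generate control video"
    return [
        f"1. {audio}",
        f"2. {video} (DWPose motion control)",
        "3. InfiniteTalk LQ video (pose-guided, audio-synchronised)",
        "4. upscale and multiply frame rate (HQ output)",
    ]
```

For example, `plan_stages(WorkflowInputs("face.png", audio_file="song.wav"))` selects loading in step 1 and generation in step 2; the file names here are placeholders.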

Videos of around 5 seconds work well. Longer videos (around 10 seconds) are possible, but you may quickly run into known video issues such as looping movements, OOM errors, etc.
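To get a feeling for how clip length and frame rate multiplication translate into frame counts, here is a rough arithmetic sketch. The base frame rate of 16 fps in the usage note is an illustrative assumption, not a value taken from this workflow, and the interpolated frame count uses one common convention (new frames are inserted only between existing neighbours); the actual nodes may round or pad differently.

```python
def frame_count(seconds: float, fps: float) -> int:
    """Number of frames needed for a clip of the given length."""
    return round(seconds * fps)


def interpolated_fps(fps: float, multiplier: int) -> float:
    """Frame rate after multiplying; the clip duration stays the same."""
    return fps * multiplier


def interpolated_frames(frames: int, multiplier: int) -> int:
    """Frame count if in-between frames are added between each neighbouring pair.

    Assumption: this convention, not necessarily what the workflow's node does.
    """
    return (frames - 1) * multiplier + 1
```

Under these assumptions, a 5-second clip at 16 fps needs `frame_count(5, 16)` = 80 generated frames, and a 2x multiplier gives 32 fps with 159 frames. A 10-second clip doubles the number of frames the generation step has to produce, which is one reason longer clips are more likely to run into OOM errors.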

This workflow is quite advanced now - I would say it is in early beta status. Technically, everything should work, so I believe it is a good basis for more advanced tests and hopefully some fun 🙂

I intend to integrate the Step Audio EditX engine as soon as possible, for easy-to-use advanced audio control via tags. At the moment, however, there are some issues with the corresponding nodes.

A next step might be the integration of camera control.

Attention:

This workflow is intended for advanced ComfyUI users. Even though installation and usage should be simple, it is currently a basis for testing and development, and you may need some ComfyUI knowledge to use it. Please understand that I will not provide basic installation or ComfyUI support here.

If you are a beginner with video generation and more complex workflows, I recommend my other workflow for video generation. It has been well tested and is already much better documented and commented.

About the basics:

This workflow is based on official templates and various already published workflows. I just put the different parts together, created a hopefully easy-to-use “design” and optimized everything for 12 GB VRAM.