Using Wan with Vace to make a transformation video

Introduction

In the old days (3 months ago), if you wanted to make a transformation video with AI, you had to use one of the online sites that provided begin and end frame options for their videos. Sites like Kling and Pixverse were popular choices. However, setting up a video this way is tricky with AI-generated images, since it's very, very difficult to make two images with similar enough backgrounds, clothing, and all the other details that make for a consistent progression. With inconsistent end frames, the video usually included weird warping and objects popping in and out of visibility.

Now, with the many tools that work with Wan 2.1 video creation, it's much easier to create a consistent video. I've made several clips using this method so I know it works. It sometimes works on the first or second try. It sometimes takes a lot more tries to get the clip just right. Luckily, with ComfyUI, all it costs is time. You have unlimited generation credits on your home computer.

Overview

My process has three parts:

  1. Create a starting image

  2. Use an image-to-video workflow to build a beginning clip

  3. Extend the clip from step 2 with a transformation sequence and any follow-up

Using Wan with Vace, it's possible to do these steps easily. By starting with a single image and working from there, you can be sure that the background and other details will stay consistent from beginning to end. I use a separate workflow for each step. That may sound complicated, but it's really just three tabs in my browser. And it's much easier to use those extra tabs than to figure out how to make one do-everything workflow in ComfyUI. Since I'm often running each step multiple times to get the result I have in mind, it doesn't make much sense to have it all in one workflow anyway.

Part 1: Use your favorite image generation tool. For Flux images, I use my Daily Driver workflow.

Part 2: You can use native Wan image to video, but I like Wan+Vace. The workflow I use for that is here.

Part 3: To extend a video, I recommend my Wan+Vace Extend Video workflow.

Step One: Starting off

Pick your favorite image tool and generate a starting image. If you're going to use Wan with Vace, you should keep in mind that Vace is more restrictive than Wan about what resolutions it will support. If you try to use something non-standard, it will just give you an error. It will save you some frustration if your starting image uses the same width:height ratio that you will want in your video clip. You can scale the image easily, but you may be unhappy if you find out you have to crop it to make it fit.

This list is probably not complete, but the ratios I know of for Vace, with example sizes, are:

  1. square - 720x720, 1024x1024

  2. 2:3 - 480x720, 768x1152

  3. 15:26 - 480x832

I suggest making an image that's bigger than you want for your video, as long as you use the same overall ratio. If you want your video to be 480x720, Flux won't make as nice an image at that resolution. You're better off using 768x1152 and then scaling the image down. If you use my workflows, the scaling is automatic if you enter the final height you want for your clip.
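If you want to script that scaling step yourself, here's a minimal sketch using the Pillow library. The file names and sizes are placeholder examples; my workflows handle this automatically.

```python
# A minimal sketch of "generate big, scale down", assuming Pillow
# is installed (pip install Pillow). Scaling by height keeps the
# width:height ratio intact, which is what Vace cares about.
from PIL import Image

def scale_to_clip_height(path: str, target_height: int) -> Image.Image:
    """Downscale an image to the clip height, preserving aspect ratio."""
    img = Image.open(path)
    scale = target_height / img.height
    target_width = round(img.width * scale)
    return img.resize((target_width, target_height), Image.LANCZOS)

# Example: a 768x1152 Flux image scaled down for a 480x720 clip
scale_to_clip_height("flux_start_image.png", 720).save("start_480x720.png")
```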

Step Two: First clip

What to put in the video? That's up to you. For a very quick scene, you could prompt for a transformation right away in the first video. Personally, I prefer to begin the scene with a little motion from the static image before getting to the transformation. I like to make a video that's only 2 or 3 seconds long just to set the scene, but you could make a longer clip if you have a story to tell with this part.

I recommend making a video that's 16fps. This is the default frame rate for Wan 2.1, and it will make extending the video much easier if you stick to it until your clip is complete. Once it's all done, you can add smoothing and change the frame rate if you want. My image-to-video and extend video workflows all generate both 16fps and 32fps output, so you can keep extending from the 16fps file and publish the smoother 32fps version when you're done.
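If you ever want to do that final smoothing pass yourself outside of ComfyUI, motion interpolation with ffmpeg is one option. This is just a sketch assuming ffmpeg is installed on your system, not what my workflows do internally:

```python
# A minimal sketch: bump a finished 16fps clip to 32fps using ffmpeg's
# motion-compensated frame interpolation. Assumes ffmpeg is on your PATH;
# the file names are placeholders.
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "final_clip_16fps.mp4",
    "-vf", "minterpolate=fps=32:mi_mode=mci",  # synthesize in-between frames
    "final_clip_32fps.mp4",
], check=True)
```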

Step Three: Transformation

Whether you want your transformation in the first clip or in an extended clip, the process is the same, but the details will be slightly different for each video.

Do you have a LoRA that provides the final result you want in your scene? Whether it's clothing or a different body shape or whatever, you should add the LoRA for that effect to your clip generation at this stage. You may need to boost the strength of the LoRA more than you would to generate a plain text-to-video clip. For example, if you normally use your LoRA at 1.0, you may find that it takes from 1.25 to 1.75 to give you the effect you want. Every LoRA comes out with a slightly different effective strength, so you'll have to experiment. You can render small videos (like 320x480) as quick samples for testing.

If you don't have a LoRA, you will need to achieve as much as you can with prompting. In that case, you may have better luck NOT using CausVid or Lightx2v, since both of those tools want a CFG of 1.0 and may not give your prompt enough strength to get the job done. You'll have to try it and see. It may depend on how strong an effect you want to achieve.

Next, whether you have a LoRA or not, you may be helped by using a reference image. This is a feature of the Vace generation process. If you provide a reference image with the background removed, Vace will attempt to guide your video towards that image. By using a reference image that looks like the end result of your transformation, you will encourage Vace to move your character in that direction during the video. My Vace workflows support reference images and will remove the background for you automatically.
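If you'd like to prepare a background-free reference image yourself rather than relying on the automatic removal, here's a minimal sketch using the rembg package (pip install rembg); the file names are placeholders.

```python
# A minimal sketch: strip the background from a transformation target
# image so it can be used as a Vace reference. Assumes the rembg package.
from rembg import remove
from PIL import Image

ref = Image.open("transformation_target.png")
ref_no_bg = remove(ref)  # returns an RGBA image with the background removed
ref_no_bg.save("reference_image.png")
```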

Step Three Continued: Prompting

I'm still new at this, so I'm sure there are other ways to get this done. I'm only sharing what I've found works for me.

The biggest challenge with transformation clips is that the AI engine very often wants to move from point A to point B in a flash, while we as viewers often want to see the process more slowly. If you want the "in a flash" effect, that's a whole different prompting challenge that I won't go into right now. In order to produce a gradual change, I've found some phrases seem to work more often than not.

  • The (man/woman) very slowly changes during the course of the scene.

  • (His/Her) (name your body part or object) very slowly becomes (name your target state) throughout the scene. Example: "Her hair very slowly grows very long throughout the scene."

In general, it's best to only spend two or three sentences on the actual transformation process. If the prompt is too long, each individual part of the prompt will not have as much weight. Better to get right to the point. My prompt structure usually looks something like this:

A (man/woman) is in a (whatever location). (insert transformation process sentences here). (spend the rest of the prompt describing what you want the scene to look like AT THE END of the sequence).

If your goal is to turn a short woman with curly brown hair into a tall skinny blonde woman with long straight hair, then you should ONLY describe the woman in her final state. Talk about her very long hair, how tall and skinny she is, her golden blonde hair, etc. Wan/Vace will work to achieve that description in the clip. Your other sentences about "very slowly changing" will help to make that a gradual process. If it helps to see the structure spelled out, there's a small sketch below.
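Here's a hypothetical helper that assembles a prompt the same way. The function and parameter names are my own invention, not part of any workflow:

```python
# A hypothetical sketch of the prompt structure above; nothing here is a
# real workflow input, it just makes the three-part layout explicit.
def build_transformation_prompt(subject, location, change_sentences, end_state):
    parts = [f"A {subject} is in a {location}."]
    parts += change_sentences  # two or three "very slowly" sentences
    parts.append(end_state)    # describe the scene AT THE END of the sequence
    return " ".join(parts)

print(build_transformation_prompt(
    "woman", "locker room",
    ["She very slowly changes during the course of the scene.",
     "Her hair very slowly grows very long throughout the scene."],
    "She is tall and skinny with very long, straight, golden blonde hair.",
))
```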

In addition to prompting, you will also likely see better results from using CausVid or Lightx2v; I know I have when making these videos. I don't know precisely why, but it seems to me that those two performance boosters also restrict quick motion and change in video clips. That's a bit of a problem if you're making a regular scene, since it can be hard to coax dynamic action out of a CausVid video. But in this case, that quirk is working in your favor. Adding CausVid to your LoRA stack will make it more likely that you'll see the transformation play out as a process instead of as a sudden warp.

Step Four and onwards: Continuation

If you've achieved the transformation you wanted, or at least most of it, now it's up to you to decide how much more of the story you want to play out. Is the character happy or sad or excited or what? What do they do next? With my Extend Video workflow, I've had success continuing to extend a video 3 or 4 times without any noticeable loss in quality. Being able to seamlessly meld with existing motion from the earlier clips is hugely important. You're only limited by your own time and patience.

At each step, keep these things in mind: You should describe the ACTION that you want to see DURING the video. You should describe the APPEARANCE that you want to see by the END of the video. That's true for both LoRA and prompting at every stage.

If you use my workflow to extend your video, be sure to keep using the 16fps output as your source video for the next step. When you've added as many clips as you want, you can just grab the 32fps output file from your last extension and use that for publishing. Or take the 16fps version and do your own post-processing or whatever.

Example

Goal

Create a scene with a regular-looking woman in a locker room wearing a home-made Supergirl costume. Suddenly she transforms into the real Supergirl.

Step 1

Simple Flux prompt: "a woman is standing in a locker room wearing a homemade supergirl costume. She is short and plain-looking. Her hair is brown."

Step 2

Using my Wan Vace Image-to-Video workflow...

  • Image input from Step 1

  • Prompt: "A woman wearing a simple supergirl costume is in a locker room. She looks down at her costume then looks back up at the camera with a smile."

  • LoRAs: CausVid

  • Extension time: 4 seconds

  • Steps: 10

  • CFG: 1.0

Step 3

Using my Wan Vace Extend Video workflow. I have a Wan character LoRA for Supergirl, so I use that to help achieve the full body transformation more easily.

  • Source video: 16fps output from Step 2

  • Reference image: I chose a Supergirl image with a full front view from existing images I had from my Supergirl LoRA. This isn't really needed since I have the LoRA but I was testing all the features of the workflow. For example, this scene was made without using a reference image: https://civitai.com/images/88430231

  • Prompt: "supergrl. A woman in a locker room wearing a supergirl costume slowly changes into supergirl. She very slowly changes throughout the scene. Her hair very slowly grows long and turns blonde. Her body very slowly changes. Her face very slowly changes. She looks down at her body in shock and amazement. She touches her body with her hands in disbelief."

  • LoRAs: supergirl-wan-v1 (https://civitai.com/models/1617118/supergirl-wan) at 1.75 strength, CausVid

  • Extension time: 5 seconds

  • Steps: 10

  • CFG: 1.0

I stopped there, but I could have continued the scene if I'd wanted. For future clips, I would either turn down the Supergirl LoRA to 1.0 or even turn it off entirely, depending on what I want from the scene. With a reference image, Vace probably doesn't need much help from the LoRA to continue the scene. Without a reference image, the LoRA might help with actions that Vace would have difficulty guessing. This choice is up to you and depends on what tools you have and what you want in your scene.

Epilogue

I hope you've found this guide helpful. Please let me know in the comments, and post links to your creations if you try out this process. I'd love to see what other people come up with.

Good luck!

Examples

See the "Attachments" section for the sample files. It's at the top right of this window just below the Table of Contents.

  1. I have attached the files used to make the Wishing Well breast growth sequence. Each file has a ComfyUI workflow embedded. If you use ComfyUI, you can drag each file into your workspace to see all of the prompts and settings I used for each segment of the video. The files are the 16fps full video at each stage, so it's the same material, just more and more of it with each extension. The final file is the 16fps version of the published video linked above.
