[A1111] AnimateDiff with IPAdapter and OpenPose

First of all, this workflow consumes a lot of VRAM. When I created a short video, VRAM usage reached about 16GB once the ControlNet computation was loaded. If you don't have enough VRAM, it's better to use the ComfyUI method.

A1111 with AnimateDiff

The installation is similar to ComfyUI. Please install the following extensions first.

It is also highly recommended to install FreeU:

FreeU

AnimateDiff models

Motion models and Motion LoRA files can be downloaded here:

https://huggingface.co/guoyww/animatediff/tree/main

In addition, here are a few recommended motion models that can be used to generate animations.

These model files need to be placed in this folder:
extensions/sd-webui-animatediff/model

If you are downloading a Motion LoRA, you need to put it here:
extensions/sd-webui-animatediff/model/Lora

If you need to use ControlNet, put the ControlNet model files you downloaded here:
models/ControlNet

Of course, your main model needs to be in models/Stable-diffusion. If you have a VAE, put it in models/VAE.
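As a quick reference, the folder layout described above can be sketched as a small lookup. This is only an illustration: the function name, the `kind` labels, the WebUI root folder, and the example file name are assumptions made for this sketch, while the sub-folders are the ones listed in this post.

```python
from pathlib import Path

# Sub-folders relative to the WebUI root, as listed in this post.
# The dictionary keys are hypothetical labels used only in this sketch.
DESTINATIONS = {
    "motion_module": Path("extensions/sd-webui-animatediff/model"),
    "motion_lora": Path("extensions/sd-webui-animatediff/model/Lora"),
    "controlnet": Path("models/ControlNet"),
    "checkpoint": Path("models/Stable-diffusion"),
    "vae": Path("models/VAE"),
}


def destination_for(webui_root: str, kind: str, filename: str) -> Path:
    """Return the path where a downloaded file of the given kind belongs."""
    if kind not in DESTINATIONS:
        raise ValueError(f"unknown kind: {kind!r}")
    return Path(webui_root) / DESTINATIONS[kind] / filename


if __name__ == "__main__":
    # The root folder and file name below are example values only.
    print(destination_for("stable-diffusion-webui", "motion_module", "motion_model.ckpt"))
```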

AnimateDiff's WebUI

Once everything is installed, you will see an AnimateDiff panel in the WebUI:

Enable AnimateDiff: Check this box to enable the extension.
Motion module: Selects the motion model to use.
Save format: The output format. Check TXT if you also want to save the text information.
Number of frames: Default is 0. When you use Video source or Video path, it is calculated automatically and filled in. Otherwise, it is derived from the Context batch size. Please don't use a number close to the Context batch size; the extension author explains why in #213.
FPS: The number of frames per second.
Display loop number: If you output a GIF, this is how many times it replays. The default, 0, means it loops continuously.
Context batch size: How many frames AnimateDiff processes at a time. The default, 16, is a good value. Please note that different motion models limit the maximum value of this number.
Closed loop: Tries to make a looping animation. There are four algorithms. Note that it does not work when the Number of frames is less than or equal to the Context batch size.
- N: No closed loop. This is valid when Number of frames (other than 0) is less than the Context batch size.
- R-P: Reduces the amount of context in the loop animation and does not interpolate when using Prompt Travel.
- R+P: Reduces the amount of context in the loop animation and interpolates when using Prompt Travel.
- A: Connects the first frame to the last frame to form the loop, and interpolates when using Prompt Travel.
Stride: This one is hard to explain. Its main purpose is to keep AnimateDiff as consistent in time between frames as possible. The default is 1. See the original author's description of WebUI Parameters for more details. As in ComfyUI, this setting seems to have no effect on vid2vid.
Overlap: The number of frames that overlap between the groups of frames AnimateDiff processes. The default, -1, uses Context batch size / 4. This setting only takes effect if Number of frames is greater than the Context batch size or Number of frames is 0.
Frame Interpolation: Default is OFF. If you choose FILM, Deforum is used to interpolate frames after AnimateDiff finishes.
Interp X: When Frame Interpolation = FILM, roughly X output frames are generated for each input frame. This stretches the whole video, so you need to speed it up after output.
Video Source: You can upload a video here.
Video Path: If you don't have a video to upload, you can enter a path here instead. You need to split the video into frames and put them in that folder.
Move motion module to CPU (default if lowvram): When VRAM is low, moves the motion module to system memory.
Remove motion module from any memory: Unloads the motion module from memory entirely.
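The Overlap rule described above can be written down as a minimal sketch. It only restates this post's description; the function names are hypothetical, and the use of integer division for "Context batch size / 4" is an assumption, not something taken from the extension's source code.

```python
def effective_overlap(context_batch_size: int, overlap: int = -1) -> int:
    """Resolve the Overlap setting; -1 means 'Context batch size / 4'.

    Integer division is an assumption made for this sketch.
    """
    if overlap == -1:
        return context_batch_size // 4
    return overlap


def overlap_applies(number_of_frames: int, context_batch_size: int) -> bool:
    """Overlap only matters when Number of frames is 0 or exceeds the context."""
    return number_of_frames == 0 or number_of_frames > context_batch_size


if __name__ == "__main__":
    # With the default Context batch size of 16 and the 50 frames used in this post.
    print(effective_overlap(16), overlap_applies(50, 16))
```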

After you upload a video, some of the parameters above are filled in automatically.

ControlNet

Next, we need to prepare two ControlNet units:

OpenPose
IPAdapter

For IPAdapter, I chose the ip-adapter-plus_sd15 model and set the weight to 0.7, so that an overly high weight does not interfere with the output.

In addition, I prepared as many OpenPose skeleton images as there are frames in the uploaded video and placed them in the /output/openpose folder for this ControlNet unit to read.

If you don't know how to generate the skeleton images, you can also use a ControlNet preprocessor such as DWPose to convert the images into skeletons.

If you use the preprocessor, keep in mind that the batch input images must be exported from your video with another tool, such as FFmpeg. You need to export all frames from your video.
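One way to export every frame with FFmpeg is sketched below. The input file name, the output folder, the numbered PNG pattern, and the function name are all example choices for this sketch, not settings from this post. The sketch only builds the command; run it yourself once the paths look right.

```python
import shlex
from pathlib import Path


def build_frame_export_command(video_path: str, output_dir: str) -> list[str]:
    """Build an FFmpeg command that writes each frame as a numbered PNG."""
    # %05d is FFmpeg's image sequence pattern: 00001.png, 00002.png, ...
    pattern = (Path(output_dir) / "%05d.png").as_posix()
    return ["ffmpeg", "-i", video_path, pattern]


if __name__ == "__main__":
    # "input.mp4" and "frames" are example names only.
    command = build_frame_export_command("input.mp4", "frames")
    print(shlex.join(command))
    # To actually run it (the output folder must already exist):
    #   import subprocess
    #   subprocess.run(command, check=True)
```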

Text2Image Settings

Please pay attention to the Batch size, which is explained briefly here.

The difference between A1111 and ComfyUI is that ComfyUI counts the whole process as a single batch, so in ComfyUI you need to set the Batch size to the number of frames read in; otherwise the animation will not be smooth.
In A1111, the number of drawing steps follows the Number of frames read by the AnimateDiff extension and the ControlNet OpenPose source you prepared. In this article's example, that gives 50 drawing steps.
If your Batch size / Batch count is set to 1, T2I runs only 50 times in total.
For example, in the video that follows, the Batch size is set to 4, which means the T2I process generates 50 x 4 = 200 images in total.
~~Since AnimateDiff has modified i2ibatch, AnimateDiff will take these 200 images and run its algorithm on them in the image processing step (whether this is actually the case requires a detailed look at the source code; it's just my personal speculation at the moment).~~

According to the extension's GitHub write-up, changing the Batch size currently has no effect; it may be supported in the future.

You do not need to change batch size at all when you are using this extension. 

We are currently developing approach to support batch size on WebUI in the near future.

Please keep the Seed setting at a fixed value, because you don't want to output pictures with completely different styles.

ADetailer

Finally, we use ADetailer to repair the face.

I lowered ADetailer's Inpaint denoising strength to 0.25 to make sure the face doesn't get overpainted and cause flickering.

Generating

For the prompt, since we are using IPAdapter, we can drop the IPAdapter image into PNG Info to recover its prompt and then modify that prompt as needed.

Finally, let's take a look at the output:

Animate GIF: https://i.imgur.com/2rNP0SX.gif

These three videos were processed separately. The output FPS is 16 for all of them, and the settings that differ are as follows:

Frame Interpolation = NO, Batch Size = 1
Frame Interpolation = FILM, Batch Size = 4
Frame Interpolation = FILM, Batch Size = 4, Stride = 4

You will find that the overall smoothness of the animation is significantly better with a higher Batch Size.

Batch Size

When you don't use vid2vid but generate a GIF from text alone, your Batch Size (not the Batch Count) determines the number of frames in the GIF, which is also explained in the author's Batch Size section.

If you are using ControlV2V, increasing the Batch Size a little will give you good results, so it is worth experimenting with.

Frame Interpolation = FILM / Interp X

With Frame Interpolation = FILM turned on, I used the default value of Interp X = 10 for this post. The Number of frames read from the original video is 50, and we end up with a GIF file that contains a total of 491 frames.

In other words, about 10 frames are generated for each frame of the original video, so the final GIF plays as a slow-motion video.

The actual code does this:

film_in_between_frames_count = calculate_frames_to_add(len(frame_list), params.interp_x)

Source code: film_in_between_frames_count.
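The frame count reported in this post (50 frames in, 491 frames out with Interp X = 10) can be reproduced with a small sketch. The formula is an assumption about how the in-between count is derived, chosen because it matches those numbers; it is not copied from the Deforum source, and both function names are hypothetical.

```python
def in_between_frames(total_frames: int, interp_x: int) -> int:
    """Assumed number of new frames inserted between each pair of frames."""
    if total_frames < 2:
        raise ValueError("need at least two frames to interpolate between")
    # Assumption: the extra frames are spread evenly over the gaps, then rounded.
    extra = total_frames * interp_x - total_frames
    return int(round(extra / (total_frames - 1)))


def interpolated_frame_count(total_frames: int, interp_x: int) -> int:
    """Original frames plus the in-between frames added to every gap."""
    gaps = total_frames - 1
    return total_frames + gaps * in_between_frames(total_frames, interp_x)


if __name__ == "__main__":
    # Values from this post: 50 input frames, Interp X = 10.
    print(in_between_frames(50, 10), interpolated_frame_count(50, 10))
```

Under this assumption, 9 frames are inserted into each of the 49 gaps, which gives 50 + 49 x 9 = 491 frames, a little less than a full tenfold increase.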

So the original video is about 3.2 seconds long, and after interpolation it is stretched to about 30.2 seconds. You will need another tool to shorten the video back to its original length.

If you have FFmpeg installed, you can use setpts to speed up the video. For the video in this post, the following command brings it back to the same speed as the original.

ffmpeg -i input.mp4 -filter:v "setpts=0.1*PTS" output.mp4
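If your frame counts differ from the ones in this post, the setpts factor can be derived from them instead of being hard-coded. The sketch below assumes the output plays at the same FPS as the original, so the factor is simply the ratio of the two frame counts; the function names are hypothetical, and the file names are the ones used in the command above.

```python
import shlex


def setpts_factor(original_frames: int, interpolated_frames: int) -> float:
    """Timestamp multiplier that restores the original duration.

    Assumes both videos play at the same FPS.
    """
    if original_frames <= 0 or interpolated_frames <= 0:
        raise ValueError("frame counts must be positive")
    return original_frames / interpolated_frames


def build_speedup_command(src: str, dst: str, factor: float) -> list[str]:
    """Build the FFmpeg command that rescales the video timestamps."""
    return ["ffmpeg", "-i", src, "-filter:v", f"setpts={factor:.4f}*PTS", dst]


if __name__ == "__main__":
    # Frame counts from this post: 50 original frames, 491 after FILM.
    factor = setpts_factor(50, 491)
    print(shlex.join(build_speedup_command("input.mp4", "output.mp4", factor)))
```

For the numbers in this post the ratio is 50 / 491, which is close to the 0.1 used in the command above.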

Additional Settings

If you're using --xformers and you're having problems, you can adjust the settings of AnimateDiff to use sdp instead, which may solve the problem.

Another point is that because it uses vid2vid with t2i, if you have problems generating images, turning on this setting may solve them, and it will also bring a little performance optimization.

Conclusion

Using AnimateDiff in A1111 is actually not very different from using it in ComfyUI. The main difference is that A1111 has packaged the intermediate steps that you would otherwise need to link together yourself, which can save some time.

If VRAM is really tight, it's better to switch to ComfyUI. Otherwise, you will have to reduce the length (total frames) of each run and then use video editing software to join the resulting files.