Newer Guide/Workflow Available https://civitai.com/articles/2379
This is an older guide that does things a different way. Some of the detailed discussion in it is still worthwhile, but it is dated.
So, you want to do some Vid2Vid AI conversions! I hope this guide helps you get started or improve your workflow. I am going to try to be as thorough as possible, and if you are new, I have provided an example that you can work through with me to help you on your way. I also hope this helps with terminology. If you are a seasoned Vid2Vid maker, I have summarized the settings at the end of the post (search for “advanced discussion”).
[Like everything with SD, things are changing all the time. I have put things for me to do in square brackets; you can give me input as well, as I am happy to update this over time.]
There are lots of ways to do video conversions using Stable Diffusion. Here I describe my method, which I think gives the cleanest transformations. There are, however, other options:
1/Tokyo_Jab – has the best outline for a method using EBSynth and TemporalKit. (https://www.reddit.com/r/StableDiffusion/comments/11zeb17/tips_for_temporal_stability_while_changing_the/)
2/Deforum – Open source and good at developing trippy videos – I have not used this too much (https://github.com/deforum-art/deforum-stable-diffusion)
3/Warpfusion – This is paid, so I have never used it. It has some of the best optical flow implementation for large movements, but I have not yet seen a clean transformation from it: everything is always changing and shifting. I think it incorporates something like EBSynth, as the same kind of shifting occurs if you pay attention.
For a look at what you can create using my method have a look at my YouTube (https://www.youtube.com/@Inner-Reflections-AI)
Defining the problem – The Types of Flicker and possible solutions in SD created Videos:
Before we fix the problem, we need to define the problem. If you are completely new to SD video conversions you can come back to this section later to help you troubleshoot/understand why things did not work out perfectly.
1/Concept flickering – Before we had Controlnet there was little hope for the kind of conversion that I was looking to do. No two frames looked similar, or you went with a classic morphing Deforum effect. You can google some old Deforum videos to see what I mean. [If anybody has a link to a good example of this please comment below]
The Solution: The great solution here is using controlnet to help guide the transformation. That is what this guide is about! However, it relies on the preprocessors, whose output depends directly on the quality of your input video. The other solution involves locking the seed.
2/Location flickering – SD builds each frame from random noise, and because we have locked the seed that noise is the same in every frame. As things move through the frame they land on different parts of the noise, so they can shift dramatically. Watch the poster in this video: https://youtube.com/shorts/3zruIV-Ac1Q . You can see that as the camera moves the poster changes shape.
The Solution: There is no perfect one and this is a big barrier to flicker-free videos. Warpfusion uses an optical flow method which helps (but causes its own instability, I think). The EBSynth method allows for fairly seamless transitions here but does not allow for quick movement from what I can see. If your character is moving enough it can compensate for the flicker that occurs here if you keep things stable. Once again – good source material (i.e. a stable camera) can help a lot with this.
Another theoretical solution is the img2img alternative test (it’s a script in the img2img tab), which uses the source image to generate the noise for the conversion. I have never made it work well, however (the default settings are also really bad – check the sigma noise option – it improves things a lot). It ends up making everything seem more random.
3/Prompt Flickering – This is something I only recently realized. I used to think that prompting was much less important than I do now. With Vid2Vid conversions you want to control the new information you add, especially if the model you use does not have a consistent concept of what it’s changing. You can also sometimes stop flickering by describing what is there, though this can have the unintended effect of stopping something from transforming. [To Do: Find or make a good example of this]
The Solution: Understand that prompting for video is not the same as doing image prompts. Some prompts may cause instability on their own and require you to troubleshoot them. You can often copy prompts done for images and then adapt them for videos. Longer or shorter prompting does not necessarily make a difference, but I usually use shorter prompts as it’s easier to identify what might be causing an issue.
What you need:
If you want to follow along with the tutorial you will need
1/A1111 (if you need a guide to do this go to https://stable-diffusion-art.com/ and look at the quick start guide – I would advise doing some regular txt2img and prompting to get familiar with this)
2/Controlnet Extension – Download at least the IP2P, lineart models and Temporalnet (Available at https://huggingface.co/CiaraRowles/TemporalNet [Download "diff_control_sd15_temporalnet_fp16.safetensors"; also download "cldm_v15.yaml" and rename it "diff_control_sd15_temporalnet_fp16.yaml", as this will prevent an error showing up each time you use it (it still works without)])
3/ Realistic Vision 5.1 (https://civitai.com/models/4201/realistic-vision-v51 ) - you will need to make sure A1111 is using this model for this tutorial
4/Flowframes (https://nmkd.itch.io/flowframes )
5/Note that my graphics card has 12GB of VRAM. I checked my usage and I am using 10GB or so with all 4 controlnets enabled.
One setting you need to change in A1111 is “With img2img, do exactly the amount of steps the slider specifies”: you want this on.
The other setting you need to change is the one that allows multiple controlnets. In this case we use up to 4.
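To give a feel for why the first setting matters, here is a small sketch of how I understand img2img step counts to behave. It is a simplified model written for illustration, not A1111's own code: the function name `effective_steps` and the exact rounding are assumptions.

```python
def effective_steps(slider_steps: int, denoising_strength: float, exact_steps: bool) -> int:
    """Approximate number of sampling steps actually run in img2img.

    Assumption: with the option off, the step count is scaled down by the
    denoising strength; with it on, the slider value is used as-is.
    """
    if exact_steps:
        return slider_steps
    return int(min(denoising_strength, 0.999) * slider_steps)


# Compare both modes for a 20-step slider at a few denoising strengths.
for strength in (0.5, 0.75, 1.0):
    off = effective_steps(20, strength, exact_steps=False)
    on = effective_steps(20, strength, exact_steps=True)
    print(f"denoise={strength}: option off -> {off} steps, option on -> {on} steps")
```

Under this model, turning the option on means that changing the denoising strength does not silently change how many steps each frame gets, which makes comparisons between runs easier.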
Choosing a good Video/Preparing for conversion:
For our tutorial we will be choosing this copyright free video from pexels (a great website to get source material): (https://www.pexels.com/video/a-woman-in-yellow-top-wrapping-a-yellow-sweatshirt-around-her-shoulders-3761571/ )
What makes a good video? I am sure there is more detail to this than I know. But I will list the following:
1/ Camera shot - a stationary camera is definitely on the list. Some camera motion laterally is workable too. A camera rotating around something is very difficult (due to location flickering noted above). Less movement and slow movements are also easier.
2/Camera distance – If you work with SD enough you know that if your subject is too small in the image you get a less than beautiful face. Video is no exception (see the second scene in my Harry Potter conversion and compare it with the shot of Hermione https://youtu.be/CHdSp5nz6W0?si=s1-iwZAkT6zaAXTS ). You can compensate for faces with an extension called ADetailer, but you do have to be careful with the settings or it can be a source of flicker.
3/Textures/Patterns – Simple textures and patterns are best, especially if they are moving around. This is because we are preprocessing the frames for controlnet automatically. If the preprocessors start picking things up inconsistently, that will be a source of flickering in and of itself. Also, SD loves to change small things, and if something looks like something you prompted for, it will choose to convert that too.
4/Clear subject – SD (1.5 at least) is heavily trained on portrait photography and anime. Multiple people are usually not too much trouble. The clearer the subject(s) are, the better.
You will need a way to split the video into frames. There are several ways to do this. I typically use a saver node in Davinci Resolve. EZGif is a good website I have used previously (https://ezgif.com/video-to-png ). For the tutorial I have split and reduced the video for you (it was 50 fps!) (https://drive.google.com/file/d/1A5N-_VIsilnBlYT__xFViwybrWINVVig/view?usp=sharing )
You can then delete frames so that your frame rate ends up around 6-15 fps, based on preference. For this example I have reduced the frames to 1/3 of the original, so ~16 fps.
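The frame reduction above can be sketched in a few lines of Python. This is only an illustration: the helper names and the frame file names are hypothetical, and in practice you would copy or delete the real files rather than just listing names.

```python
from typing import List, Sequence


def keep_every_nth(frames: Sequence[str], step: int) -> List[str]:
    """Return every `step`-th frame name, starting with the first."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(frames[::step])


def resulting_fps(source_fps: float, step: int) -> float:
    """Frame rate after keeping one frame out of every `step`."""
    return source_fps / step


# Hypothetical frame names; the real frames pack may be named differently.
frames = [f"frame_{i:04d}.png" for i in range(1, 11)]
print(keep_every_nth(frames, 3))
print(round(resulting_fps(50, 3), 2))
```

Keeping one frame in three of a 50 fps source gives roughly 16.67 fps, which is where the ~16 fps figure comes from.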
[Commentary: It is worth mentioning here that one way to reduce flicker is by decreasing the frame rate. 6 fps may be acceptable for some uses (background, slowly moving subject). EBSynth uses this method by requiring keyframes every so often and interpolating the frames in between. There is still flicker, but if it is happening every ½ second or so it can be barely noticeable. Please be wary of any ‘new tech demo’ or method that uses a slow-moving example, especially combined with a short duration video. In my experience most of these do not seem to pan out.]
Preparing a base conversion/settings:
When you start making Vid2Vid conversions you may be tempted to do one of two things. The first is to use a very low denoising strength – this results in an ‘anime’ style image, but it is simply an artifact of Stable Diffusion blurring the image/person. The other is to think that in order to do a conversion you have to use a denoising strength of 1 – this may result in more instability than you need for the transformation that you want.
Similarly, you may be tempted to max out controlnet strength – after all, it would seem to make sense that you want the output to follow the source video as closely as possible, right? All you end up doing is making it harder for the AI to convert the video, and you get into a battle between denoising strength and your controlnet settings. (That said, perhaps my current settings are a bit low in some places – it takes a lot of trial and error to figure out, and perhaps you will have the next big breakthrough.)
Choosing an output resolution is actually also really important. Making high resolution output frames will increase stability but also reduce the overall level of conversion. This may require more denoising strength to compensate.
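When trying different output resolutions it helps to keep the aspect ratio of the source and to use sizes that are multiples of 8, which is what SD 1.5 works with. A minimal sketch, assuming a hypothetical helper `fit_resolution` and example source sizes that are not taken from the tutorial video:

```python
from typing import Tuple


def fit_resolution(width: int, height: int, short_side: int = 512, multiple: int = 8) -> Tuple[int, int]:
    """Scale (width, height) so the shorter side is about `short_side`,
    rounding both sides to the nearest multiple of `multiple`."""
    scale = short_side / min(width, height)

    def snap(value: float) -> int:
        # Round to the nearest multiple, never going below one multiple.
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(width * scale), snap(height * scale)


# Example source sizes are illustrative only.
print(fit_resolution(1920, 1080))        # landscape source, short side 512
print(fit_resolution(1080, 1920, 768))   # portrait source, short side 768
```

Raising `short_side` is the knob to experiment with here: a larger value should give more stability but may need more denoising strength to get the same level of conversion.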
The level of stability as I currently see it is a combination of your prompt, controlnet settings, resolution and denoising strength. What you are looking to do is to create a space for the AI to follow your instructions without coloring outside the lines so to speak.
For those following along with my tutorial: if you downloaded the frames pack you will see an input keyframes folder. You can load the base controlnets png in PNG Info and send it to img2img for the base settings. You will be doing a batch conversion, so set this up.
Unfortunately, the controlnet settings won’t auto-populate, so you will have to enter them manually. For all controlnets, tick Enable and Pixel Perfect (I am not 100% sure how necessary pixel perfect is, but it’s what I use). You will be starting with 2 controlnets enabled:
1/IP2P set at balanced and Control weight of 0.35 (change it to batch mode – I am not certain this is necessary but I do so anyway)
2/TemporalNet set at balanced and weight of 0.4 (batch mode also)
For those who did not download the files
Prompt: a marble statue, female, (tanktop:0.5)
Negative prompt: nipples
I chose this prompt because RealisticVision has a very clear and consistent concept of marble statues. The other parts are there to keep this SFW.
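Since the controlnet settings do not travel with the png, it can help to keep your own record of each preset. The sketch below is one way to write the base settings down in Python; the class and field names are hypothetical and are not the A1111 or ControlNet API, only a notebook-style record of the values given above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ControlNetUnit:
    """One controlnet slot as described in this guide (names are illustrative)."""
    name: str
    weight: float
    control_mode: str = "Balanced"
    pixel_perfect: bool = True
    batch_mode: bool = True


@dataclass
class Vid2VidPreset:
    """Prompt plus the controlnet units used for one conversion."""
    prompt: str
    negative_prompt: str
    units: List[ControlNetUnit] = field(default_factory=list)


base = Vid2VidPreset(
    prompt="a marble statue, female, (tanktop:0.5)",
    negative_prompt="nipples",
    units=[
        ControlNetUnit("IP2P", 0.35),
        ControlNetUnit("TemporalNet", 0.4),
    ],
)

for unit in base.units:
    print(f"{unit.name}: weight={unit.weight}, mode={unit.control_mode}")
```

Keeping a record like this makes it easier to change one thing at a time and to know exactly which settings produced a given output folder.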
Without any Controlnets you will make a video like this (https://youtu.be/AgW_z1Lx24M )
With your base conversion you should end up with something like this (https://youtu.be/bByEsGYgqlI )
In my current workflow I only very rarely change these two controlnets. This is the time to work with your prompt and denoising strength, and the time to be creative! Denoising strength usually ends up between 0.5 and 1, typically around 0.75. If you are at 1 and cannot get the transformation you are looking for, you can try increasing the CFG scale, but if you are pushing it this far you are usually not going to end up with a very stable result. If you cannot make it work, it is possible that what you want to do is not achievable with my method or this video. It is fine to have a video that is still not fully stable at this point.
Ironing out the issues – adding further stabilizers:
This is the point where we start troubleshooting our ideas and going for as much temporal consistency as we can. You seldom can make something perfectly stable with the base above but it’s the jumping off point so that you are only changing 1 thing at once. In this circumstance I decided to add the base lineart processor. For a discussion on these I have put a dedicated section on this below.
This is the point where you can add or remove things from your prompt. Sometimes the AI does not ‘recognize’ something that you don’t want changed. You can prompt for it but beware it may change how the transformation works. This is why you see me reduce the strength of some of the keywords. If you want to see something horrifying add ‘hair’ to the positive prompt and suddenly your otherwise marble statue will have a regular head of hair.
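Adjusting keyword strength uses the (keyword:weight) syntax seen in the prompt earlier. When you are testing many variations, a small helper can make the change less error prone. This is a minimal sketch: the function name is hypothetical, it only handles simple comma-separated prompts, and the 0.3 and 0.8 values are example weights, not recommendations.

```python
import re


def set_keyword_weight(prompt: str, keyword: str, weight: float) -> str:
    """Set the weight of `keyword` in a comma-separated prompt,
    using the (keyword:weight) syntax."""
    weighted = f"({keyword}:{weight:g})"
    pattern = re.compile(r"\(\s*" + re.escape(keyword) + r"\s*:\s*[0-9.]+\s*\)")
    if pattern.search(prompt):
        # The keyword already has a weight: replace it.
        return pattern.sub(lambda _: weighted, prompt)
    # Otherwise wrap the bare keyword if it appears as its own term.
    parts = [part.strip() for part in prompt.split(",")]
    return ", ".join(weighted if part == keyword else part for part in parts)


prompt = "a marble statue, female, (tanktop:0.5)"
print(set_keyword_weight(prompt, "tanktop", 0.3))
print(set_keyword_weight(prompt, "female", 0.8))
```

Changing one weight at a time and rerunning a few frames is in keeping with the one-change-at-a-time approach described above.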
If you are following along with the tutorial, add a 3rd controlnet using the default lineart preprocessor, with pixel perfect on, a weight of 0.6, balanced and batch mode as above. You will end up with a video like this (https://youtu.be/WtvYpTD0-mI ).
You will spend time here adjusting your prompt and 3rd controlnet settings to give you the transformation you want. Sometimes you have to adjust your denoising a bit too. If you find that you need to increase it drastically, this is probably because your controlnet settings are too strong. Do try the ‘controlnet is more important’ and ‘prompt is more important’ settings too – the results are not always intuitive and sometimes they can give you better results than you expect.
For the tutorial here I decided that my current settings left too much of the grey lines from the marble, and it also looked a bit flat. From experience, adding ‘shiny’ will change the texture of the statue and make it a bit more dynamic. I also reduced the strength on the tanktop token as it was making it a bit too realistic. You can use PNG Info to load the prompt from the “Added shiny keyframe”. Your video will now look something like this [Insert link here].
I do not always use loopback controlnets – the main reason is that there are no exceptional ones. However, some have their uses depending on the video and can help you get a bit more stability than you otherwise would have. They can help especially if some part of the conversion is getting inconsistently colored. They can also help turn a flickering portion into something that morphs more slowly and can go unnoticed or be more pleasing to the eye. To get a feeling for what I am saying, look at the background of this video (https://youtube.com/shorts/PcdYvrh4PFA ): the result is not stable, but it is more pleasing to the eye than things flickering in and out. For an advanced discussion on this see the end of this document.
In this case, however, a loopback controlnet is helpful. If you are following the tutorial, do the following.
Enable your 4th controlnet – Reference – reference_adain+attn. Keep it in single image mode and use the keyframe (“3 - Added Shiny”) as its image. Set the control weight to 0.6 and choose ‘Controlnet more important’. Now your video will look like (https://youtu.be/tb-veBsRtmE ).
Not bad! You have now made your first Vid2Vid AI conversion.
Whew! Are you exhausted? Now that you are here, you have more or less stabilized the video and can play around with different prompts and denoising strengths. For the rest I generally did not need a loopback controlnet and stuck with the first 3. It is really as simple as troubleshooting denoising strength and prompt, and deciding whether to use loopback or not. There are 2 more presets in the input keyframes folder I gave you.
The crayon preset requires this Lora (https://civitai.com/models/120853?modelVersionId=131468 )
The anime preset requires that you switch your model to DarkSushiMix Colorful (https://civitai.com/models/24779?modelVersionId=56071 )
Unprocessed Anime Output: https://youtu.be/wDiHpE56oCQ
Unprocessed Crayon Output: https://youtu.be/8gtnC6-uoEI
You can combine things like the statue prompt with the anime checkpoint to get: https://youtu.be/XznP1MP1A7o
The Final Step - Post Processing:
Now that you have your output frames you need to stitch them together into a video. There are plenty of ways to do this, but you can point Flowframes at your output folder and have it interpolate from 16 to 32 fps and produce an mp4 all at once. EZGif can also be used.
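If you only want to stitch the frames without interpolation, ffmpeg is another option (it is not part of my workflow above, so treat this as an alternative). The sketch below just builds the command as a list of arguments. The helper name, the `%05d.png` frame pattern and the output file name are assumptions, so adjust them to match how your output frames are actually named.

```python
from typing import List


def ffmpeg_stitch_command(frame_pattern: str, fps: int, output: str) -> List[str]:
    """Build an ffmpeg command that turns numbered frames into an mp4."""
    return [
        "ffmpeg",
        "-framerate", str(fps),   # input frame rate of the image sequence
        "-i", frame_pattern,      # e.g. output/%05d.png (assumed naming)
        "-c:v", "libx264",        # H.264 video
        "-pix_fmt", "yuv420p",    # widely compatible pixel format
        output,
    ]


cmd = ffmpeg_stitch_command("output/%05d.png", 16, "statue.mp4")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` if ffmpeg is installed. This produces a 16 fps video; it does not do the 16 to 32 interpolation that Flowframes does.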
I often do a deflickering pass in DaVinci Resolve, but that is not available in the free version. It does not help with all flickering but can help where there is some color inconsistency; I usually use the fluoro light setting. I am not always sure it changes things much.
You can also consider deleting aberrant frames or doing some quick image editing over the parts that end up being unusual. Sometimes there is a frame that just won’t cooperate and a quick edit in paint prevents the flicker.
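Finding the aberrant frames by eye gets tedious on longer clips. One rough way to get candidates is to compare each frame's average brightness with that of its neighbours. This is a sketch under assumptions: the function name, window and threshold are hypothetical, the brightness values below are made up, and in practice you would compute them from your output frames with an image library of your choice.

```python
from statistics import median
from typing import List, Sequence


def flag_aberrant_frames(brightness: Sequence[float], window: int = 2, threshold: float = 10.0) -> List[int]:
    """Return indices of frames whose mean brightness differs from the
    median of their neighbours by more than `threshold`."""
    flagged = []
    for i, value in enumerate(brightness):
        lo = max(0, i - window)
        hi = min(len(brightness), i + window + 1)
        neighbours = [brightness[j] for j in range(lo, hi) if j != i]
        if neighbours and abs(value - median(neighbours)) > threshold:
            flagged.append(i)
    return flagged


# Made-up mean brightness values (0-255), one per frame, for illustration only.
values = [120, 121, 119, 160, 120, 122, 121]
print(flag_aberrant_frames(values))
```

This only catches frames that are off in overall brightness, so it will miss a frame where a small detail changes. It is a way to shortlist frames to look at, not a replacement for watching the video.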
If you want to see how the videos look fully processed (although I did not do any aberrant frame correction): https://youtu.be/pCNC-324dPY
And you are done! I hope you enjoyed going through this with me and hopefully this gives you a base by which to start your own exploration into Vid2Vid conversions! Please be responsible!
Advanced discussion – Summary of workflow:
I cannot emphasize how important I think prompting is in a vid2vid workflow. Generally, you want to abstract the video rather than add details where you can.
1/IP2P at 0.35 and TemporalNet at 0.4 – adjust prompt and denoising strength to find the minimum denoising strength that gives the desired transformation. You can increase CFG if you are at max denoising strength and things are not doing what you want, but this will likely make things more unstable
2/Add a stabilizer controlnet – See discussion below on these. Also prompt for areas of inconsistency
3/Consider Loopback options if needed.
4/Flowframes and a pass through deflicker on davinci resolve
Advanced discussion – Control nets for Vid2Vid transformations:
I have tried many permutations of control nets – some very strong others not. Trying to keep an open mind with what works is what has helped me the most.
Base Controlnets and why I use them:
1/Temporalnet – this is by far the most helpful – it particularly helps in keeping background objects from flickering
2/IP2P – you probably are surprised about this one and it is my newest addition – I think it allows for more transformation at a lower denoising strength meaning less flickering – I have not tested it 100% but the results I am getting are good enough to keep it
3/Tile – I have used this in the past but feel that generally this is the same as decreasing denoising strength. I am happy to be proven wrong here. It has the interesting effect of unblurring things if the original video has a blurry background (it makes sense given this was its training data).
Stabilizer Controlnets:
1/Depth – The least invasive of all. It can help stabilize things without making things convert back to the original video, but it is not always helpful enough
2/Lineart – I use this a lot as it is a balance between good at guiding/reducing flicker while also not being so heavy handed.
3/Softedge – Very strong and often makes the video turn back to the original – It can be used to some benefit if you make prompt more important.
4/Openpose – can help with eye position sometimes.
I have messed around with the other controlnets but did not find any good use for them as of yet. The QR code controlnet seems like it might be useful but I have not tested it too much.
Loopback Controlnets:
There is no great option here for us right now. I hope somebody develops a tool/controlnet to help.
1/Temporalnet – can be used at low strengths to help stabilize things – tends to cause trailing and can cause other issues, but it is an option
2/Reference_only – using this on balanced with 0 style fidelity can actually do a lot to stabilize a background sometimes – even one that is hallucinated
3/Reference_adain+attn – Balanced with low style fidelity, or a low strength with controlnet more important, can both help. Neither works as well as you would think. Beware with all reference controlnets: if you start getting alternating light and dark frames, it is because your strength is too high (usually style fidelity is set too high). You can also do a frame with no loopback, which can help a lot too.
I hope you enjoyed this tutorial. Feel free to ask questions and I will do my best to answer. If you did enjoy it please consider subscribing to my channel (https://www.youtube.com/@Inner-Reflections-AI) or my Instagram/Tiktok (https://linktr.ee/Inner_Reflections )
If you are going to copy this on your blog or use this on a video, please consider attributing me by calling this the Inner-Reflections method and linking to my YouTube or linktree accounts at the top of the post.
If you are a commercial entity and want some presets that might work for different style transformations feel free to contact me here or on my social accounts.
If you would like to collab on something or have questions, I am happy to connect here or on my social accounts.