AI VIDEO - first/last frame image-to-video - CogVideoX-Fun 1.1 - The Best I've Tested So Far - v1.0 Showcase
Decent quality AI videos, FAST, controllable, enjoyable!
I don't usually share my workflows, but this absolutely needs to be shared.
This is based on the CogVideoXWrapper,
using the CogVideoX-Fun 1.1 model, and allows inputting two frames. The incredible potential of this went unnoticed, so I had to share it.
Basically, it interpolates between the two images and adds noise in between to generate movement and detail in the transition between the two chosen frames. This can 100% be scaled up to create much longer, consistent videos (especially with ControlNet).
I'm working on this right now.
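To make the idea concrete, here's a toy sketch of what "interpolate and add noise in between" means. This is my own illustration, not the wrapper's actual code; the function name, tensor shapes, and `noise_strength` parameter are all hypothetical:

```python
import torch

def interpolation_latents(first, last, num_frames=25, noise_strength=0.2):
    """Toy illustration of first/last-frame interpolation with noise.

    first, last: latent tensors of shape [C, H, W]. Everything here is a
    hypothetical simplification; the real pipeline works in its own latent space.
    """
    t = torch.linspace(0.0, 1.0, num_frames).view(-1, 1, 1, 1)
    frames = (1 - t) * first.unsqueeze(0) + t * last.unsqueeze(0)  # plain crossfade
    frames[1:-1] += torch.randn_like(frames[1:-1]) * noise_strength  # noise only in between
    return frames  # [num_frames, C, H, W]; the sampler denoises this into real motion
```

The point is that the two anchor frames stay clean while the noisy in-betweens give the sampler room to invent motion instead of a flat crossfade.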
### Minimum Hardware Requirements:
12GB VRAM (or possibly less; it actually uses around 8GB).
I'm on a 3090, and it never fills more than 16GB, even when experimenting with higher resolutions; memory use comes down to the base resolution.
Usually, I see around 9GB occupied.
### Render Times:
On my 3090, it takes around 15 to 90 seconds for each example.
I've tested everything, and you'll get the best and fastest results using this friendly workflow, especially if you've never tried Cog before.
### Info:
You can pick images of any size/ratio, vertical or horizontal; my tests show it doesn't seem to matter.
(Pick two images with the same aspect ratio, obviously, or one will be stretched.)
The only thing that matters is the base resolution, set via the dedicated slider (I typically use 512, but try 768 if your hardware can handle it).
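I don't know exactly how the node maps the base resolution to output dimensions, but a reasonable mental model is "scale to roughly base x base pixels of total area, keep the aspect ratio, snap each side to a VAE-friendly multiple." This helper is purely my guess at that logic, not the node's actual code:

```python
def fit_to_base(width, height, base=512, multiple=16):
    # Hypothetical: scale to ~base*base total pixels, keep the aspect ratio,
    # and round each side to a multiple the model/VAE can accept.
    scale = (base * base / (width * height)) ** 0.5
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

print(fit_to_base(1920, 1080))  # -> (688, 384) with base=512
```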
### Important Suggestions:
- To achieve a good, consistent result, the two images need to be similar (same location, people, and very close positioning of everything).
- Example use: pick two screenshots from a random internet video or 3D animation, or even do an img2img with the two frames.
- Stay around 10-15 steps; go higher for better quality (it's hit or miss below that, although I've gotten some nice results at 5 steps). Sometimes I got better quality at 10 steps, sometimes at 15; it comes down to luck.
- For quick tests, use a lower base resolution (like 320). At that resolution it takes around 15 seconds on my 3090.
- If the results are full of artifacts, switch to "custom prompt only" to avoid auto-prompting, and simplify the prompt to get more stable, consistent animations (like the lollipop example). Write something simple like: "She blinks her eyes, still, camera shake."
If you want more movement, type: "She wiggles, she blinks her eyes, movement, camera shake." Experiment with prompts, and please share your findings! Words like wiggle, earthquake, blink, camera shake, and handheld camera have already been tested here with great success.
- If the video seems too fast for your settings, raise the "extra interpolation multiplier" or change the video length in the COG settings group.
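To see why those two knobs change the perceived speed, here's the rough arithmetic. I'm assuming playback at CogVideoX's usual 8 fps; your workflow's output fps may differ:

```python
frames = 25   # the "video length" setting
fps = 8       # assumed playback rate; check your video combine node
mult = 2      # "extra interpolation multiplier" adds in-between frames

print(frames / fps)          # 3.125 s of video from the raw frames
print(frames * mult / fps)   # 6.25 s: the same motion spread over more frames, so it looks slower
```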
### Other Considerations:
I've tested this A LOT and changed the values from the standard settings to something I think works better, at least based on my tests in the last 48 hours.
Feel free to make your own changes (and if you find better settings, please let us know)!
### Settings Adjustments:
- Video Length: 25 (the standard was 49) – this greatly changes how the result behaves.
- Base Resolution: from 320 to 768 (the standard was 512) – this increases consistency in small details like hands and eyes but changes overall behavior.
- Scheduler: LCM (the standard was DPM++) – this may allow lower steps.
- Steps: from 5 to 20 (the standard was 30) – this speeds things up at the cost of quality.
- CFG: 5 (the standard was 6) – again, behavior varies.
- Noise Augmentation Strength: 0.2 (the standard was 0.0563) – this is mysterious and behaves strangely.
- Cog Prompt Strength: 1.5 (the standard was 1) – output with the same seed looked cleaner in multiple tests when I raised it to 1.5.
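For quick reference, here are the values above collected in one place (the key names are my own shorthand, not necessarily the exact node field names):

```python
MY_SETTINGS = {
    "video_length": 25,          # standard: 49
    "base_resolution": 512,      # I range from 320 (fast tests) to 768; standard: 512
    "scheduler": "LCM",          # standard: DPM++
    "steps": 10,                 # I use anywhere from 5 to 20; standard: 30
    "cfg": 5,                    # standard: 6
    "noise_aug_strength": 0.2,   # standard: 0.0563
    "cog_prompt_strength": 1.5,  # standard: 1.0
}
```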
### Required Files:
https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-InP/tree/main
All files go in ComfyUI\models\CogVideo\CogVideoX-Fun-V1.1-5b-InP\
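If you'd rather script the download than grab files by hand, something like this should work (a sketch using the huggingface_hub library; adjust the local path to your install):

```python
from huggingface_hub import snapshot_download

# Pull the whole model repo into the folder the workflow expects.
snapshot_download(
    repo_id="alibaba-pai/CogVideoX-Fun-V1.1-5b-InP",
    local_dir=r"ComfyUI\models\CogVideo\CogVideoX-Fun-V1.1-5b-InP",
)
```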
This workflow also uses Florence to auto-describe the input image and improve the prompt, but feel free to swap in other vision models, or bypass it and type the prompt manually.
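If you're curious what the auto-describe step does, here's a minimal standalone Florence-2 captioning sketch. I'm assuming the microsoft/Florence-2-base checkpoint and the detailed-caption task; the workflow's Florence node may use a different variant or settings:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"  # assumption; the node may ship another variant
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("first_frame.png").convert("RGB")
task = "<DETAILED_CAPTION>"  # Florence-2 task token for a detailed caption
inputs = processor(text=task, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(raw, task=task, image_size=image.size)[task]
print(caption)  # use this (or an edited version) as the video prompt
```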
### Update:
As I was refining and compiling this to share, the CogVideoXWrapper page released its own interpolation JSON workflow, but it's somewhat broken and missing pieces, and also misleading (it forces input images to stay at certain resolutions),
so I wouldn't be too sure that the standard values they gave can be considered "optimal."
Feel free to experiment with it anyway.