UPDATE: The Image to Video Encoder is released! Use the bookmarks on the right.
Runpod Template now available - scroll to the bottom for more information!
V7 Showcase AMV
V7 Tutorial Guide (featuring ACE-HoloFS Video PromptGen explainer)
Mochi is a groundbreaking new video generation model that you can run on your local GPU. It uses around 20GB of VRAM, which sounds like a lot, but the authors originally ran it on 4x H100 GPUs, so this is a HUGE optimization.
Special thanks to the man of the hour, Kijai, who wrote this wrapper super fast so we can try this model in ComfyUI. This article details what you need to download and where to put the files so you can start using this model.
Find my Workflow Pack here: Donut-Mochi-Video-Pack
https://civitai.com/models/886896
Original Project Github (Authors): https://github.com/genmoai/models
Mochi Wrapper for ComfyUI: https://github.com/kijai/ComfyUI-MochiWrapper
Mochi Preview, Comfy Models: https://huggingface.co/Kijai/Mochi_preview_comfy/tree/main
WEIGHTS
mochi_preview_dit_fp8_e4m3fn.safetensors
place inside: ComfyUI\models\diffusion_models\mochi
VAE
mochi_preview_vae_bf16.safetensors
place inside: ComfyUI\models\vae\mochi
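If you want to sanity-check the placement before launching, here is a minimal sketch that confirms the two files above are where the wrapper expects them. It assumes a default ComfyUI folder layout; adjust COMFYUI_ROOT to wherever your install actually lives.

```python
import os

# Adjust this to your own ComfyUI install location (Windows-style example path).
COMFYUI_ROOT = r"C:\ComfyUI"

# The two files from the download list above, in the folders named in this guide.
expected_files = [
    os.path.join(COMFYUI_ROOT, "models", "diffusion_models", "mochi",
                 "mochi_preview_dit_fp8_e4m3fn.safetensors"),
    os.path.join(COMFYUI_ROOT, "models", "vae", "mochi",
                 "mochi_preview_vae_bf16.safetensors"),
]

for path in expected_files:
    status = "OK" if os.path.isfile(path) else "MISSING"
    print(f"[{status}] {path}")
```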
Note from the author: Work in Progress!
To install this, run "git clone https://github.com/kijai/ComfyUI-MochiWrapper" from inside your \custom_nodes\ folder.
Much like CogVideo, consider this a preview for now!
"Requires flash_attn !"
More information on this will be added; it seems some Windows Torch/CUDA builds were missing this. Update ComfyUI and its Torch dependencies!
Big Thanks to Kijai for this latest update!
If you run Torch 2.5.0 (the latest at the time of writing), this will now run without the lengthy Flash Attention build that a lot of people found very difficult.
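If you are not sure where your environment stands, a quick diagnostic like this (run in the same Python environment that launches ComfyUI) prints your Torch version and whether flash_attn is importable. Treat it as a convenience check, not an official requirement test.

```python
# Quick diagnostic: which Torch build am I on, and is flash_attn importable?
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn  # only present if the optional Flash Attention wheel/build is installed
    print("flash_attn:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    # Per the note above, Torch 2.5.0 can run the wrapper without building flash_attn.
    print("flash_attn not installed")
```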
VAE Encoder (image 2 video)
(research thread here: https://github.com/kijai/ComfyUI-MochiWrapper/issues/26#issuecomment-2453119922)
You will need:
https://huggingface.co/Kijai/Mochi_preview_comfy/blob/main/mochi_preview_vae_encoder_bf16_.safetensors
place inside: ComfyUI\models\vae\mochi
UPDATE: Improving Outputs
decoder settings:
frame_batch_size = 16
tile_sample_min_width = 144
tile_sample_min_height = 80
tile_overlap_factor_height = 0.3
tile_overlap_factor_width = 0.3
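Purely as a quick reference, here are the same decoder values collected as a Python dict. The key names mirror the widget labels above, and the comments are my reading of what each one controls rather than anything official.

```python
# Tiled VAE decoder settings used in this guide (same values as listed above).
decoder_settings = {
    "frame_batch_size": 16,              # how many frames get decoded per VAE pass
    "tile_sample_min_width": 144,        # minimum tile width in pixels when decoding is tiled
    "tile_sample_min_height": 80,        # minimum tile height in pixels when decoding is tiled
    "tile_overlap_factor_height": 0.3,   # fraction of vertical overlap between tiles, to blend seams
    "tile_overlap_factor_width": 0.3,    # fraction of horizontal overlap between tiles
}

for name, value in decoder_settings.items():
    print(f"{name} = {value}")
```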
We get good results using CFG 7.0 in the KSampler.
Also consider using the SD3.5 T5XXL_FP8_e4m3fn_scaled CLIP model; this will also improve the outputs without increasing steps, and only increases gen time from 20 to 24 minutes.
By forcing the T5XXL-FP16 onto the CPU and then using the Q8_0 model for inference, we increased the quality at 50 steps even further; however, the ghosting effect is more noticeable, so I have solved this in the V6 version.
In the upcoming V6 pack we have dropped VAE tiling by saving the result from the KSampler as .latent files. This eliminated a lot of compute overhead. While it is no faster overall, the decoding stage can now be done separately, and takes around 30 seconds. The tiled VAE also caused ghosting artifacts, which you can see in the video outputs.
I have updated my DJZ-Nodes & DJZ-Workflows pack on GitHub to include this new approach. Results will be added to the V6 Gallery soon - the videos are cooking as I write this.
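If you want to peek inside one of these saved latents before decoding, a small sketch like the one below works, assuming ComfyUI's usual .latent format (a safetensors archive whose payload sits under the "latent_tensor" key - check your own files if the key differs). The filename here is just a hypothetical example.

```python
# Inspect a ComfyUI .latent file saved by the sampling stage.
# Assumes it is a safetensors archive containing a "latent_tensor" entry.
from safetensors.torch import load_file

latent_path = "ComfyUI/output/latents/Donut_Mochi_00001_.latent"  # hypothetical filename
data = load_file(latent_path)

print("keys:", list(data.keys()))
latent = data["latent_tensor"]
print("shape:", tuple(latent.shape))  # video latents generally carry a frames dimension; layout varies by model
print("dtype:", latent.dtype)
```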
I have created a new Image Size Adjuster (V3) with an option for Mochi1 Preview, which sets the resolution to 848x480 (16:9) and will automatically switch to a tall aspect ratio if you want to use a 9:16 image. Also, to load latents with the stock node you normally have to copy the saved files from \output to \input, which I found a little awkward, so a new node was created.
DJZ-LoadLatent is designed to scan the ComfyUI output path for .latent files. Simply press "R" and it will reload the node and scan for new latents, making it much easier and faster to decode them separately. Full instructions are shown in the workflow notes.
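For anyone curious how a node like that can be put together, here is a stripped-down sketch of the idea: scan the output folder for .latent files and expose them as a dropdown. This is my own illustration of the pattern, not the actual DJZ-LoadLatent source, and it assumes the standard ComfyUI custom-node interface plus the safetensors "latent_tensor" layout mentioned above.

```python
# Minimal illustration of a "load latent from the output folder" node.
# Not the real DJZ-LoadLatent code - just the general pattern, to run inside ComfyUI.
import os
import folder_paths                     # ComfyUI's helper module for standard paths
from safetensors.torch import load_file


class LoadLatentFromOutput:
    @classmethod
    def INPUT_TYPES(cls):
        out_dir = folder_paths.get_output_directory()
        # Walk the output folder and offer every .latent file as a dropdown choice.
        # Pressing "R" in the browser re-evaluates this, which is how new files show up.
        choices = []
        for root, _, files in os.walk(out_dir):
            for f in files:
                if f.endswith(".latent"):
                    choices.append(os.path.relpath(os.path.join(root, f), out_dir))
        return {"required": {"latent_file": (sorted(choices),)}}

    RETURN_TYPES = ("LATENT",)
    FUNCTION = "load"
    CATEGORY = "latent"

    def load(self, latent_file):
        path = os.path.join(folder_paths.get_output_directory(), latent_file)
        data = load_file(path)
        # ComfyUI passes latents around as {"samples": tensor}.
        # (The stock LoadLatent node also handles a legacy scale factor; omitted here for brevity.)
        return ({"samples": data["latent_tensor"]},)


NODE_CLASS_MAPPINGS = {"LoadLatentFromOutput": LoadLatentFromOutput}
```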
The testing continues. We can always increase the steps to 200, but this can push gen times to 45 minutes - not surprising when the model was built to run on 4x H100s!
Check my workflow pack for more updates as they land! (link above)
RUNPOD TEMPLATE
What am I doing?
I'm using Runpod only to cook my latents into video.
Why?
Because I can do the sampling-to-.latent part on my PC overnight (which saves time in the Runpod). If it takes 10 minutes to sample a .latent and only 30 seconds to cook that .latent into video, this is a HUGE saving.
And you can't run out of memory with 48GB.
https://runpod.io/console/deploy?template=egyeo55x8w&ref=0czffee4
ComfyUI Mochi Runpod template
I'm using it like this:
create samples locally and save them as .latent files
start the Runpod template with an L40 (48GB)
upload all my .latent files to the pod's /output folder
load the Donut-Mochi-848x480-t2v-LatentSideload-v25 workflow
the first run takes a long time, so don't forget to connect your network storage to skip that wait on the second run
refresh ComfyUI (press "R")
choose the .latent from the list in my new DJZ-LoadLatent node
queue it up!
This way you can take advantage of the power of the L40 to crush those latents into video in record time, reducing the cost of the Runpod session by around 95%.
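Taking the rough numbers above at face value (about 10 minutes of sampling per clip done for free on the local GPU, versus about 30 seconds of decoding done on the pod), the back-of-the-envelope saving on paid GPU time works out like this:

```python
# Rough estimate of paid GPU time, using the timings quoted above.
sample_minutes = 10.0   # sampling to .latent, done locally overnight (free)
decode_minutes = 0.5    # decoding the .latent to video on the rented L40

full_job_on_pod = sample_minutes + decode_minutes   # if everything ran on Runpod
decode_only_on_pod = decode_minutes                 # the latent side-load approach

saving = 1 - decode_only_on_pod / full_job_on_pod
print(f"paid GPU time per clip: {decode_only_on_pod} min instead of {full_job_on_pod} min")
print(f"approximate saving: {saving:.0%}")          # ~95%
```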
There will soon be a one-shot Runpod workflow that does it all in one go (for those who can't run locally), but this is the cheapest solution if you can run the sampler on your local machine.
https://i.gyazo.com/d8b3f4e54b0b8642f3e430cfa3b2f2d3.mp4 <- decoding process takes 15 seconds.
I created 48 videos overnight, so there will be a large gallery in my next report.
added to the V6 workflow pack gallery:
https://civitai.com/posts/8455626
If you are interested, the research thread on GitHub is here:
https://github.com/kijai/ComfyUI-MochiWrapper/issues/26