
Donut Mochi Pack - Video Generation

Updated: Dec 9, 2024
Tags: base model, generator, video, mochi
Type: Workflows
Published: Dec 9, 2024
Base Model: Mochi
Hash (AutoV2): 9FF70B1A65

MOCHI VIDEO GENERATOR

(results are in the V1, V2, etc. galleries; click the tabs at the top)

True i2v workflow added from V8 onwards; details are in the main article.

video TBA

Showcase Special: (created mostly with one ACE-HOLO promptgen line)

Pack update V7 + a special video promptgen guide with ACE-HoloFS.


V7 Demo Reel (made with Shuffle Video Studio)


Roundup of the research so far, with some more detailed instructions/info



Current leader (V7 gallery; V8 adds image encoding):
"\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-BatchedLatentSideload-v55"
The i2v version used the LLM video prompt gen; the t2v version used my Zenkai-prompt + DJZ-LoadLatent.

WIP project by Kijai
Info/Setup/Install guide: https://civitai.com/articles/8313
Requires Torch 2.5.0 minimum, so update your Torch if you are behind.
As with the CogVideo workflows, these are provided for people who want to try the preview :)
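
A quick way to confirm the local PyTorch build meets that 2.5.0 minimum (a minimal sketch, nothing Mochi-specific):

import torch

# The Mochi wrapper needs PyTorch 2.5.0+; warn if the installed build is older.
major, minor, *_ = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
if (major, minor) < (2, 5):
    print(f"PyTorch {torch.__version__} is too old - upgrade to 2.5.0 or newer")
else:
    print(f"PyTorch {torch.__version__} is OK")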

Even with a 4090 it can push the limits a little, so I provide the workflows I used to research Tile Optimisation in V1 (a small sketch of the tile-size arithmetic follows the list below):

  1. We're reducing tile sizes by roughly 20-40% from the defaults

  2. We're increasing the frame batch size to compensate

  3. We're maintaining the same overlap factors to prevent visible seams

Key principles:

  • Tile sizes should ideally be multiples of 32 for most efficient processing

  • Keep width:height ratio similar to the original tile sizes

  • Frame batch size increases should be modest to avoid frame skipping
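
As mentioned above, a small sketch of that tile-size arithmetic (shrink_tiles is a hypothetical helper for illustration, not a node in the pack; the 256x128 starting point is taken from the batch12 config further down):

def shrink_tiles(base_w=256, base_h=128, reduce=0.25, base_batch=10):
    # Shrink the decoder tiles by a fraction, snap to multiples of 32, keep the
    # width:height ratio, and raise the frame batch only modestly to compensate.
    def snap32(x):
        return max(32, int(round(x / 32)) * 32)  # multiples of 32 decode most efficiently
    tile_w = snap32(base_w * (1 - reduce))
    tile_h = snap32(base_h * (1 - reduce))
    frame_batch = min(base_batch + 6, round(base_batch / (1 - reduce)))  # modest increase only
    return tile_w, tile_h, frame_batch

print(shrink_tiles(reduce=0.25))  # (192, 96, 13)
print(shrink_tiles(reduce=0.40))  # (160, 64, 16)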

Researcher's Tip!
If you work with a fixed seed, the sampler result remains in memory. The first gen took ~1700 seconds, but the decoder settings can then be changed and the next video takes ~23 seconds: all the heavy work is already done by the sampler, so unless we take a new seed it will reuse the same samples, and the VAE decode itself is very fast!

^ subsequent gens on the same seed are very fast, allowing tuning of the decoder settings ^

^ the initial generation took ~1700 seconds with PyTorch 2.5.0 SDP ^
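
Conceptually the effect is like caching the sampler output by seed (a sketch of the behaviour only; ComfyUI handles this internally):

_sample_cache = {}

def run(seed, sample_fn, decode_fn, decoder_settings):
    # First run with a given seed pays the full sampling cost (~1700 s here);
    # later runs with the same seed reuse the samples and only re-decode (~23 s).
    if seed not in _sample_cache:
        _sample_cache[seed] = sample_fn(seed)
    latents = _sample_cache[seed]
    return decode_fn(latents, **decoder_settings)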

V1 Workflows:

Outputs are labelled and added to the V1 gallery; test prompt used:
"In a bustling spaceport, a diverse crowd of humans and aliens board a massive interstellar cruise ship. Robotic porters effortlessly handle exotic luggage, while holographic signs display departure times in multiple languages. A family of translucent, floating beings drift through the security checkpoint, their tendrils wrapping around their travel documents. In the sky above, smaller ships zip between towering structures, their ion trails creating an ever-changing tapestry of light."


\Decoder-Research\Donut-Mochi-848x480-batch10-default-v5
= Author Default Settings

  • This version used the recommended config from the author


\Decoder-Research\Donut-Mochi-640x480-batch10-autotile-v5
= Reduced size, Auto Tiling
- This is my first run, which created the video in the gallery, simply using Auto Tile on the decoder and reducing the overall dimensions to 640x480. This reduction makes generation take less memory, but it is heavy-handed and will reduce the quality of outputs.

The remaining workflows all investigate the possible configs without using Auto Tiling, so we know exactly what was used. Videos will be labelled with the batch count and added to the V1 gallery. Community research is required!

\Decoder-Research\Donut-Mochi-848x480-batch12-v5
frame_batch_size = 12
tile_sample_min_width = 256
tile_sample_min_height = 128

\Decoder-Research\Donut-Mochi-848x480-batch14-v5
frame_batch_size = 14
tile_sample_min_width = 224
tile_sample_min_height = 112

\Decoder-Research\Donut-Mochi-848x480-batch16-v5
frame_batch_size = 16
tile_sample_min_width = 192
tile_sample_min_height = 96

\Decoder-Research\Donut-Mochi-848x480-batch20-v5
frame_batch_size = 20
tile_sample_min_width = 160
tile_sample_min_height = 96

\Decoder-Research\Donut-Mochi-848x480-batch24-v5
frame_batch_size = 24
tile_sample_min_width = 128
tile_sample_min_height = 64

\Decoder-Research\Donut-Mochi-848x480-batch32-v5
frame_batch_size = 32
tile_sample_min_width = 96
tile_sample_min_height = 48

The last workflow is a hybrid approach: the increased overlap factors (0.3 instead of 0.25) might help reduce visible seams when using very small tiles.

\Decoder-Research\Donut-Mochi-848x480-batch16-v6
frame_batch_size = 16
tile_sample_min_width = 144
tile_sample_min_height = 80
tile_overlap_factor_height = 0.3
tile_overlap_factor_width = 0.3
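
A rough illustration of why the larger overlap matters for tiny tiles: the blending band is tile size times overlap factor, so at 0.25 the smallest tiles get only a few pixels of blend (values taken from the configs above):

configs = [
    ("batch16-v5", 192, 96, 0.25),
    ("batch24-v5", 128, 64, 0.25),
    ("batch32-v5",  96, 48, 0.25),
    ("batch16-v6", 144, 80, 0.30),  # hybrid: small tiles, larger overlap
]
for name, w, h, f in configs:
    print(f"{name}: overlap ~ {w * f:.0f} x {h * f:.0f} px")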

V2 Workflow

\CFG-Research\Donut-Mochi-848x480-batch16-CFG7-v7

This used the Donut-Mochi-848x480-batch16-v6 workflow with 7.0 CFG.
This seems to be a good setting; generation time is 24 minutes with this setup.
(PyTorch SDP used)


V3 Workflows

\FP8--T5-Scaled\Donut-Mochi-848x480-batch16-CFG7-T5scaled-v8

We decided to use the FP8_Scaled T5 CLIP model; this improved the outputs greatly across all prompts tested. Check the V3 gallery. This is the best so far! (until we beat it)

\GGUF-Q8_0--T5-Scaled\Donut-Mochi-848x480-b16-CFG7-T5scaled-Q8_0-v9

This did not yield the best results, probably because the T5 scaled CLIP was still in FP8 while we were testing GGUF Q8_0 as the main model.

V4 Workflow

\T5-FP16-CPU\Donut-Mochi-848x480-b16-CFG7-CPU_T5-FP16-v11

This used T5XXL in FP16 by forcing it onto the CPU. It seems to show the same artifacts as V3, where we used GGUF Q8_0 with T5XXL FP8.

V5 Workflows

\GGUF-Q8_0--T5-FP16-CPU\Donut-Mochi-848x480-GGUF-Q8_0-CPU_T5-FP16-v14

These were the best settings with VAE Tiling enabled; increasing the steps will of course increase both the quality and the time taken.

Increasing steps to 100-200 improves quality at the expense of time taken; 200 steps takes 45 minutes. There is likely no released version for this, because anybody can add more steps to any of these workflows and simply wait a very long time for a 6-second video. This can be remedied with a cloud setup and more/larger GPU/VRAM allocation.
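
As a rough rule of thumb, assuming generation time scales close to linearly with step count on the same hardware and settings (the 200-step, 45-minute figure above is the anchor point):

minutes_per_step = 45 / 200  # ~0.225 min per step
for steps in (50, 100, 200):
    print(f"{steps} steps ~ {steps * minutes_per_step:.0f} min")
# 50 steps ~ 11 min, 100 steps ~ 22 min, 200 steps ~ 45 min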

V6 Workflows

\Fast-25-Frames\Donut-Mochi-848x480-Fast-v4

Used VAE Tiling with 25 frames to generate 1 second of video. With 50 steps this takes a few minutes; 4-5 minutes for 100 steps.

\NoTiling-SaveLoadLatent\Donut-Mochi-848x480-i2v-LatentSideload-v21

Using my new DJZ-LoadLatent node, you can save the sampler results as .latent files on disk. This makes it possible to decode the latents as a separate stage, eliminating the need for the Tiling VAE. This workflow is image to video and uses OneVision to estimate a video prompt from any given image; it also automatically detects tall or wide aspect ratio and crops/fills to 16:9 or 9:16. NOTE: more testing must be done to prove that tall-aspect quality is good.

\NoTiling-SaveLoadLatent\Donut-Mochi-848x480-t2v-LatentSideload-v25
This is the text-to-video version of the previous workflow: we drop OneVision and ImageSizeAdjusterV3 and add Zenkai-Prompt-V2 back in to take advantage of our prompt lists. Full instructions are found in the workflow notes.

The Save/Load Latent approach allows us to drop the Tiling VAE, which introduced ghosting to all videos regardless of the settings; as we achieve improved quality, the ghosting becomes more apparent.
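
Conceptually the sideload approach splits generation into two independent stages (a minimal sketch, not the actual DJZ-LoadLatent node):

import torch

def sample_and_save(sample_fn, seed, path="shot_001.latent"):
    # Stage 1: run the expensive Mochi sampling once and park the result on disk.
    latents = sample_fn(seed)
    torch.save({"samples": latents}, path)
    return path

def load_and_decode(decode_fn, path="shot_001.latent"):
    # Stage 2: load the saved latents and decode as a separate job, so the full
    # (non-tiled) VAE decode can be tuned and re-run without re-sampling.
    latents = torch.load(path)["samples"]
    return decode_fn(latents)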

V7 Workflows

Updated the V6 latent sideload workflows to use the newer VAE Spatial Tiling Decoder.
This can run 100% on the local GPU, and all the demo videos in the gallery used only 50 steps
(100 steps were used in the V6 gallery). Another significant upgrade!

\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-LatentSideload-v50.json

  • text2video, VAE spatial tiling decoder, with my latent loader

\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-i2v-LatentSideload-v50.json

  • pseudo image2video, VAE spatial tiling decoder, with my latent loader

\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-t2v-BatchLatentSideload-v55.json

  • text2video, VAE spatial tiling decoder, with my V2 batched latent loader

\V7-Spatial-Tiling-VAE\Donut-Mochi-848x480-i2v-BatchLatentSideload-v55.json

  • pseudo image2video, VAE spatial tiling decoder, with my V2 batched latent loader


NOTE: V7 is available on GitHub in my DJZ-Workflows pack; however, it will not be published here until the new batch of videos is finished (cooking all night tonight).



V8 Workflows

\True-Image-To-Video\Donut-Mochi-848x480-i2v-LatentSideload-v90.json

  • image2video, VAE spatial tiling decoder, with my latent loader

\True-Image-To-Video\Donut-Mochi-848x480-i2v-BatchedLatentSideload-v90.json

  • image2video, VAE spatial tiling decoder, with my V2 batched latent loader

Added true i2v (image to video using the new VAE Encoder).
Tutorial video TBA; details in the main article.
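
In outline, the difference from the pseudo i2v workflows is that the input image itself goes through the VAE encoder and seeds the latents, rather than only being used to write a prompt (a conceptual sketch with placeholder callables, not the actual node graph):

def pseudo_i2v(vae, sampler, caption_model, image, num_frames=25):
    # Pseudo i2v: the image only steers the text prompt (OneVision-style).
    prompt = caption_model(image)
    return vae.decode(sampler(prompt, num_frames=num_frames))

def true_i2v(vae, sampler, image, prompt, num_frames=25):
    # True i2v: the image is encoded into latent space and conditions the sampler.
    init_latent = vae.encode(image)
    latents = sampler(prompt, num_frames=num_frames, init_latent=init_latent)
    return vae.decode(latents)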