LLM-Enhanced Video Workflow

Turn a few words into polished images and videos

LLM-driven, multi-stage diffusion made simple

Imagine generating fully polished images and videos from just a few words, without spending hours refining prompts.

This workflow combines:

  • LLM-based story generation

  • Structured diffusion sampling

  • Multi-stage iterative upscaling

  • WAN 2.x video generation

The result is a system where minimal input becomes structured, coherent, and visually strong output, while keeping anatomy stable and artifacts under control.

This workflow is intentionally not visually "pretty."
It is structured for readable control flow rather than aesthetic node alignment.


Core Concept

If you enter a very short prompt, for example:

1girl

the LLM will generate a structured story about the subject. It will invent details such as her appearance, clothing, expression, pose, environment, lighting conditions, mood, and contextual elements. The shorter your input, the more creative freedom the LLM has. The more specific your input, the more the generated story reflects your exact intentions.

Your original prompt is never replaced. It remains part of the final positive prompt. The LLM output is appended and merged with it before sampling. This means you do not lose information by using the LLM layer.

Scaling modifiers such as (happy:1.2) still work exactly as expected. They are passed through to the sampler and influence weighting normally.

There are two separate input fields for both image and video generation. The description field is visible to the LLM. The keywords field is not. The keywords field is appended after the LLM output and is ideal for art styles, LoRA trigger words, or technical modifiers that you do not want the LLM to reinterpret.

This separation allows clean structural generation while keeping stylistic control precise.
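The merge order described above can be sketched as simple string assembly. The function and field values below are illustrative, not the workflow's actual node names:

```python
def build_positive_prompt(user_prompt: str, llm_story: str, keywords: str) -> str:
    """Assemble the final positive prompt: the original input is kept,
    the LLM story is appended after it, and the keywords field (which the
    LLM never sees) goes last so the model cannot reinterpret it."""
    parts = [p.strip() for p in (user_prompt, llm_story, keywords) if p.strip()]
    return ", ".join(parts)

final = build_positive_prompt(
    "1girl, (happy:1.2)",  # original prompt; weight syntax passes through untouched
    "a young woman in a sunlit meadow, soft golden light",  # LLM-generated story
    "watercolor style, myLoraTrigger",  # keywords field (hypothetical values)
)
```

Because the original prompt always comes first and the keywords always come last, neither is lost or rewritten by the LLM layer.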


System Requirements and Setup

This workflow was tested on a Ryzen 7800X3D, 64GB RAM, and an RTX 4090. For WAN video generation, 24GB VRAM is strongly recommended.

Install Ollama first. Then open a terminal and run:

ollama run mistral-small3.2

This installs the mistral-small3.2 model. Keep Ollama running in the background while using the workflow.
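If you want to verify that Ollama is reachable before launching ComfyUI, the story generation is ultimately an ordinary HTTP POST to Ollama's local API. The payload below follows Ollama's documented `/api/generate` schema; the prompt text is just an example:

```python
import json

# Payload for Ollama's /api/generate endpoint (default port 11434).
# "stream": False requests one complete JSON response instead of chunks.
payload = json.dumps({
    "model": "mistral-small3.2",
    "prompt": "Expand this into a short visual story: 1girl",
    "stream": False,
})

# To send it while Ollama is running:
#   curl http://localhost:11434/api/generate -d "$PAYLOAD"
print(payload)
```

If this request returns a response, the workflow's LLM stage will work as well.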

After loading the workflow in ComfyUI, open ComfyUI Manager and click Install Custom Nodes. Missing nodes will break parts of the pipeline.



WAN 2.2 Dependencies

For WAN video generation to work correctly, the following files are required:

If any of these are missing or incorrectly connected, WAN video generation will fail or produce broken output.


Important โ€“ WAN 2.2 Model Connection

Currently, the workflow is configured with UNET loaders (GGUF) connected.

If you are using a safetensors WAN 2.2 model, you must:

  • Disconnect the GGUF UNET loader

  • Connect the Load Diffusion Model nodes directly to the Video Lora nodes

Otherwise, the video pipeline will not work correctly.

This only applies if you switch to a safetensors version of WAN 2.2.


How to Use the Workflow

For image generation, enter a short or detailed description in the Image Description field. This is what the LLM sees. In the Image Keywords field, add LoRA triggers, art styles, or special tokens that you want appended after the LLM output.

You do not need to add quality modifiers or negative prompts. Those are already included inside the nested workflow components.

The NSFW toggle controls output type. Set it to 0 for safe content or 1 for NSFW output.

For video generation, the same logic applies. The Video Description is processed by the LLM. The Video Keywords field is appended afterward for stylistic control.

The Megapixels for Video parameter controls the resolution of the generated video. A value of 0.66 works reliably and provides a good balance between quality and performance. Higher values increase VRAM usage.

For video length, 5 to 6 seconds tends to produce stable results. At 7 seconds or more, the model may start introducing looping artifacts or scene repetition.



Image Generation and Upscaling Pipeline

The upscaling pipeline evolved over time. The goal was to improve detail while preserving anatomy and avoiding distortion, especially in hands and faces.

The process begins with selecting an aspect ratio. An empty latent at 1 megapixel is created and padded to 32-pixel alignment. Alignment to 32 pixels is important because non-aligned resolutions can introduce border artifacts.
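The arithmetic behind the 1-megapixel, 32-aligned latent can be sketched like this (a minimal approximation of the resolution math, not the actual node):

```python
import math

def latent_size(aspect_w: int, aspect_h: int, megapixels: float = 1.0, align: int = 32):
    """Pick a width/height matching the aspect ratio, close to the target
    pixel count, with both sides snapped to a multiple of `align` to
    avoid border artifacts."""
    target = megapixels * 1_000_000
    w = math.sqrt(target * aspect_w / aspect_h)
    h = w * aspect_h / aspect_w
    snap = lambda v: max(align, round(v / align) * align)
    return snap(w), snap(h)

print(latent_size(16, 9))  # e.g. a 16:9 base resolution near 1 MP
```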

The base image is generated using 69 steps with Euler Ancestral and the Normal scheduler. Euler Ancestral produces strong structural foundations and dynamic compositions, which makes it well suited for the initial generation stage.

The first upscale increases resolution by a factor of 1.4 using two iterative steps at 24 sampling steps each, with Euler and the Linear_Quadratic scheduler. Denoise is set to 0.21. This stage allows moderate refinement while the image is still small enough to prevent large scale distortions.

Next comes face and hand refinement using DPM++ 2M with the Simple scheduler, 16 steps, and a denoise value of 0.21. DPM++ 2M is particularly good at preserving structure while improving micro detail. It helps stabilize anatomy and correct small irregularities without shifting the composition significantly.

After that, a second controlled upscale increases size by another factor of 1.2 in two steps. Because the image is now significantly larger than typical training resolutions, this stage must be conservative. DPM++ 2M with Simple scheduler is used again, with 16 steps and denoise set to 0.2. The goal here is gentle refinement without introducing high resolution artifacts.

The final smoothing pass uses DDIM with the Simple scheduler, 12 steps, and denoise at 0.18. DDIM is stable and predictable, making it ideal for subtle finishing touches.

Upscaling tends to slightly desaturate images. To compensate, the final step performs color matching against the original base image to restore vibrancy and contrast.
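The color-matching idea can be approximated with per-channel mean/std transfer back to the base image's statistics. This is a common, generic technique; the workflow's actual color-match node may use a different method:

```python
import numpy as np

def match_color(image: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Shift each channel of `image` so its mean and standard deviation
    match `reference`, restoring saturation and contrast that iterative
    upscaling tends to wash out."""
    out = image.astype(np.float64).copy()
    ref = reference.astype(np.float64)
    for c in range(out.shape[-1]):
        src_mu, src_sd = out[..., c].mean(), out[..., c].std()
        ref_mu, ref_sd = ref[..., c].mean(), ref[..., c].std()
        if src_sd > 1e-8:  # avoid dividing by zero on flat channels
            out[..., c] = (out[..., c] - src_mu) * (ref_sd / src_sd) + ref_mu
    return np.clip(out, 0, 255).astype(np.uint8)
```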

Through experimentation, the most reliable pattern has been multiple small controlled upscales rather than one aggressive jump in resolution.
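For reference, the full schedule above condensed into data. The stage names are my own labels; the sampler and scheduler identifiers follow ComfyUI's naming conventions, which is an assumption about the exact node settings:

```python
# (stage, sampler, scheduler, steps, denoise, upscale factor per stage)
STAGES = [
    ("base",       "euler_ancestral", "normal",           69, 1.00, 1.0),
    ("upscale_1",  "euler",           "linear_quadratic", 24, 0.21, 1.4),  # 2 iterations
    ("face_hands", "dpmpp_2m",        "simple",           16, 0.21, 1.0),
    ("upscale_2",  "dpmpp_2m",        "simple",           16, 0.20, 1.2),  # 2 iterations
    ("smoothing",  "ddim",            "simple",           12, 0.18, 1.0),
]

total_scale = 1.0
for _, _, _, _, _, scale in STAGES:
    total_scale *= scale
print(f"total upscale factor: {total_scale:.2f}")
```

Note that every refinement stage keeps denoise at or below 0.21, which is the pattern that keeps anatomy stable at high resolutions.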


Sampler and Scheduler Insights

Euler Ancestral is excellent for the first pass because it encourages variation and strong structural emergence. It can introduce creative diversity while still forming a coherent base.

Euler without ancestral noise works better for controlled refinement. It reduces large structural shifts and is predictable.

DPM++ 2M performs well during detail enhancement stages. It maintains anatomy and fine structure better than many alternatives when working at higher resolutions.

DDIM is less aggressive and works well for final smoothing when you want stability rather than reinterpretation.

Regarding schedulers, Normal provides balanced behavior during initial generation. Linear_Quadratic smooths the refinement curve and helps avoid sudden tonal shifts. The Simple scheduler is consistent and stable during micro refinement.

In general, denoise values around 0.2 appear to be a sweet spot for iterative upscaling. Higher values tend to break anatomy at large resolutions.


Video Pipeline

For video generation, the final generated image is first downscaled to the megapixel value defined in the Video settings. It is then aligned to 32 pixels to prevent border artifacts.
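The downscale-to-megapixels step follows the same arithmetic as the latent sizing, just in reverse. A sketch, using the recommended 0.66 MP default from the settings above:

```python
import math

def video_size(img_w: int, img_h: int, megapixels: float = 0.66, align: int = 32):
    """Scale a generated image down to roughly `megapixels`, keeping its
    aspect ratio, then snap both sides to a multiple of `align` to
    prevent border artifacts in the video model."""
    scale = math.sqrt(megapixels * 1_000_000 / (img_w * img_h))
    snap = lambda v: max(align, round(v * scale / align) * align)
    return snap(img_w), snap(img_h)

print(video_size(1344, 736))  # hypothetical final-image resolution
```

Raising the megapixel value grows both dimensions and, with them, VRAM usage.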

The WAN 2.2 workflow is then executed. Tiled VAE decode is used because the standard decode can run out of memory at higher resolutions.

After decoding, the video is upscaled and RIFE frame interpolation is applied to improve motion smoothness. This produces more fluid animation and reduces visible stepping between frames.



Favorite Base Models

Just go to https://civitai.com/user/reijlita/models and download any model; they are all awesome!

Favorite LoRAs

Just try whatever you want.


Final Thoughts

This workflow focuses on clarity of control flow, structured prompt expansion, and conservative high resolution refinement.

Minimal input is enough to produce rich output. Detailed input is preserved and respected. The LLM does not replace your prompt; it enhances it.

If you prefer direct manual prompting, you can still use the workflow without relying heavily on the LLM. But when used as intended, it significantly reduces prompt micromanagement while improving scene coherence and anatomical stability.
