Turn a few words into polished images and videos
LLM-driven, multi-stage diffusion made simple
Imagine generating fully polished images and videos from just a few words, without spending hours refining prompts.
This workflow combines:
LLM-based story generation
Structured diffusion sampling
Multi-stage iterative upscaling
WAN 2.x video generation
The result is a system where minimal input becomes structured, coherent, and visually strong output, while keeping anatomy stable and artifacts under control.
This workflow is intentionally not visually "pretty."
It is structured for readable control flow rather than aesthetic node alignment.
Core Concept
If you enter a very short prompt, for example:
1girl
the LLM will generate a structured story about the subject. It will invent details such as her appearance, clothing, expression, pose, environment, lighting conditions, mood, and contextual elements. The shorter your input, the more creative freedom the LLM has. The more specific your input, the more the generated story reflects your exact intentions.
Your original prompt is never replaced. It remains part of the final positive prompt. The LLM output is appended and merged with it before sampling. This means you do not lose information by using the LLM layer.
Scaling modifiers such as (happy:1.2) still work exactly as expected. They are passed through to the sampler and influence weighting normally.
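As an illustrative sketch (not the sampler's actual parser), the emphasis syntax can be read with a simple regular expression. The function name and regex here are assumptions for demonstration only:

```python
import re

# Matches ComfyUI-style emphasis modifiers such as (happy:1.2)
WEIGHT_RE = re.compile(r"\((.+?):([\d.]+)\)")

def extract_weights(prompt):
    """Return a list of (token, weight) pairs found in a prompt string."""
    return [(m.group(1), float(m.group(2))) for m in WEIGHT_RE.finditer(prompt)]
```

For example, `extract_weights("1girl, (happy:1.2), forest")` yields `[("happy", 1.2)]`, showing that the modifier survives the LLM merge step untouched.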
There are two separate input fields for both image and video generation. The description field is visible to the LLM. The keywords field is not. The keywords field is appended after the LLM output and is ideal for art styles, LoRA trigger words, or technical modifiers that you do not want the LLM to reinterpret.
This separation allows clean structural generation while keeping stylistic control precise.
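The merge order described above can be sketched as a small helper. This is a minimal illustration of the concept, not the workflow's actual node logic:

```python
def build_positive_prompt(user_prompt, llm_story, keywords):
    """Assemble the final positive prompt in the order the workflow uses:
    original prompt first, LLM expansion appended, keywords last
    (the keywords field is never shown to the LLM)."""
    parts = [p.strip() for p in (user_prompt, llm_story, keywords) if p and p.strip()]
    return ", ".join(parts)
```

Because the user prompt always comes first and is never dropped, short and detailed inputs are handled by the same path.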
System Requirements and Setup
This workflow was tested on a Ryzen 7800X3D, 64GB RAM, and an RTX 4090. For WAN video generation, 24GB VRAM is strongly recommended.
Install Ollama first. Then open a terminal and run:
ollama run mistral-small3.2

This installs the mistral-small3.2 model. Keep Ollama running in the background while using the workflow.
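The workflow's LLM nodes talk to the local Ollama server for you, but if you want to verify the setup yourself, a request to Ollama's default generate endpoint looks roughly like this. The prompt template and function names are assumptions for illustration:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(prompt, model="mistral-small3.2"):
    # stream=False asks Ollama to return the full response as one JSON object.
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def expand_prompt(prompt):
    """Send a short prompt to the local Ollama server and return its expansion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(f"Write a detailed scene description for: {prompt}"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

If the request fails with a connection error, Ollama is not running in the background.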
After loading the workflow in ComfyUI, open ComfyUI Manager and click Install Custom Nodes. Missing nodes will break parts of the pipeline.
WAN 2.2 Dependencies
For WAN video generation to work correctly, the following files are required:
DaSiWa-WAN 2.2 I2V 14B TastySin v8
Wan2_1_VAE_bf16.safetensors
nsfw_wan_umt5-xxl_fp8_scaled.safetensors
https://huggingface.co/NSFW-API/NSFW-Wan-UMT5-XXL/tree/main
Place the UMT5 text encoder in models/clip.
If any of these are missing or incorrectly connected, WAN video generation will fail or produce broken output.
Important โ WAN 2.2 Model Connection
Currently, the workflow is configured with UNET loaders (GGUF) connected.
If you are using a safetensors WAN 2.2 model, you must:
Disconnect the GGUF UNET loader
Connect the Load Diffusion Model nodes directly to the Video Lora nodes
Otherwise, the video pipeline will not work correctly.
This only applies if you switch to a safetensors version of WAN 2.2.
How to Use the Workflow
For image generation, enter a short or detailed description in the Image Description field. This is what the LLM sees. In the Image Keywords field, add LoRA triggers, art styles, or special tokens that you want appended after the LLM output.
You do not need to add quality modifiers or negative prompts. Those are already included inside the nested workflow components.
The NSFW toggle controls output type. Set it to 0 for safe content or 1 for NSFW output.
For video generation, the same logic applies. The Video Description is processed by the LLM. The Video Keywords field is appended afterward for stylistic control.
The Megapixels for Video parameter controls the resolution of the generated video. A value of 0.66 works reliably and provides a good balance between quality and performance. Higher values increase VRAM usage.
For video length, 5 to 6 seconds tends to produce stable results. At 7 seconds or more, the model may start introducing looping artifacts or scene repetition.
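To translate seconds into the frame counts the model expects, the following sketch assumes WAN's typical 16 fps output and its usual constraint that frame counts take the form 4k + 1; treat both as assumptions to verify against your own setup:

```python
def wan_frame_count(seconds, fps=16):
    """Approximate WAN frame count for a clip length.
    WAN models typically expect frame counts of the form 4*k + 1."""
    raw = round(seconds * fps)
    return (raw // 4) * 4 + 1
```

Under these assumptions, a 5-second clip is 81 frames and a 6-second clip is 97 frames, both inside the stable range described above.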
Image Generation and Upscaling Pipeline
The upscaling pipeline evolved over time. The goal was to improve detail while preserving anatomy and avoiding distortion, especially in hands and faces.
The process begins with selecting an aspect ratio. An empty latent at 1 megapixel is created and padded to 32-pixel alignment. Alignment to 32 pixels matters because non-aligned resolutions can introduce border artifacts.
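The megapixel-plus-alignment step can be sketched as follows. This is an illustrative calculation, not the exact node implementation; it rounds each side up to the nearest multiple of 32:

```python
import math

def latent_size(aspect_w, aspect_h, megapixels=1.0, align=32):
    """Compute a width/height near the target megapixel count,
    padded up to multiples of `align` to avoid border artifacts."""
    target = megapixels * 1_000_000
    h = math.sqrt(target * aspect_h / aspect_w)
    w = h * aspect_w / aspect_h
    w = math.ceil(w / align) * align
    h = math.ceil(h / align) * align
    return int(w), int(h)
```

For a 16:9 image at 1 megapixel this gives 1344 x 768, both sides cleanly divisible by 32. The same calculation applies later when the video stage downscales to its own megapixel target.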
The base image is generated using 69 steps with Euler Ancestral and the Normal scheduler. Euler Ancestral produces strong structural foundations and dynamic compositions, which makes it well suited for the initial generation stage.

The first upscale increases resolution by a factor of 1.4 using two iterative steps at 24 sampling steps each, with Euler and the Linear_Quadratic scheduler. Denoise is set to 0.21. This stage allows moderate refinement while the image is still small enough to prevent large scale distortions.

Next comes face and hand refinement using DPM++ 2M with the Simple scheduler, 16 steps, and a denoise value of 0.21. DPM++ 2M is particularly good at preserving structure while improving micro detail. It helps stabilize anatomy and correct small irregularities without shifting the composition significantly.

After that, a second controlled upscale increases size by another factor of 1.2 in two steps. Because the image is now significantly larger than typical training resolutions, this stage must be conservative. DPM++ 2M with Simple scheduler is used again, with 16 steps and denoise set to 0.2. The goal here is gentle refinement without introducing high resolution artifacts.

The final smoothing pass uses DDIM with the Simple scheduler, 12 steps, and denoise at 0.18. DDIM is stable and predictable, making it ideal for subtle finishing touches.
Upscaling tends to slightly desaturate images. To compensate, the final step performs color matching against the original base image to restore vibrancy and contrast.
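One common way to implement this kind of color matching is a Reinhard-style per-channel mean/std transfer; the sketch below assumes that approach, which may differ from the exact node used in the workflow:

```python
import numpy as np

def match_color(image, reference, eps=1e-6):
    """Shift each channel of `image` so its mean and standard deviation
    match those of `reference` (Reinhard-style color transfer)."""
    img = image.astype(np.float64)
    ref = reference.astype(np.float64)
    for c in range(img.shape[-1]):
        i_mean, i_std = img[..., c].mean(), img[..., c].std()
        r_mean, r_std = ref[..., c].mean(), ref[..., c].std()
        img[..., c] = (img[..., c] - i_mean) * (r_std / (i_std + eps)) + r_mean
    return np.clip(img, 0, 255).astype(np.uint8)
```

Matching against the original base image restores the saturation and contrast that the repeated low-denoise passes wash out.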

Through experimentation, the most reliable pattern has been multiple small controlled upscales rather than one aggressive jump in resolution.
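The full pipeline described above can be summarized as data. The stage names below are labels invented for this sketch; the sampler, scheduler, step, denoise, and scale values come from the stages described in this section:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    sampler: str
    scheduler: str
    steps: int
    denoise: float
    scale: float = 1.0  # resolution multiplier applied by this stage

PIPELINE = [
    Stage("base",             "euler_ancestral", "normal",           69, 1.00),
    Stage("upscale_1",        "euler",           "linear_quadratic", 24, 0.21, 1.4),
    Stage("face_hand_refine", "dpmpp_2m",        "simple",           16, 0.21),
    Stage("upscale_2",        "dpmpp_2m",        "simple",           16, 0.20, 1.2),
    Stage("final_smooth",     "ddim",            "simple",           12, 0.18),
]

def total_upscale(stages):
    """Cumulative resolution factor across all stages."""
    factor = 1.0
    for s in stages:
        factor *= s.scale
    return factor
```

The cumulative factor is only 1.4 x 1.2 = 1.68, which illustrates the point: several small, controlled jumps rather than one aggressive one.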
Sampler and Scheduler Insights
Euler Ancestral is excellent for the first pass because it encourages variation and strong structural emergence. It can introduce creative diversity while still forming a coherent base.
Euler without ancestral noise works better for controlled refinement. It reduces large structural shifts and is predictable.
DPM++ 2M performs well during detail enhancement stages. It maintains anatomy and fine structure better than many alternatives when working at higher resolutions.
DDIM is less aggressive and works well for final smoothing when you want stability rather than reinterpretation.
Regarding schedulers, Normal provides balanced behavior during initial generation. Linear_Quadratic smooths the refinement curve and helps avoid sudden tonal shifts. The Simple scheduler is consistent and stable during micro refinement.
In general, denoise values around 0.2 appear to be a sweet spot for iterative upscaling. Higher values tend to break anatomy at large resolutions.
Video Pipeline
For video generation, the final generated image is first downscaled to the megapixel value defined in the Video settings. It is then aligned to 32 pixels to prevent border artifacts.
The WAN 2.2 workflow is then executed. Tiled VAE decode is used because the standard decode can run out of memory at higher resolutions.
After decoding, the video is upscaled and RIFE frame interpolation is applied to improve motion smoothness. This produces more fluid animation and reduces visible stepping between frames.
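The effect of frame interpolation on the timeline can be sketched with simple arithmetic. This assumes a 2x interpolation factor and WAN's nominal 16 fps; both are assumptions, since the actual RIFE node settings may differ:

```python
def interpolated_timeline(frame_count, fps, factor=2):
    """Frame count and fps after RIFE-style interpolation.
    Interpolation inserts (factor - 1) new frames between each
    consecutive pair, so clip duration is unchanged."""
    new_frames = (frame_count - 1) * factor + 1
    return new_frames, fps * factor
```

An 81-frame, 16 fps clip becomes 161 frames at 32 fps: the same 5 seconds, but with far less visible stepping between frames.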
Recommended Models & LoRAs
Favorite Base Models
Illustrij v20
A superb model. I have not tested v21 yet, but I am sure it is excellent as well.
IJsense v1
Significantly more realistic than Illustrij. I love it, but I mostly use it in combination with
Animij v8
Significantly more anime than Illustrij, but the combination with IJsense looks awesome. I tried v9, but the images did not come out as cleanly as with this combination.
However, I will say just go to https://civitai.com/user/reijlita/models and download any model, they are all awesome!
Favorite LoRAs
Dramatic Lighting Slider - Illustrious
https://civitai.com/models/1128288/dramatic-lighting-slider-illustrious
Looks great at strength 2.0 to 4.0 and greatly enhances lighting.
Dlang_Detailed eyes-Illustrious
Creates great-looking eyes; I normally use strength 0.5.
Detailer IL
A detailer that works great; the activation keyword is already included in the nested workflow node. Strength is normally 0.3.
People's Works +
It's what the people want. Strength normally 0.3
Just try whatever you want.
Final Thoughts
This workflow focuses on clarity of control flow, structured prompt expansion, and conservative high resolution refinement.
Minimal input is enough to produce rich output. Detailed input is preserved and respected. The LLM does not replace your prompt; it enhances it.
If you prefer direct manual prompting, you can still use the workflow without relying heavily on the LLM. But when used as intended, it significantly reduces prompt micromanagement while improving scene coherence and anatomical stability.


