This guide is now outdated; it is only valid up to version 1.2.
Step-by-Step Guide Series:
ComfyUI - FACE to VIDEO Workflow 1.X
This article accompanies this workflow: link
Foreword :
English is not my mother tongue, so I apologize for any errors. Do not hesitate to send me messages if you find any.
This guide is intended to be as simple as possible, and certain terms will be simplified.
Workflow description :
The aim of this workflow is to generate a video using the face from an existing photo, all within a single, simple interface.
Prerequisites :
ComfyUI
Models :
I2V Quant Model : city96/Wan2.1-I2V-14B-480P-gguf at main
in models/diffusion_models
Recommendation :
24 GB VRAM: Q8_0
16 GB VRAM: Q5_K_S
<12 GB VRAM: Q4_K_S
CLIP : split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors · Comfy-Org/Wan_2.1_ComfyUI_repackaged at main
in models/clip
CLIP-VISION : split_files/clip_vision/clip_vision_h.safetensors · Comfy-Org/Wan_2.1_ComfyUI_repackaged at main
in models/clip_vision
VAE : split_files/vae/wan_2.1_vae.safetensors · Comfy-Org/Wan_2.1_ComfyUI_repackaged at main
in models/vae
UPSCALE MODEL : ESRGAN/4x_NMKD-Siax_200k.pth · uwg/upscaler at main
in models/upscale_models
FLUX GGUF_Model : city96/FLUX.1-dev-gguf at main (huggingface.co)
"flux1-dev-Q8_0.gguf" in ComfyUI\models\unet
FLUX GGUF_clip : city96/t5-v1_1-xxl-encoder-gguf at main (huggingface.co)
"t5-v1_1-xxl-encoder-Q8_0.gguf" in \ComfyUI\models\clip
FLUX text encoder : ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF.safetensors · zer0int/CLIP-GmP-ViT-L-14 at main (huggingface.co)
"ViT-L-14-GmP-ft-TE-only-HF-format.safetensors" in \ComfyUI\models\clip
FLUX VAE : black-forest-labs/FLUX.1-dev at main (huggingface.co)
"ae" in \ComfyUI\models\vae
FLUX PuLID : pulid_flux_v0.9.0.safetensors · camenduru/PuLID at main
"pulid_flux_v0.9.0" in \ComfyUI\models\pulid
Custom Nodes :
PuLID needs Insightface to be installed in your Python environment :
Check your Python version :
For the Windows portable version : (the path depends on where you unzipped ComfyUI)
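For example (assuming the default folder layout of the portable build, where the embedded Python lives in the python_embeded folder; adjust the path to your setup) :
.\python_embeded\python.exe --version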
Download the Insightface .whl that matches your Python version here : Assets/Insightface at main · Gourieff/Assets
(Here, my local Python is 3.10 and the portable version is 3.12.)
Then install all prerequisites and insightface :
python.exe -m pip install --use-pep517 facexlib
python.exe -m pip install git+https://github.com/rodjjo/filterpy.git
python.exe -m pip install onnxruntime==1.19.2 onnxruntime-gpu==1.15.1 insightface-0.7.3-cp310-cp310-win_amd64.whl
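If you use the portable build, you can run the same commands with its embedded Python instead; the paths and the cp312 wheel name below are assumptions based on my own setup, so adjust them to yours and pick the .whl that matches your Python version :
.\python_embeded\python.exe -m pip install --use-pep517 facexlib
.\python_embeded\python.exe -m pip install git+https://github.com/rodjjo/filterpy.git
.\python_embeded\python.exe -m pip install onnxruntime==1.19.2 onnxruntime-gpu==1.15.1 insightface-0.7.3-cp312-cp312-win_amd64.whl
You can then check that Insightface is visible to Python :
.\python_embeded\python.exe -c "import insightface; print(insightface.__version__)"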
Don't forget to close the workflow and open it again once the nodes have been installed.
Usage :
One pre-production workflow :
the models used by the flow,
a node to add LoRAs.
The main workflow is composed of 4 main parts :
Configuration : where you define what you want,
Files : the files required for the workflow to run,
1st frame : import the image of the face you want to reproduce and describe the first frame,
Output : video display and saving.
And two optional parts :
Upscale : allows you to increase the video resolution
Interpolation : allows you to generate intermediate frames for greater fluidity
First frame files :
Choose your flux model :
Here, you can switch between Q8 and Q4 depending on how much VRAM you have. Higher quantizations give better quality, but are slower.
Choose your clip and text encoder :
As for the previous node.
Don't change the VAE and PuLID model :
Add as many LoRAs as you want :
I have not personally tested this workflow with LoRAs.
Configuration :
Write what you want in the “Positive” node :
Write what you don't want in the “Negative” node :
Select image format :
The larger it is, the better the quality, but the longer the generation time and the greater the VRAM required.
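For reference (this is my own assumption, not a value imposed by the workflow) : with the 480P model listed in the prerequisites, a resolution around 832x480 (landscape) or 480x832 (portrait) is a reasonable starting point.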
Choose a number of steps :
I recommend between 15 and 30. The higher the number, the better the quality, but the longer it takes to generate video.
Choose number of frames :
A video is made up of a series of images displayed one after another. Each image is called a frame, so the more frames you add, the longer the video.
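As a quick rule of thumb: video length in seconds = number of frames ÷ FPS. For example, at the 24 fps recommended in the Output section, 72 frames give a 3-second video (72 ÷ 24 = 3).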
Choose the guidance level :
I recommend starting at 6. The lower the number, the freer you leave the model. The higher the number, the more the image will resemble what you “strictly” asked for.
Choose a TeaCache coefficient :
This saves a lot of time on generation. The higher the coefficient, the faster the generation, but the greater the risk of quality loss.
Recommended setting :
for 480P : 0.13 | 0.19 | 0.26
for 720P : 0.18 | 0.20 | 0.30
Choose a shift level :
This allows you to slow down or speed up the overall animation. The default value is 8.
Choose sage attention :
Installing this option is quite complex and will not be explained here. If you don't know what it is, don't enable it.
Choose a sampler and a scheduler :
If you don't know what these are, don't touch them.
Define a seed or let Comfy generate one :
Write what you want for the first frame :
Import your base image :
Only the face will be recovered from this image.
Files :
Choose your model:
Here, you can switch between Q8 and Q4 depending on how much VRAM you have. Higher quantizations give better quality, but are slower.
For the VAE, don't change it :
For the CLIP, don't change it :
Select an upscaler : (optional)
I personally use 4x_foolhardy_Remacri.pth · utnah/LDSR at main.
For CLIP vision, don't change it :
Output :
Here you can change the name and path of the output file and the number of FPS. The higher the FPS, the smoother the video :
I've already set the parameters I recommend (24 fps); change them according to your preference.
Upscale : (optional)
Here you can enable upscaling :
Choose a ratio for upscaling :
Too large a setting results in a decrease in quality.
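For example, if your video is 480x832 (just an example resolution), an upscale ratio of 2 gives 960x1664, while a ratio of 1.5 gives 720x1248.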
Interpolation :
You can enable the setting to generate a smoother video.
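For example, with an interpolation factor of 2 (just an example value), a 72-frame clip becomes 144 frames; played back at 48 fps instead of 24 fps, it keeps the same length but looks noticeably smoother.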
Other :
A final option allows you to save the last frame so that you can use it to generate a new video that continues from the one you just generated.
Now you're ready to create your video.
Just click on the “Queue” button to start:
Once rendering is complete, the video appears in the “stage 2” node.
If you have enabled upscaling, the result is in the "Upscaler node".