Type | Workflows
Stats | 640
Reviews | 58
Published | Dec 8, 2024
Base Model |
Hash | AutoV2 73B518A30C
Workflow: Image -> Autocaption (Prompt) by Florence -> LTX Image to Video with STG
(creates clips of up to 10 seconds in under a minute; proven to work on 12 GB VRAM, possibly less)
--
V3.0: Introducing STG (Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling).
Included a SIMPLE and an ENHANCED workflow. The Enhanced version has additional features to upscale the input image, which can help in some cases. The SIMPLE version is recommended.
Replaced the height/width node with a "Dimension" node that drives the video size (default = 768; increasing to 1024 improves resolution but may reduce motion, and uses more VRAM and time). Unlike previous versions, the image will not be cropped.
Included the new node "LTX Apply Perturbed Attention", which carries the STG settings (for details on values/limits, see the note within the workflow).
The Enhanced version has an additional switch to upscale the input image (true) or not (false), plus a scale value (use 1 or 2) that defines the size of the image before it is injected, which can work a bit like supersampling (workflow: Upscale Input Image -> Apply CRF (Compression) -> Resize -> Inject into LTX Image to Video). As noted above, this is not required in most cases.
Pro Tip: Besides using the CRF value to drive movement, increase the frame rate in the yellow Video Combine node from 1 to 4+ to trigger further motion when the outcome is too static. (Thanks to reddit user jhow86)
The "Modify LTX Model" node changes the model within a session; if you switch to another workflow, make sure to hit "Free model and node cache" in ComfyUI to avoid interference.
--
ComfyUI Workflow for Image-to-Video with Florence2 Autocaption (v2.0)
This updated workflow integrates Florence2 for autocaptioning, replacing BLIP from version 1.0, and includes improved controls for tailoring prompts towards video-specific outputs.
New Features in v2.0
Florence2 Node Integration
Florence2 now appears in the GUI as a selectable node.
Options include generating captions at varying levels of detail: "caption," "detailed caption," or "more detailed caption."
Caption Customization
A new text node allows replacing terms like "photo" or "image" in captions with "video" to align prompts more closely with video generation.
Alternative terms such as "animation" or "clip" can also be used to influence the output style.
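The term substitution done by that text node can be sketched in plain Python (an illustrative stand-in, not the actual node code; the function name is made up):

```python
def retarget_caption(caption: str, target: str = "video") -> str:
    # Swap still-image terms for a video-oriented term such as
    # "video", "animation", or "clip", as the workflow's text node does.
    for term in ("photo", "image"):
        caption = caption.replace(term, target)
    return caption

print(retarget_caption("A photo of a cat sitting on a sofa."))
# -> A video of a cat sitting on a sofa.
```

Passing "animation" or "clip" as the target reproduces the alternative output styles mentioned above.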
Key Features Retained from v1.0
Enhanced Motion with Compression
To mitigate "no-motion" artifacts in the LTX Video model:
Pass input images through FFmpeg using H.264 compression with a CRF of 20–30.
This step introduces subtle artifacts, helping the model latch onto the input as video-like content.
CRF values can be adjusted in the "Video Combine" node (lower-left GUI).
Higher values (30–40) increase motion effects; lower values (~20) retain more visual fidelity.
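Outside ComfyUI, an equivalent compression pass can be reproduced with FFmpeg. The snippet below only assembles the command line (file names are placeholders; the exact flags the workflow's node uses internally are not shown on this page):

```python
import shlex

def ffmpeg_crf_command(src: str, dst: str, crf: int = 25) -> str:
    # Build an FFmpeg invocation that re-encodes the input with H.264
    # at the given CRF. Higher CRF (30-40) = stronger artifacts, which
    # the LTX model tends to read as motion; ~20 keeps more fidelity.
    args = ["ffmpeg", "-y", "-i", src,
            "-c:v", "libx264", "-crf", str(crf), dst]
    return shlex.join(args)

print(ffmpeg_crf_command("input.png", "compressed.mp4", crf=30))
```

Decoding a frame back out of the compressed file yields the artifact-laden image that gets injected into the sampler.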
Autocaption Enhancement
Text nodes for Pre-Text and After-Text allow manual additions to captions.
Use these to describe desired effects, such as camera movements.
Adjustable Input Settings
Width/Height & Scale: Define image resolution for the sampler (e.g., 768×512). A scale factor of 2 enables supersampling for higher-quality outputs. Use a scale value of 1 or 2.
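Assuming the scale factor simply multiplies the working resolution before the image reaches the sampler (an interpretation of the note above, not confirmed node internals), the arithmetic looks like this:

```python
def sampler_input_size(width: int = 768, height: int = 512, scale: int = 1):
    # Assumed behaviour: scale multiplies both dimensions, so scale=2
    # renders at double resolution, acting like supersampling.
    return width * scale, height * scale

print(sampler_input_size(768, 512, scale=2))  # (1536, 1024)
```

Per the workflow notes, only scale values of 1 or 2 are intended.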
Simple Workflow Description
Load Image: Drag and drop an image into the GUI (lower-left panel).
Preprocessing: ComfyUI automatically:
Upscales the image.
Applies compression (via FFmpeg, CRF adjustable).
Resizes the image to fit LTX model requirements.
Autocaptioning: Florence2 generates captions with optional Pre-/After-Text modifications.
Video Generation: The processed image is sent to the sampler, where motion is synthesized based on parameters set in the GUI.
Rendering: Output is compiled into a video file.
Pro Tips
Motion Optimization: If outputs feel static, incrementally increase the CRF value or adjust the Pre-/After-Text nodes to emphasize motion-related prompts. Also try changing the scale of your input image (use a value of 1 or 2).
Fine-Tuning Captions: Experiment with Florence2’s caption detail levels for nuanced video prompts.
Want to use your own prompt? Find the green node "CLIP Text Encode (Positive Prompt)" in the Florence section of the workflow, right-click it, and select "Convert Input to Widget"; a text field will then appear that you can use for prompting. You will still see the Florence caption in the GUI.