
LTX IMAGE to VIDEO with STG and autocaption workflow

Updated: Dec 11, 2024
Tags: tool, image2video, ltx-video
Type: Workflows
Published: Dec 8, 2024
Base Model: LTXV
Hash: AutoV2 73B518A30C

Workflow: Image -> Autocaption (Prompt) by Florence -> LTX Image to Video with STG

(Creates clips of up to 10 seconds in under a minute; confirmed working on 12 GB VRAM, possibly less.)

--

V3.0: Introducing STG (Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling).

This release includes a SIMPLE and an ENHANCED workflow. The Enhanced version adds features to upscale the input image, which can help in some cases. The SIMPLE version is recommended.

  • Replaced the height/width node with a "Dimension" node that drives the video size (default = 768; increasing to 1024 improves resolution but may reduce motion, and uses more VRAM and time). Unlike previous versions, the image will not be cropped.

  • Included the new node "LTX Apply Perturbed Attention", which holds the STG settings (for details on values and limits, see the note within the workflow).

  • The Enhanced version has an additional switch to upscale the input image (true) or not (false), plus a scale value (use 1 or 2) that defines the image size before injection, which can act a bit like supersampling. (Workflow: Upscale Input Image -> apply CRF (compression) -> Resize -> inject into LTX Image to Video.) As noted, this is not required in most cases.

Pro tip: Besides using the CRF value to drive movement, increase the frame rate in the yellow Video Combine node from 1 to 4+ to trigger additional motion when the result is too static. (Thanks to Reddit user jhow86.)

The "Modify LTX Model" node changes the model within a session. If you switch to another workflow, make sure to hit "Free model and node cache" in ComfyUI to avoid interference.

--

ComfyUI Workflow for Image-to-Video with Florence2 Autocaption (v2.0)

This updated workflow integrates Florence2 for autocaptioning, replacing BLIP from version 1.0, and includes improved controls for tailoring prompts towards video-specific outputs.

New Features in v2.0

  1. Florence2 Node Integration

    • Florence2 now appears in the GUI as a selectable node.

    • Options include generating captions at varying levels of detail: "caption," "detailed caption," or "more detailed caption."

  2. Caption Customization

    • A new text node allows replacing terms like "photo" or "image" in captions with "video" to align prompts more closely with video generation.

    • Alternative terms such as "animation" or "clip" can also be used to influence the output style.
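The term replacement described above boils down to a simple substitution. Here is a minimal sketch of that idea; the exact term list and the `video_align_caption` name are my assumptions, not the node's actual implementation:

```python
import re

def video_align_caption(caption: str, target: str = "video") -> str:
    # Swap still-image terms for a video-oriented one, mirroring what the
    # replacement text node does. The term list here is an assumption.
    return re.sub(r"\b(photo(?:graph)?|image|picture)\b", target,
                  caption, flags=re.IGNORECASE)

print(video_align_caption("A photo of a fox jumping over a log."))
print(video_align_caption("An image of a cat", "animation"))
```

Passing "animation" or "clip" as `target` corresponds to the alternative-terms option mentioned above.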


Key Features Retained from v1.0

Enhanced Motion with Compression

To mitigate "no-motion" artifacts in the LTX Video model:

  • Pass input images through FFmpeg using H.264 compression with a CRF of 20–30.

    • This step introduces subtle artifacts, helping the model latch onto the input as video-like content.

    • CRF values can be adjusted in the "Video Combine" node (lower-left GUI).

    • Higher values (30–40) increase motion effects; lower values (~20) retain more visual fidelity.
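Outside ComfyUI, the same compression step can be reproduced with a plain ffmpeg call. This sketch only builds the command line; the file names are placeholders, and the workflow performs this step internally via the Video Combine node:

```python
def build_crf_cmd(src: str, dst: str, crf: int = 25) -> list:
    # Re-encode a still image with H.264 at the given CRF to introduce the
    # subtle artifacts the LTX model latches onto. Placeholder file names.
    assert 0 <= crf <= 51, "libx264 CRF range is 0-51"
    return ["ffmpeg", "-y", "-i", src,
            "-c:v", "libx264", "-crf", str(crf),
            "-frames:v", "1", dst]

# To actually run it (requires ffmpeg on PATH):
# import subprocess
# subprocess.run(build_crf_cmd("input.png", "compressed.mp4", crf=28), check=True)
```

Raising `crf` toward 30-40 discards more detail and tends to increase motion, matching the tuning advice above.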

Autocaption Enhancement

  • Text nodes for Pre-Text and After-Text allow manual additions to captions.

    • Use these to describe desired effects, such as camera movements.

Adjustable Input Settings

  • Width/Height & Scale: Define image resolution for the sampler (e.g., 768×512). A scale factor of 2 enables supersampling for higher-quality outputs. Use a scale value of 1 or 2.
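The supersampling effect of the scale setting is just a multiplication of the working resolution. A small sketch, where snapping the result to a multiple of 32 is my assumption about LTX-friendly dimensions, not something the workflow states:

```python
def presample_size(width: int = 768, height: int = 512, scale: int = 2):
    # Pixel size of the image handed to the sampler. scale=2 supersamples
    # before injection; the divisible-by-32 snap is an assumption.
    assert scale in (1, 2), "use a scale value of 1 or 2"
    snap = lambda v: v * scale // 32 * 32
    return snap(width), snap(height)

print(presample_size())         # 768x512 supersampled at scale 2
print(presample_size(scale=1))  # native resolution
```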


Simple Workflow Description

  1. Load Image: Drag and drop an image into the GUI (lower-left panel).

  2. Preprocessing: ComfyUI automatically:

    • Upscales the image.

    • Applies compression (via FFmpeg, CRF adjustable).

    • Resizes the image to fit LTX model requirements.

  3. Autocaptioning: Florence2 generates captions with optional Pre-/After-Text modifications.

  4. Video Generation: The processed image is sent to the sampler, where motion is synthesized based on parameters set in the GUI.

  5. Rendering: Output is compiled into a video file.
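The five steps above form a linear pipeline. The stubs below only trace the node order; every function name is a placeholder for the corresponding ComfyUI node, not real API:

```python
# Each stub stands in for a ComfyUI node; the list-based "image" is a
# placeholder used only to show the order in which the nodes fire.
def upscale(img):          return img + ["upscale"]
def compress(img, crf):    return img + [f"h264 crf={crf}"]
def resize(img):           return img + ["resize to LTX dims"]
def florence(img):         return "a detailed caption"
def sample(img, prompt):   return img + [f"sampled with: {prompt}"]
def combine(frames):       return frames + ["video file"]

def simple_workflow(img, crf=25, pre_text="", after_text=""):
    frame = resize(compress(upscale(img), crf))       # steps 1-2
    prompt = pre_text + florence(frame) + after_text  # step 3
    return combine(sample(frame, prompt))             # steps 4-5
```

Note that the caption is generated from the preprocessed frame, so Pre-/After-Text edits only touch the prompt, never the image path.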


Pro Tips

  • Motion Optimization: If outputs feel static, incrementally increase the CRF value or adjust the Pre-/After-Text nodes to emphasize motion-related prompts. Try changing the scale of your input image (use a value of 1 or 2).

  • Fine-Tuning Captions: Experiment with Florence2’s caption detail levels for nuanced video prompts.

  • Want to use your own prompt? Find the green node "CLIP Text Encode (Positive Prompt)" in the Florence section of the workflow, right-click it, and select "Convert Input to Widget"; a text field will then appear for prompting. You will still see the Florence caption in the GUI.