S2V/I2V lip-synced music video assembly (tutorial/personal log)

(I am writing this mostly as a personal log so that I can retrace my steps should I want to try something similar down the line and have kind of forgotten exactly what I did, a recurring theme of my life)

For anyone that isn't me (i.e. you!), this is the final video for your reference:

This workflow involves various models/tools, but no actual files will be provided as separate downloads (they're mostly default ones anyway; you can throw these into Comfy/Swarm and it should give you what I used). This workflow by m8rr, with a few tweaks, was my base for the LTX2 generation.

I call this "detailed" because my process breaks the generation into individual elements that can be more finely controlled: I can reroll individual parts until they're good instead of praying that the I2V gets every detail in a 15-second video correct.

Goal: Animate an album art image to sing the song with proper lipsync

These are all the tools/models/things I used:

  • Flux, on SwarmUI (for the initial album cover image, featuring the Phandigrams LoRA)

  • ZIT, on SwarmUI (for some I2I editing)

  • Qwen Image Edit on ComfyUI (isolating image elements)

  • LTX-2 on ComfyUI (for video, duh)

  • Adobe Photoshop (overkill, any basic image editor will work)

  • DaVinci Resolve (key feature: greenscreen / keying)

  • Suno's music stem splitter (isolates vocals; there are many open-source alternatives too, see the sketch after this list)
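On the stem-splitting point: Demucs is one open-source option (my suggestion, not what I actually used; I used Suno's splitter). A minimal sketch driving its CLI from Python, with a hypothetical filename:

```python
import subprocess

# Split a track into vocals / everything-else using Demucs' two-stem mode.
# "song.mp3" is a hypothetical filename; output lands under separated/<model>/song/.
subprocess.run(["demucs", "--two-stems=vocals", "song.mp3"], check=True)
```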

This is the album art in question, straight out of T2I:

2253001-by Phandigrams an anime style digital i-svdq-fp4_r32-flux1-dev-1.jpg

I want her to sing with a bit of expressiveness, and I want the background to be slightly animated with softly falling snow. I extract the vocals, throw them into LTX2 (distill), re-dub the original audio over the output, and this is one of the first few videos I get:

Not a bad result, but I immediately identify a few different elements that I want to get correct:

  1. She needs to show the right amount of expressiveness. She's slightly under-expressive for my taste here. In some generations she's barely moving, and in others she's doing glitched-out 180s/360s.

  2. Her style changes a lot as the video goes on, most noticeably the blush on her cheeks, her fingernails, and the appearance of a ring on her ring finger. (The ring is something I'm always fighting in LTX. If anyone wants to make a de-ring-ifier LoRA I will be super super happy.)
    The video is on the long side, so there is a looot of time for LTX2 to get things wrong.

  3. The snow needs to fall at a realistic rate. In this generation it's a bit too fast. In some generations the snow didn't fall at all, and in others it didn't match the style (I got hexagonal snowflakes).

  4. The obscured text needs to be rendered correctly. LTX2 doesn't really do text, so I definitely cannot rely on rerolling.

If I isolate these elements into their own pathways, I can be much more controlled in getting a good output and not need to rely on luck:

  • From a theory/logic standpoint, let's say there are four variables to get correct, and per variable a rough estimate gives a 1 in 5 chance of a generation getting it right. That's a 1 in 625 chance to get every variable correct in the same generation. At 3 minutes per attempt, that's over a solid day of rendering time (625 × 3 minutes ≈ 31 hours), not to mention two and a half hours just to watch all those attempts.

  • "Correct" doesn't even mean I will like it. I can generate dozens of images before I settle on one. So if I change those odds to 1 in 20, the number of generations required becomes astronomically high.

  • If I keep each variable isolated, I only need to run those 1-in-5 / 1-in-20 generations four times, once per variable. There is a lot of extra overhead and manual work in splitting and merging, which is what the rest of this article will break down.
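A quick sanity check of that back-of-envelope math (same rough estimates as the bullets above, nothing measured):

```python
# Four independent variables, each with a rough 1-in-5 chance per generation.
p = 1 / 5
n_vars = 4
minutes_per_attempt = 3

# Rolled together: all four must land in the same generation.
expected_combined = (1 / p) ** n_vars                # 625 attempts on average
print(expected_combined * minutes_per_attempt / 60)  # ~31 hours of rendering

# Isolated: roll each variable in its own pipeline.
expected_isolated = n_vars * (1 / p)                 # ~20 attempts total
print(expected_isolated * minutes_per_attempt / 60)  # ~1 hour
```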

Section 1: Expressiveness and Style Consistency

If I can isolate the singer and choose a few key poses, I can feed those key poses back as intermediate frames to get her animating smoothly between them.

LTX2 is the star of the show, but Qwen Image Edit 2509 is our best supporting actor. I use Qwen to remove the background and replace it with a greenscreen:

234753_ComfyUI_00001_.png

Qwen is great at stuff like this and nails it on the first attempt.

(Note to self: Next time around I should manually input a color using multi-image input; consider the image and choose a color that is less intrusive in case there are artifacts)

I then use this image as the input image for LTX2 to look for good poses/keyframes. I don't need to run at full framerate, since I am exploring poses and not really interested in smooth animation, and cutting the framerate speeds up video generation.

This is one of the results I get:

I liked how she held her hands up to her head at "aching", as well as the way her skirt billowed out during the rock entry of the song, so I extracted those frames and noted their frame numbers.
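As an aside, pulling a single noted frame out of a clip doesn't need an editor; a sketch calling ffmpeg from Python, with hypothetical filenames and frame number:

```python
import subprocess

frame_no = 137  # hypothetical: the frame number noted while scrubbing
subprocess.run([
    "ffmpeg", "-i", "pose_exploration.mp4",
    "-vf", f"select=eq(n\\,{frame_no})",  # n is ffmpeg's 0-based frame counter
    "-vframes", "1",
    f"keyframe_{frame_no}.png",
], check=True)
```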

I repeat the process to look for one more keyframe. There are many inconsistent details (jewelry changes, nail color changes, etc.), but that is fine since I will correct those next:

I assemble the three keyframes plus the original starting frame into this grid using an image editor:

i2i.png

This grid then gets rerun through I2I at ~0.40 denoise with a similar prompt to the starting T2I image, and I get this:

0153001-a 2x2 grid of anime style digital illust-svdq-fp4_r32-flux1-dev-3.jpg

(Note to self: The reference images aren't treated as exact 1:1 matches, so it's okay if her lips don't perfectly match the input image)

By running the images as a grid instead of separately, I can ensure the minor details stay consistent throughout - same belt, same necklace, same hair ribbon, etc.

The grid is split back into the four images, and I place these at the appropriate frames for each pose. There is no need to set a "final" -1 keyframe; it's actually better to place the last keyframe a little earlier to minimize how many frames there are between reference images (the more frames between reference images, the more time LTX2 has to add unwanted/incorrect details).
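The splitting step can be done in any image editor, but it's also a few lines of Pillow if you're doing it repeatedly; a sketch with hypothetical filenames:

```python
from PIL import Image

grid = Image.open("grid_i2i_output.png")  # hypothetical: the reprocessed 2x2 grid
w, h = grid.size
tile_w, tile_h = w // 2, h // 2
for row in range(2):
    for col in range(2):
        box = (col * tile_w, row * tile_h, (col + 1) * tile_w, (row + 1) * tile_h)
        grid.crop(box).save(f"keyframe_{row * 2 + col}.png")
```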

(Note to self: try with 6 keyframes next time, since I can fit a 2x3 grid into a square image for I2I reprocessing)

After about 10 generations (~30 minutes) I get this, which I am happy enough with:

It's not perfect (that goddamn ring still makes an appearance), but it's much, much better than the ones without reference frames. If I really wanted to, I could spend more time rolling generations, or switch to Pro and negative-prompt it (at a ~4x speed penalty on my system).

Section 2: Fixing the background (snow and text)

Compared to the singer, the background is a much easier fix since there isn't too much going on.

I extract the background using our old buddy Qwen Image Edit 2509. Unlike LTX2, Qwen knows English, so I can simply tell it exactly what text was obscured by the singer:

235130_ComfyUI_00001_.png

It gets this on the first batch of generations.

I want snow added in the middle to match the rest of the snow, and I thought it would be cool to make the moon a crescent instead of a full moon, so I do some easy I2I work to get this:

0003001-An anime style digital illustration depi-svdq-fp4_r32-flux1-dev-1.jpg

Which gets thrown back into LTX2:

Got this result after 4 generations. Nobody will notice those trees changing weirdly since they're mostly behind the singer. Background done.

Section 3: Final Assembly

With the singer and background isolated into their own videos, assembly is simply a greenscreen composite. I use DaVinci Resolve (it's free!); the key effect to look for is the 3D Keyer. It also took me 30 minutes to find this goddamn option, which the online guides kept mentioning but which has apparently been moved since those guides were made, so here is a screenshot so that I (and maybe you) can find it quickly next time, because this was not the first time I spent an unreasonable amount of time looking for this option sadkl;fjasl;fjaefew9j3s3

image.png

(the OpenFX Overlay part, highlighted menu item)

This is where I noticed there is still a little bit of green artifacting, so as mentioned earlier I should choose a better color next time (e.g. a darker green distinct from the one on her skirt, blue, or a darkish yellow would all have been good options).
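If you want to eyeball how badly a key color will bleed before committing to it, a rough chroma-key preview in OpenCV is one way (my own sanity check, not part of the Resolve workflow; hypothetical filenames):

```python
import cv2
import numpy as np

frame = cv2.imread("singer_frame.png")  # hypothetical: one exported frame
hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

# Green sits around hue 60 on OpenCV's 0-179 hue scale; widen/narrow to taste.
mask = cv2.inRange(hsv, (40, 80, 80), (80, 255, 255))
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

preview = frame.copy()
preview[mask > 0] = (255, 0, 255)  # paint keyed pixels magenta to spot spill
cv2.imwrite("key_preview.png", preview)
```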

I also realized that the background video was 15 seconds when I have 16 seconds of singing, but w/e, it's just the background, so I stretch it out; nobody will ever notice.
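If you'd rather do that stretch outside Resolve, ffmpeg's setpts filter can handle it; a sketch via Python with hypothetical filenames (assumes the background clip has no audio worth keeping):

```python
import subprocess

# Stretch a 15 s clip to 16 s by scaling presentation timestamps by 16/15.
subprocess.run([
    "ffmpeg", "-i", "background_15s.mp4",
    "-filter:v", "setpts=(16/15)*PTS",
    "-an",  # drop audio; the song gets re-dubbed on the timeline anyway
    "background_16s.mp4",
], check=True)
```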

With the timeline set up, I export, and it's complete!

Things I should consider next time:

  • Chaining video generations. My system can do 20 seconds of video, but it takes a really long time to generate and I have to close literally everything (including the browser running the ComfyUI GUI). I've seen workflows (Benji's) where the last several frames of one output are reused as the input for the next generation (see the sketch after this list). This would give me even finer control; I could make separate video generations for each line of the song, etc.

  • Doing a final V2V pass to get rid of greenscreen motion-blur artifacts. Maybe something like feeding the exported video's latent directly into LTX2's Stage 2 detailer? Anyone know if this would work?
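For the chaining idea in the first bullet, one way to grab the tail frames of an output for the next generation's input (my assumption of a workable approach; hypothetical filenames):

```python
import subprocess
from pathlib import Path

Path("chain_frames").mkdir(exist_ok=True)
# -sseof seeks relative to end-of-file: here, the last half second of the clip.
subprocess.run([
    "ffmpeg", "-sseof", "-0.5", "-i", "part1.mp4",
    "chain_frames/%03d.png",
], check=True)
```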

If this guide helped you out:

  • Of course, direct buzz tips are the most obvious way, but not everyone has the means, and I don't even use buzz much anyway, so you can also:

  • Post an image to one of my models; here is CivChan for Flux (PG-13) and here is an SD 1.5 Austrian alpine town background (PG), with more in my profile

  • Like/Follow one of my Instagram accounts, here is a 100% SFW one I started recently: Topaz Sights

  • Give the song I made on Suno a listen, plus a like/comment if you have an account; it's a contest entry and engagement metrics are considered in the contest.
