This log will focus on Sound to Video.

Return to the Moon 🌙

https://civitai.red/articles/27881/hhhunters-logbook-02-moon-overview

Sound to Video ( S2V )

Where sound starts shaping motion ...

Well, for starters, just look at this :

https://www.comfy.org/workflows/templates-wan2_1_infinitetalk_music-1eab7aa23f6a/

Crazy, right ?

This is the incredible result of Infinite Talk ... Boss Level !!

But damn, the particular workflow you'll find on that page is extremely demanding on hardware :'(

I still haven't managed to run it successfully ... 😢

So I started looking for an alternative to get usable lip-sync results ...

That is when I tried this one : video_wan2_2_14B_s2v

( workflow attached )

Of course, it is not as accurate as Infinite Talk ...

But hey ... it got the job done ...

At least for my first kinky music video attempt :

https://civitai.red/posts/28325862

But as you can see, it's far from perfect ...

It tends to deform faces quite a lot !!

At least on my side, I never managed to stabilize it the way I wanted.

Important note : Audio separation

When trying to use the video_wan2_2_14B_s2v workflow,

I ran into a small issue ...

Sometimes, the lips were moving on drum hits, or on guitar / synth riffs ...

Which makes sense, in a way : the model is driven by the whole audio mix, not just the voice ...

So I had to find a way to separate the vocals from the rest of the audio !

Luckily for me, there is a part of the Infinite Talk workflow that does exactly that !

So I isolated that part of the workflow
( you'll find it attached as well ) ...

And it works like a charm !

Just drop a song into the Audio Loader,
run the workflow,
and voilà !

Then you can use the isolated vocal track to get much better lip-sync results.

LTX 2.3

Anyway, I was still not fully satisfied with the result of the music video ...
So I kept looking for alternative solutions ...

My next step was to try some of the workflows built around LTX 2.3,
just to see if things were any better on that side ...

... and honestly ??

At least on my side, the results were crap 😅

Don't waste your time on those workflows.

Infinite Talk

So in the end, I told myself :
alright, let's take Infinite Talk more seriously ...

Because in terms of lip-sync precision,
it clearly looked like the strongest option !

And this time, instead of trying the workflow shown on the page above,
I decided to try the one included in the ComfyUI templates ...

And ... VICTORY, my friends !! 🥳🥳🥳

Not only does it run,
but the results are honestly very convincing !

You'll see that in the next clip
( I'm working on it ).

Now, the Infinite Talk workflow included in the ComfyUI templates
is designed for multi-talk situations,

with a mask system and audio concatenation ...

And I did not need any of that.

So I simplified the workflow for my own use case :

one singer only, nothing more, nothing less !!
( you'll find that workflow attached )

Just be careful :

I also modified the workflow to work on two blocks of 181 frames at 25 FPS ...

Since I generate full songs around 2 minutes long, my process goes like this :
first, I separate the vocals from the rest of the track.

Then I take the vocals-only version into Cubase
and slice the song into 9 parts of 14 seconds each.
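
For anyone without Cubase, the slicing step can also be sketched in plain Python. This is only a minimal sketch, not the Cubase method described above : it assumes the isolated vocals were exported as an uncompressed PCM WAV file, and the names split_wav, part_seconds and the part_XX.wav output pattern are hypothetical.

```python
import wave
from pathlib import Path


def split_wav(src, out_dir, part_seconds=14.0):
    """Split a PCM WAV file into consecutive parts of `part_seconds` each.

    Hypothetical helper: returns the list of written paths.
    The last part may be shorter than the others.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []

    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        # Number of audio frames (samples per channel) in one part.
        frames_per_part = int(round(part_seconds * reader.getframerate()))
        index = 1
        while True:
            chunk = reader.readframes(frames_per_part)
            if not chunk:
                break  # end of file reached
            target = out_dir / f"part_{index:02d}.wav"
            with wave.open(str(target), "wb") as writer:
                # Same channels / sample width / rate as the source;
                # the frame count in the header is fixed on close.
                writer.setparams(params)
                writer.writeframes(chunk)
            written.append(target)
            index += 1

    return written


if __name__ == "__main__":
    # "vocals.wav" is a placeholder name for the isolated vocal track.
    if Path("vocals.wav").exists():
        for path in split_wav("vocals.wav", "parts", part_seconds=14.0):
            print(path)
```

Each part can then be loaded as the audio input of one shot. Note that this cuts at fixed times, so a cut can land in the middle of a word ; slicing by hand lets you choose the cut points.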

For now, I generate each shot in 2 blocks of 181 frames,
which gives me roughly 14 seconds per run in my current setup.

That means 9 generated shots for one full song.

At around 19 minutes per shot,
the whole thing takes about 2 hours 51 minutes of generation for the 9 shots.
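
The numbers above can be checked with a few lines of Python. This is just a sketch of the arithmetic, using the figures from this log ; the function names are hypothetical, and it ignores any overlap frames between the two blocks, so the real clip length may be slightly shorter.

```python
# Figures taken from this log (my setup, my workflow settings).
BLOCKS_PER_SHOT = 2
FRAMES_PER_BLOCK = 181
FPS = 25
SHOTS = 9
MINUTES_PER_SHOT = 19


def shot_seconds(blocks=BLOCKS_PER_SHOT, frames=FRAMES_PER_BLOCK, fps=FPS):
    """Length of one generated shot in seconds (no block overlap assumed)."""
    return blocks * frames / fps


def total_generation_minutes(shots=SHOTS, minutes_per_shot=MINUTES_PER_SHOT):
    """Total generation time for all shots, in minutes."""
    return shots * minutes_per_shot


def format_hm(minutes):
    """Format a number of minutes as 'H h MM min'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours} h {rest:02d} min"


if __name__ == "__main__":
    print(shot_seconds())                         # 14.48
    print(SHOTS * 14)                             # 126 seconds of sliced audio
    print(format_hm(total_generation_minutes()))  # 2 h 51 min
```

So 2 x 181 frames at 25 FPS is about 14.5 seconds, which comfortably covers a 14-second slice, and 9 slices of 14 seconds cover 126 seconds of song.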

Honestly, for that level of quality,
I find that pretty reasonable ...

Well ... when everything goes well 😅

Because of course,
we are never completely safe from a cursed seed.

Current Summary

Infinite Talk

Super precise lip-sync ...
A bit rigid, but maybe I can still make the shots feel more alive ...

S2V

Quite unstable ... Less precise ...
Sometimes interesting ( more lively ), but it often goes off the rails ...

LTX

Best avoided !!

Current Setup

All the numbers and observations shared here are based on this setup.

CPU : Intel Core i7-14700KF
RAM : 32 GB DDR5 / 6000 MT/s
GPU : NVIDIA GeForce RTX 4070 / 12 GB VRAM

When VRAM is limited -- or unavailable -- CPU and RAM matter a lot because offloading becomes part of the process.

Work in progress ...

HHHunter's Logbook - #07 - Saturn - Sound to Video
