
Wan 2.1 50s continuous take test

infearia


Link to original thread on Reddit:

https://www.reddit.com/r/StableDiffusion/comments/1mnxdy6/wan_21_vace_50s_continuous_shot_proof_of_concept/

Can't edit my original post, so I have to put the explanation in a separate comment.

I'll first copy and paste a response I made last night to a question in this thread, because it sums up my whole method pretty succinctly. I'll elaborate on it afterwards:

The basic idea is this: if I want to generate a sequence of multiple short videos in order to stitch them together into one long shot, then instead of the typical method of rendering the first video and using its last frame(s) as the start frame(s) of the second video, I generate the videos in the opposite order. I render the last video in the sequence first, then use its start frames as the end frames for the second-to-last video, and so on. Finally, I stitch the videos together as usual and use some cross-fading to hide the seams.

The main question is, how to generate the sequence of short videos, starting at the end? So far I came up with two approaches:

1) V2V - generate one long control video (depth, canny, pose etc.) ahead of time and cut it up into chunks of <= 81 frames, then render the chunks out, starting with the last one, then the second-to-last, and so on. This is the method I used to generate the video in my original post, and you will find a step-by-step walkthrough of the whole process below.
2) T2V/I2V - similar to 1), but instead of creating the "chunks" by cutting up an existing control video, simply plan your shot ahead of time and split it into sequences (each possibly with its own prompt and start/end/keyframes) that can be rendered in <= 81 frames each, then render them last to first and stitch them together, same as in method 1). This is still theoretical, but I plan to test it next, and if I succeed I will write another, more detailed post about it.
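
Here's a minimal Python sketch of how the chunk boundaries for approach 1) could be worked out, last chunk first. It is only an illustration, not part of my actual workflow: `plan_chunks` is a made-up helper, frame ranges are assumed to be zero-based and end-exclusive, and the defaults (81-frame chunks, 10 overlapping frames) are the numbers from the walkthrough below. Any leftover frames at the start of the control video are folded into the earliest chunk, which is one way to end up with a chunk longer than 81 frames.

```python
def plan_chunks(total_frames, chunk_len=81, overlap=10):
    """Return control-video ranges (start, end) in render order, last chunk first.

    Ranges are zero-based and end-exclusive (an assumption of this sketch).
    Every chunk except the first one rendered takes `chunk_len - overlap`
    control frames; the remaining `overlap` frames come from the start of
    the previously rendered video.
    """
    if overlap < 0 or overlap >= chunk_len:
        raise ValueError("overlap must be >= 0 and smaller than chunk_len")
    if total_frames <= chunk_len:
        return [(0, total_frames)]

    stride = chunk_len - overlap
    start = total_frames - chunk_len
    chunks = [(start, total_frames)]  # last chunk: pure control video

    # Walk backwards through the control video in steps of `stride`.
    while start >= stride:
        chunks.append((start - stride, start))
        start -= stride

    # Fold any leftover head frames into the earliest chunk.
    if start > 0:
        _, earliest_end = chunks[-1]
        chunks[-1] = (0, earliest_end)
    return chunks


plan = plan_chunks(800)
for label, (start, end) in zip("ABCDEFGHIJK", plan):
    print(f"Video {label}: control frames {start} to {end}")
```

With 800 frames this gives 11 chunks. The first one rendered covers frames 719 to 800, and the earliest one covers 80 control frames, which together with the 10 overlap frames makes 90.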

I understand my explanations may sound a bit confusing at first. I'm in a bit of a hurry so the text is not as polished as it could be, but I also don't want to keep you waiting. If something is unclear just re-read it a few times and if you still have questions just ask, and I will try to answer them as best as I can.

Step-by-step walkthrough:

1) Using the free Abandoned City Generator (https://mihamarinko.gumroad.com/l/abandonedCity), I generated a simple city scene in Blender (no lights or materials, just meshes).
2) Created a simple keyframe animation of a camera flying through the desolate city.
3) Rendered a Z-Pass of the scene (800 frames, 960x544px, 16fps). This will be the depth control video for VACE.
4) Used a simple prompt describing an abandoned metropolitan city, plus the last 81 frames of the Z-Pass video (frames 719 to 800) as control video, to generate a 5-second video (Video A).
5) Generated another 81-frame video (Video B) by using Z-Pass frames 648 to 719 (648 = 800 - 81 - 71) plus the first 10 frames of Video A as control video.
6) Generated another 81-frame video (Video C) by using Z-Pass frames 577 to 648 (577 = 800 - 81 - 71*2) plus the first 10 frames of Video B as control video.
7) Generated another 81-frame video (Video D) by using Z-Pass frames 506 to 577 (506 = 800 - 81 - 71*3) plus the first 10 frames of Video C as control video.
8) I think you can see now where this is going...
9) Since this was my first test, I miscalculated a bit and ended up with 90 frames to render for the final video (Video K) instead of 81. However, VACE would only let me render either 89 or 97 frames, so I rendered 89 and just left the first frame out.
10) I now had a sequence of 11 videos: K J I H G F E D C B A. The last 10 frames of video K overlapped with the first 10 frames of video J, the last 10 frames of video J overlapped with the first 10 frames of video I, and so on. Well, the videos did not overlap exactly: because of how VACE works, there was color shift and deterioration between corresponding frames. I tried to minimize this effect in the next step.
11) Next, I went through every pair of consecutive videos (K and J, J and I, and so on down to B and A) and did the following:

a) Grabbed the last 10 frames of the first video (X) and the first 10 frames of the next video in the sequence (Y)
b) Created a cross-fade effect between the frames using Kijai's Cross Fade Images node
c) Used the resulting 10 cross-faded frames to overwrite the last 10 frames in video X
d) Deleted the first 10 frames from video Y
e) Repeated the process for every video except the last one
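
For anyone who wants to do the seam blending outside of ComfyUI, the cross-fade steps above can be sketched in Python roughly like this. It's an assumption-heavy illustration: `crossfade_pair` is a made-up helper, it uses a plain linear ramp (the Cross Fade Images node may blend differently), and frames are assumed to be NumPy arrays or anything else that supports multiplication by a float and addition.

```python
def crossfade_pair(video_x, video_y, overlap=10):
    """Blend the tail of video_x into the head of video_y.

    Returns (new_x, new_y): new_x keeps its length, but its last `overlap`
    frames are replaced by the blend; new_y loses its first `overlap` frames.
    """
    if overlap < 1 or overlap > min(len(video_x), len(video_y)):
        raise ValueError("overlap must be between 1 and the shorter clip length")

    tail = video_x[-overlap:]   # last frames of X
    head = video_y[:overlap]    # first frames of Y

    blended = []
    for i, (frame_x, frame_y) in enumerate(zip(tail, head)):
        # Linear ramp (assumption): the weight of Y rises across the overlap
        # and stays strictly between 0 and 1.
        w = (i + 1) / (overlap + 1)
        blended.append(frame_x * (1.0 - w) + frame_y * w)

    new_x = list(video_x[:-overlap]) + blended
    new_y = list(video_y[overlap:])
    return new_x, new_y
```

Applied to every pair in playback order, this leaves each video after the first one 10 frames shorter, so the clips can simply be concatenated afterwards.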

All that was left to do now was to stitch the frames together and crop them, and I ended up with a 50-second animation with no quality degradation and barely visible seams!
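
As a quick sanity check on the length, here is the frame arithmetic as a small Python sketch. It assumes video K has 89 frames and the other ten videos have 81 frames each, as described above; `stitched_length` is just an illustrative helper.

```python
def stitched_length(clip_lengths, overlap=10):
    """Frames left after stitching clips whose neighbours share `overlap` frames."""
    if not clip_lengths:
        return 0
    # The first clip keeps all its frames; every later clip loses its first `overlap` frames.
    return clip_lengths[0] + sum(n - overlap for n in clip_lengths[1:])


lengths = [89] + [81] * 10   # K, then J through A, in playback order
total = stitched_length(lengths)
seconds = total / 16         # 16 fps, as in the Z-Pass render
print(total, seconds)        # 799 49.9375
```

That's one frame short of the 800 rendered in Blender, which matches the single frame left out of the last render.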

