
My High-level Guide to Local Movie Generation

Step-by-step Process

  1. Write a beat sheet or script.

  2. Generate dialogue.

    1. This is currently the only area where I have to fall back to closed source. I've tried multiple local voice cloning tools like https://huggingface.co/spaces/mrfakename/E2-F5-TTS, https://rentry.co/GPT-SoVITS-guide#/, and https://github.com/serp-ai/bark-with-voice-clone. They're all bad in their own way. Elevenlabs is currently the only option for voice cloning and generative speech that doesn't sound unintentionally hilarious.

  3. Generate sound effects.

    1. You can do this for free on Elevenlabs if you just want to browse sound effects other people have already generated. Generating your own sound effects from text requires credits.

    2. https://github.com/kijai/ComfyUI-MMAudio/tree/main is also a way to add sound effects and music to clips. I haven't used this one yet personally.

    3. Dropping an episode of a show or a movie into Audacity will often split the audio into separate tracks (one per surround channel), typically with the sound effects more or less isolated on their own track(s). That makes them easy to clip out for reuse (see the channel-split sketch after this list).

  4. Add music.

    1. Special callout to https://mvsep.com/en here and the BANDIT Plus audio separation model. This is free and pretty good at separating vocals from music if you want an instrumental version of something. It’s also useful for cleaning up vocals if you’re trying to clone a voice from TV/movies.

    2. You can use a free virtual audio cable (https://vb-audio.com/Cable/) to record high-fidelity audio playing on your computer through streaming apps. I do this sometimes as I'm roughing in music; it saves me having to play a song on Spotify alongside my clip every time I want to check whether the music matches the scene (see the recording sketch after this list).

  5. Generate imagery.

    1. I personally use Flux since that's what I've trained my LoRAs on and I enjoy its prompt adherence.

    2. Because image gen is still such a crapshoot, I usually brainstorm a few ideas for the framing of a scene and then feed that into ChatGPT to get variations. For example, "Give me 10 variations of this prompt that explore different framing, lighting, angles, and perspectives: A sorceress sitting opposite a warrior at a table in a medieval tavern."

    3. One thing you can do to give your images a consistent cinematic feel is preface your prompt with the mood or atmosphere. I tend to use keywords like "cinematic still, moody atmosphere." For example, "moody atmosphere" generally washes the image in a blue/green tone that feels David Fincher-esque. Adding that key phrase to your prompts helps with consistency so you don't end up with images that are stylistically or tonally all over the place.

    4. Once I have my 10 versions of a prompt for a single shot, I dump those into Forge's "Prompts from file or textbox" script and set it to produce 25ish images per prompt. Then I select the best images (see the prompt-file sketch after this list).

  6. Generate video. I exclusively use image-to-video (I2V); text-to-video (T2V) is far too random and insane for even a half-serious project.

    1. https://github.com/kijai/ComfyUI-CogVideoXWrapper is still my personal favorite. It gives me the best mix of output quality and coherence.

    2. https://github.com/Lightricks/ComfyUI-LTXVideo/tree/master is fast, unbelievably fast. But quality-wise it's only ok at best.

    3. https://github.com/kijai/ComfyUI-HunyuanVideoWrapper is officially T2V-only right now, but there is some experimentation with I2V, and official I2V support is expected within the next month or so. I can see this becoming the new de facto local video gen solution.

  7. When lip syncing is needed:

    1. https://github.com/kijai/ComfyUI-LivePortraitKJ

    2. https://github.com/kijai/ComfyUI-MimicMotionWrapper (I haven't used this personally but I've heard it does ok at cloning face movements and head position).

    3. If I can't find an existing video suitable to clone a dialogue delivery performance from, I'll just record a short video of myself reciting the lines. That is then used as the "driving" video for LivePortrait.

    4. Even with absolute state-of-the-art tooling, lip syncing is still painfully obvious and distracting; even Hollywood doesn't have a solution for this. The workaround is to make smart choices as a director: use lip syncs sparingly and play with lighting and camera angles so that you're not highlighting the weakest component of your project.

  8. Cut it all together with Shotcut or the free video editor of your choice.
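
A few rough sketches for the steps above, starting with the channel-split trick from step 3.3. This is a minimal sketch assuming the episode carries a 5.1 surround mix and that ffmpeg is on your PATH; in a typical 5.1 mix the dialogue sits mostly in the center channel, so effects and music end up more or less isolated in the other channels. File names are placeholders.

```python
# Split a 5.1 mix into one mono WAV per channel so the effects/music channels
# can be clipped out on their own. Assumes ffmpeg is installed and on PATH.
import subprocess

SOURCE = "episode.mkv"  # placeholder input file
LABELS = ["FL", "FR", "FC", "LFE", "BL", "BR"]  # 5.1 channel order

cmd = [
    "ffmpeg", "-i", SOURCE,
    "-filter_complex",
    "[0:a:0]channelsplit=channel_layout=5.1" + "".join(f"[{lab}]" for lab in LABELS),
]
for label in LABELS:
    cmd += ["-map", f"[{label}]", f"{label}.wav"]  # e.g. FC.wav holds most of the dialogue

subprocess.run(cmd, check=True)
```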
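
For the virtual-cable recording in step 4.2, here is a rough sketch using the sounddevice and soundfile Python packages (my choice, not something the workflow requires). The device name below is what VB-Cable usually registers on Windows; check sounddevice.query_devices() on your machine.

```python
# Record whatever is currently playing through the VB-Audio virtual cable
# (e.g. a streaming app) to a scratch WAV for roughing in music.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 48000        # VB-Cable's usual default
DURATION_SECONDS = 30      # length of the scratch recording
DEVICE_NAME = "CABLE Output (VB-Audio Virtual Cable)"  # assumed device name

recording = sd.rec(
    int(DURATION_SECONDS * SAMPLE_RATE),
    samplerate=SAMPLE_RATE,
    channels=2,
    device=DEVICE_NAME,
)
sd.wait()  # block until the recording finishes
sf.write("scratch_music.wav", recording, SAMPLE_RATE)
```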
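
And for step 5.4, a minimal sketch of how a prompt list for Forge's "Prompts from file or textbox" script can be staged: one prompt per line, each prefixed with the mood keywords from step 5.3. The two variations shown are stand-ins for whatever ChatGPT returns.

```python
# Write the prompt variations to a text file, one per line, with a consistent
# mood prefix so the whole scene stays tonally coherent.
MOOD_PREFIX = "cinematic still, moody atmosphere, "

variations = [
    "A sorceress sitting opposite a warrior at a table in a medieval tavern, wide shot",
    "Low-angle view of a sorceress and a warrior facing each other across a tavern table",
    # ...the rest of the 10 variations from ChatGPT...
]

with open("tavern_scene_prompts.txt", "w", encoding="utf-8") as f:
    for prompt in variations:
        f.write(MOOD_PREFIX + prompt + "\n")

# Load tavern_scene_prompts.txt into the script, then set the batch count so
# each line produces ~25 images to choose from.
```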

Tips & Tricks

  1. Using reversed videos to overcome limitations. Imagine that you have a character who uses a very distinctive knife or dagger. You want a scene where the dagger enters from out of frame and plunges into an enemy. The problem is, without the dagger already in the image, your gen won't know to produce exactly that distinctive dagger. The solution is to reverse both the prompt and the video. Prompt an image for "X's dagger just penetrating an enemy." Then feed that image to your video gen and say "A dagger pulls back out of a person as blood gushes." That gives you a video you can then reverse to achieve the desired effect. There are nodes in ComfyUI to reverse videos (just search "reverse" when pulling up a new node). You can also use https://ezgif.com/reverse-video for free, or ffmpeg (see the sketch after these tips).

    I used this approach for https://civitai.com/images/47022746. The shards of mirror that rise out of the box needed to look a certain way when assembled, so I had to start with the assembled and risen mirror pieces, prompt the video for "pieces of a broken mirror falling into a box," and then reverse that video to achieve the desired effect.

  2. Use de-distilled checkpoints with Flux. I hit a wall early on with image generation for my Castlevania project. Even though I trained LoRAs on Sypha's unique elemental magic capabilities, the most popular Flux checkpoints would not render her magic faithfully. When I switched to https://civitai.com/models/843551/fluxdev-dedistilled I saw much better results. This also helped with night-creature and violence-related prompts. De-distilled checkpoints also honor negative prompts, which is useful for coercing the model into producing photorealistic images even when the LoRAs were trained exclusively on animated scenes. The major downside of de-distilled checkpoints is that inference time is extremely long (I usually need 60 steps per image), but for me it was often the only practical option.
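
For the reversal itself, here is a minimal sketch wrapping ffmpeg (assumed to be on your PATH); file names are placeholders. Note that ffmpeg's reverse/areverse filters buffer the entire clip in memory, so keep this to the short clips video gen produces anyway.

```python
# Reverse a short generated clip so "dagger pulls out" becomes "dagger plunges in".
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "dagger_pullout.mp4",  # the clip as generated
        "-vf", "reverse",                       # reverse the video frames
        "-af", "areverse",                      # reverse the audio too; drop these two
                                                # entries if the clip has no audio stream
        "dagger_stab.mp4",                      # the clip you actually wanted
    ],
    check=True,
)
```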

Philosophy

The tech in this space is changing so rapidly it’s almost pointless to try to put together a detailed how-to; most of it will be obsolete in six months. The advice I can give with the most permanence is to build up your idea from a solid foundation: A great concept or script can stand on its own. A good vocal performance with sound effects can stand on its own. I’m not as interested in staring at a slowly rotating image of a random character in space or a video of a spastic dancer; those don’t stand on their own. That’s why I recommend deferring the image and video gen aspects of AI media creation until the end.

When the barrier to entry is lowered for something that was once exclusive, it's inevitable to see a flood of low-quality content. In my lifetime I've seen this with four-track recorders, DAWs, self-published books, YouTube, TikTok, and now AI. The upside is that it gives creative people who would otherwise be undiscoverable or unable to compete for lack of money, connections, etc. an outlet for their talent. As the novelty of AI media generation wears off over the next few years I really look forward to the unique, original content that will be brought to life. Truly, the only limit will be your imagination.
