# A Brief Introduction to My AI Video Experience


I’ve been creating AI videos as a hobby for a whole year now. I’d like to share the techniques, workflows, and a few tips I’ve learned during my AI video production journey.

---

## Everything Began with Animatediff

Last year, when the AI video model Animatediff had just appeared, I created my first AI video with three scenes—swimsuit beauties, a stage singer, and fitness ladies. Even now, I still think it’s quite impressive, showcasing Animatediff’s power. That video was a typical Text-to-Video approach, which remains the mainstream in AI video today for its ease of getting started, stable images, and realistic motions. From that moment on, my AI creativity has mostly been focused on videos rather than still images.

---

## Animatediff

Animatediff became widely recognized starting with its V2 model and continued evolving through V3 and XL. There are also several community fine-tuned versions from various experts; I've tried three to five of them, and in practice they don't differ drastically. Around the time of V2 and V3, various interesting LoRAs emerged, such as the zoom in/out and rotating-camera motion LoRAs, plus features like SparseCtrl RGB, Sketch, and others (we'll revisit SparseCtrl RGB later).

Even before many other AI video models appeared, and even now that more mature AI video approaches are out, Animatediff still stands strong with distinct advantages. Unfortunately, the XL model was never really finished. It was originally meant to pair with Stable Diffusion XL, but it's currently just a beta version that doesn't perform well; in my experience, it simply can't generate high-quality videos with XL models yet.

---

## Common Use Cases and Workflows with Animatediff

### 1. The Simplest Text-to-Video

You can simply use the StableDiffusion WebUI. It’s straightforward to generate videos the same way you’d create images with prompts. For more dynamic results, include keywords like “dynamic pose” or “dynamic angle” in your prompts.
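For anyone who prefers scripts to the WebUI, the same text-to-video idea can be sketched with the diffusers library; the checkpoint and motion-adapter IDs below are illustrative, not the exact setup described above:

```python
# Minimal text-to-video sketch with AnimateDiff through the diffusers library.
# The checkpoint and motion-adapter IDs are examples; any SD 1.5 checkpoint
# with a matching motion adapter should behave the same way.
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")

# Keywords like "dynamic pose" / "dynamic angle" encourage larger motion.
result = pipe(
    prompt="1girl singing on a stage, dynamic pose, dynamic angle, best quality",
    negative_prompt="low quality, worst quality",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(result.frames[0], "text_to_video.gif")
```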

### 2. Virtual Characters Dancing (Using ComfyUI)

All subsequent workflows are also done in ComfyUI. (ComfyUI has a reputation for being difficult and complex precisely because it allows very intricate workflows and heavily customized results. After using the StableDiffusion WebUI for about 3–4 months, I decided to learn ComfyUI from scratch.)

One common type is a smooth, anime-style dance video. After I explored text-to-video, I moved quickly into creating these kinds of videos and continued for quite some time. There are two main approaches:

**(1) Simple live-action-to-anime dance**

The workflow: take a real-life dance video, use StableDiffusion’s image-to-image process (transforming real people into anime style), then pair it with Animatediff so every frame keeps a high degree of consistency. Finally, merge all frames into a finished video. This is relatively simple, because real humans and anime characters aren’t that different in aspects like hairstyle, clothing, and physique—there’s just a visual shift in the coloring style.
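As a rough illustration of this per-frame restyling idea outside ComfyUI, here is a minimal sketch using diffusers with an OpenPose ControlNet; the AnimateDiff consistency step is omitted, so adjacent frames will flicker more than in the full workflow, and all model IDs and filenames are just examples:

```python
# Rough sketch of the per-frame restyling idea with diffusers; the AnimateDiff
# consistency pass done in ComfyUI is omitted here, so adjacent frames will
# flicker more than in the full workflow. Model IDs and filenames are examples.
import cv2
import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

pose_detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

cap = cv2.VideoCapture("dance.mp4")
styled_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    src = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).resize((512, 768))
    pose = pose_detector(src)  # pose map that preserves the dance motion
    out = pipe(
        prompt="anime style, 1girl dancing, flat colors, best quality",
        image=src,                 # img2img keeps the overall composition
        control_image=pose,        # ControlNet keeps the pose
        strength=0.6,              # how far to drift from the source frame
        num_inference_steps=20,
    ).images[0]
    styled_frames.append(out)
cap.release()
# styled_frames can then be merged back into a video, e.g. with imageio or ffmpeg.
```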

**(2) Live-action to a distinctly different 2D anime character**

This is my favorite approach, but also the hardest. The ControlNet models commonly used for this kind of image-to-image transformation, OpenPose (which isn't as accurate as one might hope) and LineArt, become a balancing act. If their weights are set too high, it's difficult to transform something like a "sexy jeans-wearing lady" into a "long-haired warrior in a miniskirt." Also, for precise poses, OpenPose often needs Depth as a supplement; but Depth and LineArt are whole-image reference models, so if you set their weight too high you can't turn a "long-haired woman" into a "short-haired Misaka Mikoto."

Large differences between the original person and the target character usually require significant trial and error to balance the ControlNet weights; the result is never truly perfect, so you aim to get as close as possible. Accurately reproducing certain actions, like a character turning around, remains difficult with current technology.

Note that these difficulties mostly arise when “the real person and the desired final character differ significantly.” If you only want a mild “live-action anime-style transformation,” it’s far simpler. To achieve a perfect solution here would almost render 3D modeling unnecessary, but at today’s level of technology it’s still quite difficult.
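To make the weight balancing concrete, here is an illustrative sketch of running two ControlNets (OpenPose plus Depth) with separate conditioning scales in diffusers; the models, maps, and scale values are examples only, and in practice you tune them by trial and error as described above:

```python
# Illustrative sketch of weighting two ControlNets against each other. A strong
# pose weight plus a weakened depth weight leaves room to change hair, outfit,
# and body type. Model IDs, file names, and scale values are examples only.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

openpose_cn = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16
)
depth_cn = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[openpose_cn, depth_cn],  # multiple ControlNets in one pipeline
    torch_dtype=torch.float16,
).to("cuda")

pose_map = load_image("pose.png")    # precomputed OpenPose map of the source frame
depth_map = load_image("depth.png")  # precomputed depth map of the source frame

image = pipe(
    prompt="short-haired Misaka Mikoto, school uniform, anime style, best quality",
    image=[pose_map, depth_map],
    # Keep the pose strong, but weaken depth so the silhouette may change.
    controlnet_conditioning_scale=[1.0, 0.4],
    num_inference_steps=25,
).images[0]
image.save("keyframe.png")
```

Lowering the second scale is exactly the trade-off discussed above: the silhouette is allowed to change, at the cost of a less faithful match to the source frame.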

### 3. Slight Movements in a Virtual Character Video

Later, I began exploring how to make videos with small, screensaver-like movements. I first tried SparseCtrl RGB: in ComfyUI I generate two images that are nearly identical in pose, then use SparseCtrl RGB + Animatediff to interpolate smooth motion between them.

Though the results are sometimes underwhelming, occasionally I can produce something great. The advantage is excellent consistency: the character doesn’t end up radically changed—Misaka Mikoto won’t become Yui from K-On in the next second. Around this time, I realized that if I want full freedom to express my ideas under current tech constraints, the best approach is still an image-to-video strategy.

### 4. Using Prompt Schedule for a Series of Continuous Actions

Prompt Schedule is a ComfyUI node developed soon after Animatediff was released. In theory, it works well. However, I initially failed to get results because the prompt syntax is extremely strict. If you don’t adhere closely to the format, no dynamic effects will appear.
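For reference, the keyframe syntax used by batch prompt scheduling nodes (for example the FizzNodes BatchPromptSchedule node) looks roughly like this, mapping frame numbers to prompts; small formatting slips such as a missing quote or a comma after the last entry are the usual reason no dynamic effect appears:

```
"0"  : "1girl standing, arms at her sides",
"24" : "1girl raising her right arm",
"48" : "1girl waving, smiling"
```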

### 5. Timeframe Feature

Timeframe is another node similar to SparseCtrl RGB. It lets you generate multiple images with progressive changes, then assign precise frame numbers to each. For example:

* Frames 0–6: eyes closed

* Frames 6–12: eyes open

* Frames 12–22: leaning forward

* Frames 22–32: head tilts slightly to one side

* Frames 32–40: leaning forward slightly

* Frames 40–46: a cute one-eye blink

Imagine how adorable that sequence might be. The workflow is more complex: ComfyUI’s text-to-image first yields 5 static images with extremely high consistency, each showing small motion differences. Then you import those 5 images into the Timeframe node, assign frame ranges, pair it with the ControlNet tile model, and apply Animatediff to maintain stable frames. Ultimately, they form a short animated clip.

After using these methods with Animatediff for a while, here are a few of my takeaways:

**Animatediff as an auxiliary AI video model**

When combined with StableDiffusion’s major text-to-image models, it can produce many beautiful videos. You can select virtually any style and add appropriate LoRAs for wonderful results.

**Shortcomings of SD 1.5**

However, video differs from still images. Although Animatediff V3 is powerful with SD 1.5, it can still be challenging to create elaborate, fluid motions.

**Animatediff XL**

Another route would be combining StableDiffusion XL with Animatediff XL. But the current Animatediff XL is quite underwhelming—there’s a harsh grainy effect, and the final video quality is lacking. If a refined or final version of Animatediff XL comes out to match the already mature SD XL, we’d see a big leap in clarity, freedom, and flexibility. Yet it appears the author may not continue updating it, so we can only hope.

---

## Other Image-to-Video Models in Later Stages

Because the Animatediff-based approach has various bottlenecks, I started experimenting with many other emerging video models: SVD, pyramid_flow, VideoCrafter, then more recently cogvideos, hunyuan video (which only provides a kind of hybrid image-to-video workflow so far—there’s talk of a new version in January), LTXvideos, and non-open-source solutions like runwayml, luma, keling, hailuo, and sora…

Conclusion: Right now, I’ve settled on LTXvideos as my go-to tool.

---

### SVD / pyramid_flow / VideoCrafter

These three, for me, didn’t even qualify as “brief moments of glory.” They can produce some interesting demo clips, but if you want to create a well-structured AI video with a clear concept, characters, or even a storyline, then these models aren’t flexible enough to meet detailed requirements.

---

### Cogvideos

I once considered it as a potential production tool. Its downsides: only horizontal (landscape) videos, limited action range, and it's pretty slow to generate. The recommended VRAM is around 22 GB. It might be okay for minor, screensaver-type dynamic videos, but it's not ideal for more involved creations.

---

### Hunyuan Video

Its text-to-video is very powerful, but currently its image-to-video only has an ipadapter-like style transfer solution (this description may not be 100% accurate). Essentially, ipadapter’s style transfer isn’t true image-to-image, so consistency is weaker. For instance, if you want to make a video of Dragon Ball’s Son Goku walking down the street, it’ll just “try” to incorporate that style, but you’ll never reach 100% accuracy—maybe not even 80%. So I rarely rely on it for generative AI video.

Additionally, Hunyuan is slow and consumes even more VRAM (22 GB is barely sufficient). Although there might be lighter variants, in practice they can't deliver the quality most creators need. Our goal is to use a fully capable model within our hardware limits.

---

## LTXvideos: My Go-To Production Tool

This model is relatively new; it received a meaningful update in December 2024. Its biggest advantage is fast generation speed. For example, on the same GPU (an L4 in my case; more on that later), generating a 1000+ resolution, ~40-frame video with cogvideos could take at least 20 minutes, whereas a 97-frame LTXvideos clip finishes in only about 1 minute!

Reflecting on the painful old days: converting live-action to anime-style took over 10 hours for one piece; even simpler scenarios like Animatediff + Prompt Schedule required over an hour; cogvideos still needed 20 minutes. Now, it’s down to about 1 minute, or ~3 minutes if I push the resolution above 1000. That’s a huge jump in efficiency. Truly a relief!

Another plus is it supports both portrait and landscape in many aspect ratios—less restrictive than cogvideos—and offers rich motion possibilities, which I really enjoy.

Its mechanism is similar to cogvideos, as both support genuine image-to-video generation. Technically, any image (even one downloaded from the web) can yield decent motion results. My usual workflow: a custom GPT generates prompts for 4 keyframes that follow a short storyline, the flux dev model creates those 4 images, and I feed them into LTXvideos' official image-to-video pipeline to animate them, which gives a cinematic result.
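That last animation step can also be reproduced outside ComfyUI; below is a minimal sketch with the diffusers LTX-Video image-to-video pipeline, where the keyframe file, prompt, resolution, and frame count are example values standing in for the flux dev outputs and GPT-written prompts mentioned above:

```python
# Minimal image-to-video sketch with LTX-Video through diffusers. The keyframe
# file and the prompt stand in for one of the four flux dev images and its
# GPT-written prompt; resolution and frame count are example values.
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

keyframe = load_image("keyframe_01.png")  # first keyframe of the 4-image storyline
video = pipe(
    image=keyframe,
    prompt="the woman turns toward the camera and smiles, cinematic lighting, slow dolly-in",
    negative_prompt="worst quality, blurry, jittery, distorted",
    width=768,
    height=512,
    num_frames=97,           # about 4 seconds at 24 fps
    num_inference_steps=50,
).frames[0]
export_to_video(video, "scene_01.mp4", fps=24)
```

---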

## Brief Comments on Closed-Source AI Video

Runwayml: Powerful, not free, fast. Sometimes has issues with consistency: the character might randomly switch from Asian to European. Its motion results seem a bit better than LTXvideos.

Keling: Also powerful, fairly consistent, fast, and roughly equal to LTXvideos in motion quality.

LUMA: Not as strong—slower generation, so-so consistency, and a higher chance of character distortion or motion twists.

Sora: Extremely strong. When it launched, I almost gave up on open-source AI video entirely. I believe Sora and DALL·E from the OpenAI family have unique mechanisms, offering holistic improvements, fewer issues with hands, and better overall prompt comprehension.

My take on closed-source models: Right now, they aren’t great options for freeform creative projects. Most casual users can only “try them out.” Since they’re paid and subject to platform policies, their creative freedom is limited—especially with Sora.

---

## Hardware & Software

I’m a heavy Colab (Pro+) user. I typically use an L4 GPU, which handles all the methods described above. I use a modified Colab version of ComfyUI that I found online. I also subscribe to ChatGPT, though for prompt generation alone, free versions might suffice.

I’m always open to exchanging ideas and insights.
