
How to make a modular AI Influencer/Character for Wan Video


Sep 6, 2025


Hi! This is part of my Patreon tutorial but I am going to paste this here before I finish the video tutorial :)

Remember, I'm the first one that did this! LOL! 😂😂😂

Tutorial: https://www.patreon.com/posts/modular-ai-toon-138307733?utm_medium=clipboard_copy&utm_source=copyLink&utm_campaign=postshare_creator&utm_content=join_link

Epic Kohya JSON you'll need (thanks to user The_Last_Goblin_King for the reminder):
https://mega.nz/file/QAgTUZ6Y#K533KW5KEAcWYnYFJsqZUYRIEMG5tpmGdqMgmsIkoF0
_________________________________

AI Influencer Tutorial Table of Contents

  • How to gather or create a dataset

  • How to train the preliminary Pony (SDXL) model (which you will use for the reference images)

  • How to generate images with your preliminary ‘image’ model

  • How to sort your generated images for usage

  • How to edit your generated images to create 4 modular datasets of the face, body, upper body, and lower body to create your ‘parts’ library for later characters.

  • How to caption your dataset so that the resulting models work correctly.

  • How to train your new AI model with all 4 datasets using your own hardware (offline) or using a cloud GPU (online)

  • Turning your images into videos to train on using AI image-to-video, or recording your own videos as your AI influencer

  • How to use your AI video model to create high-quality videos with consistent face/body.

Welcome to the AI Influencer/Toon To Real Life AI Video Training Workflow

I know it sounds like a mouthful, but this is pretty easy if you just follow the main steps. This tutorial will teach you how to gather or generate a selection of images, turn them into a realistic-looking AI person, and then create an AI video LoRA from them to use with state-of-the-art consumer-hardware AI apps like Wan 2.2. Then you can prompt your AI Influencer to do whatever you want in a video with just a few sentences!


Difficulty Level

Intermediate/Pro - This tutorial is not easy and not for beginners, but it is explained as simply as possible.


Requirements

This tutorial assumes you already know how to download and install Stable Diffusion, Wan Video, ComfyUI, and Kohya_SS. You will need those 4 things to get started. I am doing this with 96GB RAM and an RTX 3090. You could also use a basic computer and do all of this on paid online cloud GPUs, but expect that to cost around $10 for the entire process, including testing, over the course of 5-10 hours of training time (with the PC otherwise idle).


Definitions:

LoRA - Low-Rank Adaptation Model. This is a smaller version of a model which gives the AI all the information it needs about a character and how to ‘draw’ or ‘render’ them. Most AI apps don’t know what your character looks like even if you describe or name them, so we’ll use images that we create, or from the internet, to teach it the concept of our character quickly.


Dataset - A group of pictures with or without captions which make up all the information or ‘data’ that the AI will learn from.


Captioning - Writing text to help describe or ‘tag’ the images with information that helps the AI figure out what is going on in the image with human help.


Offline - Not connected to, and not needing the internet.

Modular Character - A character that is made in multiple parts which users can use individually.



Apps Used (you can also run them online at MimicPC.com if your PC is weak):

  • Stable Diffusion (with a realistic Pony Diffusion checkpoint)

  • ComfyUI (for Wan Video)

  • Kohya_SS

  • AI Toolkit

  • Taggui

You can find the above apps as easy-installation 'one-click' files by the user 'Aitrepreneur'. He makes easy-to-use executables which install everything. Go support him!



Overview

  1. Download or create the model

  2. Use it to generate realistic images using Stable Diffusion

  3. Caption (or don’t) the images

  4. Separate them into different datasets and crop them

  5. Train each dataset with AI Toolkit to get 4 different models, one for each 'part', then train the whole character.




Dataset Creation

This section will walk you through selecting the right kind of dataset (images) for your model.


For the first step, we will be training a Low-Rank Adaptation (LoRA) model on SDXL's Pony Diffusion model, and I will teach you how. This will let us make images of a realistic version of the character to train Wan on.


First, figure out what kind of character you want to make. If the character’s model already exists online, you can try downloading it from a page like Civitai and then skip to the step for generating your images.
For this example, I have supplied you with the ‘TFM Game Girl’ character dataset. Add it all into one folder.


TFM Game Girl Dataset Creation

I personally sculpted and hand-made a character in 3D, then took around 1,000 virtual photographs of her in different outfits, poses, and hair styles with various backgrounds. I trained a LoRA of her in Pony, then used that LoRA with ControlNet to generate realistic versions of her original artwork. I used this to train TFM Game Girl 7. Try out my method with your own art!


We will be using a similar method in this tutorial to create ‘realistic’ versions of toon/game characters in Wan 2.1 and 2.2.




Captioning
You do not ALWAYS have to caption your dataset, but it helps give the model flexibility. If your character always appears holding a purse, you may have trouble getting the model to 'forget' the purse in some of the generations, so it's best to caption the purse in those images.

Use 'Taggui' to hand-caption all of your images at once. Hold down the 'Ctrl' key and click individual images to select them. For example, I will select all images where the character is holding a purse, then begin to 'tag' or 'caption' them.


Taggui Download:

https://github.com/jhc13/taggui
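
If you'd rather script the repetitive part, here is a minimal Python sketch of the same idea: append one tag to the caption files of a hand-picked set of images. The folder name, file names, and tag here are all just examples, not part of the tutorial's files.

from pathlib import Path

# Example values -- swap in your own dataset folder, tag, and selection.
DATASET = Path("dataset")
TAG = "holding purse"
SELECTED = ["img_004.png", "img_017.png"]  # the images you picked in Taggui/Explorer

for name in SELECTED:
    txt = DATASET / (Path(name).stem + ".txt")  # caption file next to the image
    existing = txt.read_text().strip() if txt.exists() else ""
    tags = [t.strip() for t in existing.split(",") if t.strip()]
    if TAG not in tags:
        tags.append(TAG)  # add the tag only if it isn't there already
    txt.write_text(", ".join(tags))
    print(f"{txt.name}: {', '.join(tags)}")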


Tips:

If the view is focused on the character's face, chest, backside, upper body, etc., I recommend using those tags:

  • Portrait view

  • Chest focus

  • Ass focus

  • Upper body view

  • Etc


Caption anything that you want to be able to change, e.g. the hair style or color.

For some datasets I do not caption anything at all, and it's fine. It's worth a try to simply not caption at all; if the result doesn't come out right or is too similar to the original dataset, then re-train with captions. This goes for your Pony model and your Wan model.


Dataset Recommendations:

My recommendation is to have close, far, and mid-range images from the front, side, and behind. For LoRAs that need specific detail on body types, do close-ups on different body parts (e.g. the chest) so that your future Wan model knows well enough what they look like. Make sure to get good shots of the face from all angles, and especially of the hair styles, so that Wan does not do any guesswork and make up details on the hair! 20 images is fine, but for the most variety, I recommend 50 images. For the ultimate character, 200+ images is how many I used for TFM Game Girl.
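
As one possible split for a 50-image set (purely an illustration, adjust it to your character):

  • ~15 face shots from the front, side, and behind, covering each hair style

  • ~15 upper-body shots, including chest close-ups

  • ~10 lower-body shots

  • ~10 full-body shots at mid and far range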


Once you have all of the images and finish captioning, this step is complete.

Training Pony Diffusion


Once you have enough images of your subject, and maybe some captions, go into the Kohya_SS GUI and use my template to begin training. Load the JSON file and it will have everything set up for you. All you have to do is set a name and an optional training comment.

I recommend naming the model something that won't trick the model into thinking you are prompting. E.g. RedHairFatMan is not good because it includes 'man, red, fat', which may cause the model to automatically trigger those things. Maybe 'RHFM_Char_PD' works. The PD is for Pony Diffusion in case you put the model online… that makes it easier for other users (and yourself) to sort, especially for those of us with thousands of models!
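
For context, the JSON is just a saved snapshot of the Kohya_SS GUI fields, so you can open it in a text editor to sanity-check it. A stripped-down illustration (made-up values, not the actual settings from my template) looks something like this:

{
  "LoRA_type": "Standard",
  "pretrained_model_name_or_path": "path/to/ponyDiffusionV6XL.safetensors",
  "output_name": "RHFM_Char_PD",
  "training_comment": "modular AI influencer character",
  "train_batch_size": 1,
  "epoch": 10,
  "network_dim": 32,
  "network_alpha": 16,
  "learning_rate": 0.0001
}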


Go into ‘dataset preparation’ by clicking the dropdown.

Add in the location of your images, and select a different folder (one NOT inside of the image folder) as the destination for the prepared files. Set the model name to the same as what you have up top. For the object class, put 'woman' for women and 'man' for men. It should be simple enough in most cases; if you have a character who is an 'it', then try 'person' or 'character'.


Now, click the 'prepare dataset' button. For very large datasets this may take a while; check the Python console to see when it has finished. Once it has, hit the 'copy' button to copy the file locations up top. You're ready to train!
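
For reference, the prepare step lays things out roughly like this (the '40' repeat count and the names are only examples; yours will match what you typed into the GUI):

destination/
  img/
    40_RHFM_Char_PD woman/   <- your copied images (and captions); 40 = repeats per epoch
  log/
  model/

The 'copy' button then fills the folder fields up top with these paths.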

As a hint, you can increase the batch size on stronger graphics cards. An RTX 3090 should be able to do a batch size of 4 with no issues, decreasing the training time.
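
The speedup comes from how Kohya counts steps. Roughly (illustrative numbers, with the repeat count coming from the preparation step above):

total steps ≈ (images × repeats × epochs) ÷ batch size
e.g. 50 images × 40 repeats × 1 epoch = 2,000 image passes → 500 steps at batch size 4 instead of 2,000 at batch size 1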


Simply wait for the LoRA to finish training, go to the output folder that you have set, and move the file into wherever your Pony/Stable Diffusion LoRAs are. 


With that, you are done training your Pony LoRA for your realistic character!

Use your new LoRA to generate images using a realistic Pony Diffusion model.

Check either of these two models out, they are my recommendations:

https://civitai.com/models/153568?modelVersionId=2129811

https://civitai.com/models/443821?modelVersionId=2065444


 How to Edit Images for Modular Characters

Now that your dataset is complete, you will have to use an image editor. The best one, in my opinion, is the Windows 11 MS Paint tool. If you have 2 screens, put Paint on the left and your images folder on the right. On your 2nd screen, put up Taggui for easy tagging/captioning.


What you will do is create a copy of your ‘realistic’ dataset. Copy it into 1 folder, then duplicate that folder 4 times. You will crop the images in the duplicated folders and keep one ‘original’ without cropping it.


One will be ‘upper body’, one will be ‘lower body’, the third one will be ‘body only’ and the last one will be ‘face’.

For 'upper body', crop out the face and lower body in every picture in the folder. Do the same for the other folders, cropping out the parts that won't be in that dataset, using your image editor of choice.
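
If you want to save a few clicks, here is a minimal Python sketch for the duplication step (folder names are just examples in the tutorial's naming style):

import shutil
from pathlib import Path

SRC = Path("TFM_Game_Girl_Full")  # the uncropped 'original' dataset
for part in ["Face", "Upper_Body", "Lower_Body", "Body"]:
    dst = SRC.parent / f"TFM_Game_Girl_{part}"
    shutil.copytree(SRC, dst, dirs_exist_ok=True)  # duplicate the whole folder
    print(f"copied -> {dst}")
# Now open each copy and crop every image down to just its 'part'.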


Once your 4 datasets are prepared, we will take them into AI Toolkit to train a full character with Wan. If you want to caption them, do so now.


Helpful captions so that your resulting training data isn't locked into bad camera views (see the script sketch after this list):


  • Full body: None

  • Head: Portrait view

  • Body: Head out of frame

  • Lower body: Lower body focus, head out of frame

  • Upper body: Upper body focus, head out of frame
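
Since each modular set shares one caption, you can stamp them in bulk. A minimal Python sketch of that idea, using the list above (folder names are examples):

from pathlib import Path

CAPTIONS = {
    "TFM_Game_Girl_Full": "",  # full body: no caption
    "TFM_Game_Girl_Face": "portrait view",
    "TFM_Game_Girl_Body": "head out of frame",
    "TFM_Game_Girl_Lower_Body": "lower body focus, head out of frame",
    "TFM_Game_Girl_Upper_Body": "upper body focus, head out of frame",
}

for folder, caption in CAPTIONS.items():
    if not caption:
        continue  # leave the full-body set uncaptioned
    for img in Path(folder).iterdir():
        if img.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
            img.with_suffix(".txt").write_text(caption)  # one .txt per image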


Once you are ready, you will move to AI Toolkit to begin training your models one at a time for Wan Video. I will provide instructions for training on Wan 2.1 as well as Wan 2.2, it’s mostly the same though.

Training in AI Toolkit for Wan 2.1 and Wan 2.2 (Local)


Training in Wan 2.1 or 2.2 is mostly the same, but Wan 2.2 requires you to train both a 'low noise' and a 'high noise' LoRA. Some tutorials (Aitrepreneur) recommend training both at the same time, but when I tested that on my 3090 it took 24 hours, versus about 5 hours each at 6,000 steps when training them one at a time.

I trained the same data at 2,000-3,000 steps in under 2.5 hours and the results were honestly just as good for our purposes, so I recommend 3,000 steps for characters of 250 images or fewer.


TRAINING OFFLINE


  • Download all of this here and put it in a folder that we will call ‘Wan22Files’ (with no spaces). https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B/tree/main/low_noise_model

  • You will need to point to this folder location for offline training later (see the download sketch below). You must initiate the training ONLINE at least one time to get the required folders into your cache; then you can train offline from that point onwards.
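
If you prefer to script the download, here is a small sketch using the huggingface_hub library (folder name taken from the step above; repeat with 'high_noise_model/*' since Wan 2.2 needs both):

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wan-AI/Wan2.2-T2V-A14B",
    allow_patterns=["low_noise_model/*"],  # just the subfolder linked above
    local_dir="Wan22Files",                # no spaces, as noted above
)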




My recommendations:

  1. Put your images for your first dataset in, and name it in a way that makes sense. Here is how I recommend naming each dataset, using Game Girl as an example:

  • TFM Game Girl Full

  • TFM Game Girl Face

  • TFM Game Girl Body

  • TFM Game Girl Upper Body

  • TFM Game Girl Lower Body


In Wan 2.2, the models automatically get 'high-noise' added to the name at the end of training, so you don't have to specify it in the model name, but you will have to name each one differently or AI Toolkit won't let you proceed. I recommend naming them 'ModelName_H' and 'ModelName_L'.

Wan 2.2


2) Go into ‘Create New Job’ and do the following:

  1. Name it

  2. Select Wan 2.2 in the models list

  3. Uncheck ‘low noise’ for your first model.

  4. If you are offline, copy and paste the location of your diffusers folder ('Wan22Files') into the list; otherwise, if you are online, leave it as it is. It will connect to the internet, download the required models, and work as normal.

  5. Set it to 2000 or 3000 steps

  6. Under 'Target', set 'linear rank' to 16 (this keeps the file size lower)

  7. Under ‘Training’ set ‘Cache Text Embeddings’ to ‘on’.

  8. Under ‘Datasets’ select your chosen dataset, and set the ‘resolution’ to just 512 and leave the rest off.

  9. Set learning rate to ‘0.0002’ which will give you a more accurate result.

  10. Under ‘Sampling’ set ‘Skip first sample’ and ‘Disable sampling’. Leave everything else off. 

  11. Select ‘Create Job’ in the top right corner.

  12. Run your training by hitting the ‘play’ arrow in the top right corner.

  13. With these settings, on a 3090 it takes around 10 minutes to initialize and 1hr 15min to fully run.

  14. Repeat with a second job using the exact same settings but with the model set to 'low noise' only. You will need both the low- and high-noise models to use Wan 2.2.

  15. With this, in less than 3 hours both of your models are trained. Move them out of the ‘Outputs’ folder and into your ComfyUI or Wan 2.2 LoRAs folder and give them a try!
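
A small sketch for that last move, assuming a default ComfyUI install (both paths are examples; adjust them to where your apps actually live):

import shutil
from pathlib import Path

OUTPUTS = Path("ai-toolkit/output")    # AI Toolkit's 'Outputs' folder
LORAS = Path("ComfyUI/models/loras")   # default ComfyUI LoRA folder

for f in OUTPUTS.rglob("*.safetensors"):
    shutil.copy(f, LORAS / f.name)     # copy each trained LoRA over
    print(f"copied {f.name}")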

Training in AI Toolkit for Wan 2.1 and Wan 2.2 (Online/Cloud)


Use AI Toolkit via MimicPC.com. It costs $1-$2 per hour depending on hardware availability.



  1. Go to MimicPC.com and log in/register.

  2. Click ‘Add New App’ in the top left.

  3. Ctrl + F to search for AI Toolkit

  4. Follow the same instructions as the offline settings; if you use a more powerful cloud GPU, the training may finish much faster than on my 3090. As of 9/6/2025, the privacy policy shows that ALL of your data stays private, so nobody can see what you have uploaded, but contact their support if you have an issue.


Testing Your Model

I found that with these settings, a LoRA strength of 1.2 works just fine.

Try prompting the character by specifying hair style, eye color, skin tone, and everything else that matches the training images, and see if the results are good. For tests, you can render just 9 frames to see if it LOOKS correct. Move up to 33 frames of video once you feel confident in the model, and when you're ready, you can make longer videos.
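
For example, a first test prompt might look like this (a made-up character; swap in your own trigger word and traits):

'RHFM_Char_PD woman, long red hair, green eyes, fair skin, standing in a park, turns toward the camera and smiles' -- rendered at 9 frames first, then 33 once it looks right.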




Using Kling AI to Make High-Quality Videos

You can take your training data and insert it into Kling.AI or another service (Hailuo, etc.) to make animated videos to train on. Or you can use those videos as a base with VACE, using an image of your new character to 'overlay' onto the new video.

