SDXL OC Training with Animagine

This is the fifth article detailing my discoveries with training an original character based mainly on synthetic data. Reading the prior articles is not required for this write-up, as the training method here is based on SDXL rather than SD1.5, where I have found some significant differences. I will mainly be discussing training on Animagine, along with some random notes regarding my prior SDXL experience.

As usual the general disclaimer:

Lowset LoRA / Single Image LoRA

This is a series where I post my personal findings with training a LoRA on a low or practically non-existent dataset to see what I can come up with. Future posts are most likely no longer single image, but I am still looking for a minimum-effort approach that retains as much detail as possible while maintaining a flexible LoRA. The overall goal is to make new consistent characters from generations.

As a general disclaimer, these findings may or may not work for you.

The prior articles can be found below:

  1. Part 1

  2. Part 2

  3. Part 3

  4. Part 4

Prior SDXL Experience

A general note that I am inexperienced with large datasets; my LoRA training studies have mostly been with SD1.5 and low datasets. I tried training with SDXL when it initially came out but got subpar results, where the LoRA was overtrained on style and was rather difficult to pose with. One particular benefit that I found at the time was that SDXL didn't overtrain on style to the point where the colors would become overly saturated, and changing clothes was still somewhat easy. Unfortunately, anime SDXL models didn't follow the NAI style prompts/deepbooru tagging, so I didn't really like using SDXL on top of the long training times.

Bizarrely, nudity was never a problem for me, since the early anime SDXL models would give nudity even when I didn't prompt for it. Ultimately, a lack of knowledge at the time mostly killed my interest in SDXL. I am trying SDXL again mostly because other people have written up their experiences with it and because Animagine is showing very promising results.

Building up the LoRA

In this section, I will explain how I built up the dataset to my final result. Unfortunately, it's not a very easy-to-read section if you're only interested in the augmentations or techniques used for the training.

Initial Dataset Preparation

I started off with a general character turnaround consisting of front, side, and back full-body views. The side and back views aren't extremely important; they're mostly there to tell the LoRA that the character can be rotated and to capture any other important details. Ideally, the resolution should be fairly high so that face closeups are not blurry. I initially started at 1224x1024, applied my desired changes to the character design (using editing and inpainting), and then upscaled to 2448x2048 using MultiDiffusion and Tiled ControlNet.

I would personally describe this character as having medium-level detail, where it would be difficult to get her look or outfit from normal prompts alone. The key points here are the triangular pattern on her capelet, a necktie, chest_straps, thigh_strap, trim around the coat, buttons, and a dress_top with a white shirt. A waist bow was added to check whether SDXL could handle asymmetry. (I unfortunately forgot to give the character a hat to see if it would be glued to the character.) I named her Kie Rose, so her related class_name was kierose.

Afterwards, I split the images into front, side, and back views. Front images were placed into a folder called 9_kierose and the back and side views were placed into a folder called 5_kierose. (The number represents the number of repeats for the images.) The rationale for this comes from my prior experience with SD1.5, where the side and back views didn't have too much of an impact on the final image. However, a back view with a high repeat count is somewhat dangerous, as it can force the LoRA to frequently generate back views. Captioning this viewpoint does not seem to solve the problem, but "from_behind" can be added to the negative prompt to help alleviate it.
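For reference, the kohya-style dataset layout at this stage looks roughly like the sketch below. The image and caption file names are placeholders, not my actual files; only the repeat-prefixed folder names match my setup:

train/
  9_kierose/            <- front full-body view, 9 repeats per epoch
    front.png
    front.txt
  5_kierose/            <- side and back views, 5 repeats per epoch
    side.png
    side.txt
    back.png
    back.txt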

Personally, I don't care too much about hands and feet since those can also be fixed by inpainting.

I am not sure what causes this problem, but my hot take: [LoRAs try to learn the newest concept, and because back views are rather infrequent in most artworks, the LoRA will overfocus on back views if they have too many repeats. Unconfirmed (will need to test later with loopback training)]

Captioning

Captioning was nothing particularly special, as the automatic tagging from kohyaSS was used. Hair color, eye_color, and keywords related to the chest were pruned. You can check out the training data on the OC model's page if you want the specifics. 'kierose' was used as the first keyword to act as a trigger word. The class name was the same as the trigger word, if that matters. (KohyaSS uses the folder_name as the class_name.) The name was combined into one word to avoid concept bleeding with 'rose.'
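As a rough example of what a caption file ends up looking like after pruning (the tags after the trigger word are illustrative, not my exact caption; note that hair and eye color are absent so the LoRA absorbs them):

kierose, 1girl, solo, capelet, necktie, white shirt, waist bow, full body, standing, white background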

This is one way SDXL is rather different from SD1.5: an SD1.5 LoRA is still usable without the trigger word, but here, without the trigger word, the LoRA does not seem to function.

Training Settings

Training was nothing special as well. I mostly used the default settings of KohyaSS to see how well AnimagineXL could perform without too much intervention. I used dim 4 based on the recommendation of this article and trained for 10 epochs.

edit: It's been a long time since I last updated kohyaSS, and I noticed that the default settings have shifted from my previous version. My current setup mimics my older SD1.5 setup regarding the learning rates: the learning rate is the same as the UNet LR and the text encoder LR is half of that value. It's a fairly common setup from older guides.

In addition, please set the caption extension to whatever your caption files end with (e.g., ".txt"). Kohya used to use .txt by default, but an update seems to have changed it.

{
  "LoRA_type": "Standard",
  "adaptive_noise_scale": 0,
  "additional_parameters": "",
  "block_alphas": "",
  "block_dims": "",
  "block_lr_zero_threshold": "",
  "bucket_no_upscale": true,
  "bucket_reso_steps": 64,
  "cache_latents": true,
  "cache_latents_to_disk": false,
  "caption_dropout_every_n_epochs": 0.0,
  "caption_dropout_rate": 0,
  "caption_extension": "",
  "clip_skip": 2,
  "color_aug": false,
  "conv_alpha": 1,
  "conv_block_alphas": "",
  "conv_block_dims": "",
  "conv_dim": 1,
  "decompose_both": false,
  "dim_from_weights": false,
  "down_lr_weight": "",
  "enable_bucket": true,
  "epoch": 10,
  "factor": -1,
  "flip_aug": false,
  "full_bf16": false,
  "full_fp16": false,
  "gradient_accumulation_steps": 1.0,
  "gradient_checkpointing": true,
  "keep_tokens": 2,
  "learning_rate": 0.0001,
  "logging_dir":,
  "lora_network_weights": "",
  "lr_scheduler": "cosine",
  "lr_scheduler_num_cycles": "",
  "lr_scheduler_power": "",
  "lr_warmup": 10,
  "max_bucket_reso": 2048,
  "max_data_loader_n_workers": "0",
  "max_resolution": "1024,1024",
  "max_timestep": 1000,
  "max_token_length": "75",
  "max_train_epochs": "",
  "mem_eff_attn": false,
  "mid_lr_weight": "",
  "min_bucket_reso": 256,
  "min_snr_gamma": 0,
  "min_timestep": 0,
  "mixed_precision": "fp16",
  "model_list": "custom",
  "module_dropout": 0,
  "multires_noise_discount": 0,
  "multires_noise_iterations": 0,
  "network_alpha": 2,
  "network_dim": 4,
  "network_dropout": 0,
  "no_token_padding": false,
  "noise_offset": 0,
  "noise_offset_type": "Original",
  "num_cpu_threads_per_process": 2,
  "optimizer": "AdamW8bit",
  "optimizer_args": "",
  "output_dir": ,
  "output_name": "kie_v1_xl",
  "persistent_data_loader_workers": false,
  "pretrained_model_name_or_path":,
  "prior_loss_weight": 1.0,
  "random_crop": false,
  "rank_dropout": 0,
  "reg_data_dir": "",
  "resume": "",
  "sample_every_n_epochs": 0,
  "sample_every_n_steps": 0,
  "sample_prompts": "",
  "sample_sampler": "euler_a",
  "save_every_n_epochs": 1,
  "save_every_n_steps": 0,
  "save_last_n_steps": 0,
  "save_last_n_steps_state": 0,
  "save_model_as": "safetensors",
  "save_precision": "fp16",
  "save_state": false,
  "scale_v_pred_loss_like_noise_pred": false,
  "scale_weight_norms": 0,
  "sdxl": true,
  "sdxl_cache_text_encoder_outputs": false,
  "sdxl_no_half_vae": true,
  "seed": "",
  "shuffle_caption": false,
  "stop_text_encoder_training": 0,
  "text_encoder_lr": 5e-05,
  "train_batch_size": 1,
  "train_data_dir": ,
  "train_on_input": true,
  "training_comment": "",
  "unet_lr": 0.0001,
  "unit": 1,
  "up_lr_weight": "",
  "use_cp": false,
  "use_wandb": false,
  "v2": false,
  "v_parameterization": false,
  "vae_batch_size": 0,
  "wandb_api_key": "",
  "weighted_captions": false,
  "xformers": true
}
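For anyone running sd-scripts from the command line instead of the kohyaSS GUI, the config above maps roughly to the command below. Paths are omitted on purpose, and the flag names are to the best of my knowledge, so double-check them against your sd-scripts version:

accelerate launch sdxl_train_network.py \
  --pretrained_model_name_or_path="<animagine checkpoint>" \
  --train_data_dir="<dataset folder>" --output_dir="<output folder>" --output_name="kie_v1_xl" \
  --network_module=networks.lora --network_dim=4 --network_alpha=2 \
  --learning_rate=1e-4 --unet_lr=1e-4 --text_encoder_lr=5e-5 \
  --lr_scheduler=cosine --optimizer_type=AdamW8bit \
  --resolution=1024,1024 --train_batch_size=1 --max_train_epochs=10 \
  --keep_tokens=2 --caption_extension=.txt \
  --mixed_precision=fp16 --save_precision=fp16 --save_every_n_epochs=1 --save_model_as=safetensors \
  --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 \
  --cache_latents --gradient_checkpointing --xformers --no_half_vae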

Training Results:

I tried out 10 different ideas to see what would work. I ended up using the 9th iteration, as the 10th didn't seem to show any improvement. This is to the best of my memory, as I didn't take very detailed notes.

Prompting was rather basic: I reused what was found by the auto-tagger, and the negative is just copied and pasted from my normal SD1.5 prompts. A couple of the textual inversions do not work according to the warning messages from Fooocus and ComfyUI, so that shouldn't really matter. I'm generally fairly lazy with my prompts. Emphasis was used on certain aspects that didn't come through well, such as the ribbon and dress top.

Iteration 0 Basic Baseline:

With any test, trying the default or dumbest possible configuration is important for establishing a baseline. For version 0, I trained with only the 3 views. The result was rather predictable: the LoRA did not pick up any details. Well, of course it wouldn't be that easy.

Iteration 1 Closeups:

One of the lessons that I learned from training SD1.5 is that you can add portrait, upper_body, and cowboy_shot variants to help the LoRA capture character details better. This was also mentioned in a different article by narugo1992 and the deepghs group. Their character splitter tool can be helpful for automating this process, or you can always just crop out the viewpoints manually.
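If you would rather do the crops by hand in bulk, a minimal sketch of the idea is below. The crop fractions and file names are placeholders I made up; the deepghs tooling handles this far more intelligently:

from PIL import Image

def make_camera_crops(path, out_prefix):
    # Cut portrait / upper_body / cowboy_shot variants out of a full-body render.
    # Crop heights are rough guesses; eyeball them against your own turnaround.
    img = Image.open(path)
    w, h = img.size
    crops = {
        "portrait": (0, 0, w, int(h * 0.30)),     # head and shoulders
        "upper_body": (0, 0, w, int(h * 0.55)),   # roughly waist up
        "cowboy_shot": (0, 0, w, int(h * 0.75)),  # roughly mid-thigh up
    }
    for name, box in crops.items():
        img.crop(box).save(f"{out_prefix}_{name}.png")

make_camera_crops("kierose_front_fullbody.png", "kierose_front")  # hypothetical file name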

The result:

It's actually fairly impressive, as SD1.5 would normally struggle with the triangular capelet design; I previously struggled to get even a double-striped pattern on a capelet. I still had to specify colors to get the outfit to match. Some further testing with this model was rather interesting: with AnimagineXL, I was still able to rotate the character and change the outfit. In SD1.5, manual regularization or pose training is usually required to achieve this, so AnimagineXL already reduces the need for a large dataset.

The outfit had some limitations: in order to change the character into a bikini, the following keywords were needed. Simply using bikini alone failed.

[color]_bikini,best_quality,masterpiece,collarbone,bare_shoulders 

It's very interesting how 'best_quality' and 'masterpiece' were frequently criticized as having no effect in SD1.5, yet with AnimagineXL there's a very noticeable impact on NSFW content. According to CagliostroLab's blog, this effect was actually unintentional, but it is interestingly helpful for my use case. If this were somehow fixed in the future, I don't think it should be a problem, as AnimagineXL already has a trained rating system for NSFW content.

I did test the rating system and found that it works, but I'm not posting those images here so as not to get the article labeled as mature.

In comparison, with SD1.5 some form of outfit augmentation is required; otherwise the clothes will always stick to the character as you try to prompt more revealing outfits.

The other notable problem with the LoRA is that the flat coloring of the character art bleeds into the background, making it overly flat. I'm not sure if this is due to the artstyle or white_background, so I will need to experiment.

I also ran an expressions test and found that they were somewhat stiff and hard to edit. Emphasis or a longer description of the facial expression somewhat worked but was not very pleasant to deal with. I didn't need expression augmentation with SD1.5, so I don't think I will need to run expression augmentation here...

Iteration 2 Outfit Augmentation:

Iteration 2 isn't that important; I wanted to see if I could get the LoRA to recognize a bikini without any bleeding or long prompts. To achieve this, I added bikini and underwear images of my character to the dataset. I placed the alternative costumes into a folder called 3_kierose, as I didn't want to place too much importance on the swimsuits. The back view in particular was added to a folder called 1_kierose as a precaution against back views.
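At this point the dataset folders look roughly like this (a simplified sketch; I didn't note down which folder the iteration 1 closeups went into, so see the dataset on the model page for the exact layout):

train/
  9_kierose/   <- front full-body view
  5_kierose/   <- side and back views
  3_kierose/   <- bikini / underwear alternative outfits
  1_kierose/   <- back view of the alternative outfits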

The end result was that I was able to use bikini as a short prompt rather easily, but I did notice a color bias towards what was in the dataset. The ratio of swimsuit to general-outfit images at this point is 1:1, or 50%. This isn't important now, but it's something to note, as I ran into another problem down the line.

Ultimately, the takeaway is that outfit augmentation seems to be more of a quality-of-life improvement for prompting rather than a strict requirement. Personally, I lean more toward recommending outfit augmentation rather than abandoning it.

Note: At this point, I also decided to check how the epochs handled style bleeding, and I noticed that by epoch 2, style bleeding starts to set in, but full character details are not captured until epoch 7.

Can't upload the image due to size restrictions

Iteration 3 More Closeups:

The LoRA was still having some difficulty getting the chest and waist details correct, so I added some headless closeups of the chest and waist area. This did help make the details more accurate and got the ribbon to appear more often, but it occasionally caused head_out_of_frame compositions to appear more frequently. Adding 'head_out_of_frame' to the negative did help resolve the problem, which tells me that I shouldn't rely on headless shots too often. Style bleeding did increase, and it became rather difficult to use style keywords to change the artwork style.
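As a hedged example, the relevant additions to the negative prompt at this stage look something like the line below (the quality tags are generic filler, not my actual SD1.5-derived negative):

lowres, worst quality, low quality, head_out_of_frame, from_behind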

The LoRA at this point is technically deliverable, since this iteration has the highest accuracy but the lowest style flexibility. The flat background can probably be worked around using LayeredDiffusion. (I still need to do more experiments with LayeredDiffusion.)

Iteration 4: Style Training

Iteration 4-a: Style Training

In this iteration, I added pose images from my other character's regularization data to see if it would help resolve the flat background issue. In SD1.5, this dataset was required to help rotate the character. End result: No changes. This probably means that I can't reuse the SD1.5 approach for SDXL.

Iteration 4-b ControlNet Img2img Augmentation:

I used SD1.5 to create different style variants of the frontal cowboy_shot. I didn't make variants of the other views since I was lazy. ControlNet augmentation is something that I described in my 3rd article in the series, but there I used txt2img instead of img2img. The approach here was somewhat different (a rough sketch of these settings follows the list below).

  1. Regional Prompter was used to subdivide the regions to help retain the color regions. This is not perfect and is only meant to help keep the colors more stable.

  2. ControlNet Canny and LineArt were used

    1. ControlNet starting step set to 0.05 (to prevent ControlNet from having too much influence at the start; setting the value too high causes only an outline to show in the generation)

    2. ControlNet ending step set to 0.4~0.5 (to prevent ControlNet from affecting the style too much; 0.5 is fine. I did adjust it to around 0.4, but it didn't make too much of a difference)

  3. Steps set to 20-30 (may need to increase steps so that the model can apply more of its style change)

  4. Flat Color and Detail Tweaker slider LoRAs were used to help control the level of detail in the image. (I am not sure if this only works with anime_screencap or cel-shaded artworks, as they generally focus on simplicity.) Tweak strengths as needed, occasionally using 0.1, -0.2, or 0.8 strength.

  5. Denoising strength set to 0.5~0.7 (lower the denoising strength if the colors change too much and adjust the LoRA strengths instead. I started at 0.8, but since the colors started changing too much, I shifted the denoising strength down to 0.5 and then played with the LoRA strengths instead)

  6. Cherry-pick the best results when it comes to color and adjust settings as needed per checkpoint type. (Colors should match the original image; otherwise, the wrong colors might appear in generations.)
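For reference, here is a rough sketch of the same img2img + ControlNet pass driven through the A1111 webui API instead of the UI. I did all of this through the UI, so the payload key names (especially the ControlNet ones) and the model names are assumptions; check them against your webui and ControlNet extension versions:

import base64
import requests

def style_pass(image_path, prompt):
    # One img2img style-augmentation pass: keep the composition with Canny,
    # release ControlNet early so the new checkpoint's style can settle.
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "init_images": [img_b64],
        "prompt": prompt,               # include style tags and the Flat Color / detail LoRAs here
        "negative_prompt": "lowres, worst quality",
        "steps": 25,                    # 20-30; raise if the style isn't coming through
        "denoising_strength": 0.6,      # 0.5-0.7; lower it if the colors drift too far
        "alwayson_scripts": {
            "controlnet": {             # Regional Prompter was set up in the UI and isn't shown here
                "args": [{
                    "image": img_b64,
                    "module": "canny",              # LineArt also works
                    "model": "<your canny model>",  # placeholder
                    "weight": 1.0,
                    "guidance_start": 0.05,
                    "guidance_end": 0.5,
                }]
            }
        },
    }
    return requests.post("http://127.0.0.1:7860/sdapi/v1/img2img", json=payload).json()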

You can check out the dataset to see which images I picked. I kept the hair_color tag if the tagger picked up a color different from red. (Images were tossed into a 7_kierose folder; I don't exactly remember which repeat folder I placed them in, but I still kept the main frontal view by itself.)

End Result:

Background style bleeding is resolved. The style can be somewhat edited now, but it is still rather difficult.

There is still some style bleeding involved, since a semi-realistic style tends to be glossier. A CFG of 4 was occasionally needed to produce style shifts. Expressions seemed to loosen up at this point, but I'm not exactly certain.

Iteration 5 Lower Repeats

Shifted the repeats down to 1, 2, 3, and 5 to see if that would reduce background style bleeding.

End Result: No changes, but faster training time. Very likely that I do not need as many repeats as I originally expected.

Iteration 6 Captioning Order:

I tried to imitate AnimagineXL's tagging order to see if the style could be peeled off.

End Result: No changes

Iteration 7 LR Adjustment:

Reduced the learning rate to see if that would stop the bleeding. End Result: Undertrained.

Iteration 8 Block Weight Adjustment:

Adjusted block weights to see if this could slow down learning.

End Result: No change.

Iteration 9 Simplified Repeats:

At this point, I was struggling to find ways to minimize the style bleeding from the original source, so I decided to stop separating images into repeat groups: I placed the back view into a folder by itself and threw everything else into a 3_kierose folder. Hopefully, with the original source at the same repeats as the augments, the style bleeding would weaken.

End Result:

Semi-realistic styles are really glossy now, and the hints of style bleeding are significantly reduced. One downside is that the character details are no longer as accurate as in the original.

Using the original style keyword still helps bring out the original coloring; in this case it's anime_screencap.

At this point, it's far more flexible than any SD1.5 LoRA I have ever made. One weird caveat is that lowering the weight below 1 makes it look rather undertrained, but a lower weight can be helpful for running a style shift.
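In practice that means keeping the LoRA at full weight for likeness and only dropping it when deliberately chasing a style shift, e.g. (surrounding tags are placeholders):

<lora:kie_v1_xl:1.0>, kierose, 1girl, anime screencap, ...                <- likeness pass
<lora:kie_v1_xl:0.7>, kierose, 1girl, <style LoRA / style tags here>, ...  <- style-shift pass, expect weaker details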

I also ran an expression check, and expressions seem easier to work with now, but it seems like they want more detailed prompts. (Not completely sure on this.)

I decided to check for any odd regressions and found that I was no longer able to easily change my character into a swimsuit with a single keyword.

Iteration 10: More Skimpy Outfits:

I decided to add more underwear and bikini shots to see if that would help resolve the issue. End Result: Didn't seem to help much. I decided to stop here, as Iteration 9 was good enough for my purposes, and I somehow ended up with the wrong hair color. There were some cases where it was incredibly difficult to remove the outfit and some cases where it came off easily. I wonder if there's a certain ratio needed to more easily strip outfits or if the environment also matters.

Other Observations

Expressions

Expressions seem to be more flexible after the img2img augmentation. The facial outline is rather overtrained and is visible in a lot of images. I believe I will need some form of loopback training to resolve this problem.

Outfit Swapping

Oddly enough, it is easier to strip the character completely than to put them into a swimsuit. I speculate that swimsuits have a strong overlap with other outfits in the latent space, rather than with nudity, which causes some difficulty with swimsuit outfits. Likeness is overall very good for other outfits. Refer to Animagine's model page on how to handle NSFW prompting. Overall, I would personally say that outfit augmentation is no longer required but still recommended.

Concept Bleeding:

The capelet has concept bleeding onto certain dress types; the capelet design notably bled into a Victorian dress outfit.

I will probably need to test adding more outfits with different dresses to help with separation. Some other dresses worked better when used with a longer and more descriptive prompt.

I didn't seem to run into the arms_behind_back and back_view concept bleed issue like I did with my SD1.5 Enna LoRA. Posing seems fine, but there is some bleeding onto 'dress'.

Pose

The LoRA defaults to a cowboy shot. This is more of me being lazy and not adding variety to the augmentation. I will probably need to experiment with how much variety is required.

Style Bleeding:

Still exists but is somewhat manageable, and the LoRA is still somewhat compatible with SDXL style LoRAs.

Asymmetry

SDXL doesn't seem to understand asymmetry; the ribbon can appear on both sides.

Prompt Adherence

Works pretty well, but since the front view is overrepresented in the dataset, the LoRA has a tendency to show a frontal view when attempting a side or back view.

Dataset Statistics

  • Total Images: 37 (supposed to be 38, but I accidentally deleted a bikini full_body image when I was shuffling the repeats around)

  • 14 Cowboy Style Augmentations

  • 3 Outfit Augmentations (Bikini, Underwear)

Augmentations:

  • Camera Augmentation (Full_body,upper_body,cowboy,portrait) (Still Relevant)

  • Outfit Augmentation (Recommended)

  • Expression Augmentation (Not needed (?))

  • Style Augmentation (Still Relevant; Directly training for style using captions doesn't seem to be necessary but will require testing; I will get around to this if I am not feeling lazy)

  • Pose Augmentation (Not needed)

Undertraining

Just a list of things that I noticed were undertrained:

  • Belt color

  • Waist Ribbon

  • Necktie pattern

  • LoRA adds a pattern to the wrist cuffs

  • Triangular pattern on capelet

These issues are more emphasized when trying to change the artstyle.

Style Workarounds with Img2Img

There are a couple of methods for using img2img to get around style bleeding, but I believe that should be a separate article, and I haven't found img2img to be a completely perfect tool for handling style transfers. I have no idea if I will be able to write a proper article on img2img techniques.

PonyDiffusion

PonyDiffusion is another very popular SDXL model. Unfortunately, after following some guides on training with PonyDiffusion, I ended up with a LoRA that significantly underperforms the AnimagineXL one. Posing was stiff and image quality was very low. It seems like PonyDiffusion requires a different approach, or at least more variety in the dataset. I did find that PonyDiffusion was able to produce NSFW content more easily compared to AnimagineXL. I will try to research this issue when I have the time.

I have since made some progress with PonyDiffusion and will share my experiences at a later time. It has its own set of benefits and problems compared to AnimagineXL. The write-up won't be as long as this article, as I just reused the dataset. My main interest with PonyDiffusion is more for workflow extensibility with increased LoRA variety rather than base model comparisons.

SD1.5 Considerations

Is SD1.5 obsolete?

I am still using SD1.5 for style transfer, since the SD1.5 ecosystem is still richer in terms of style variety. An SD1.5 LoRA is still usable but would be repurposed for keeping the character stable during faster img2img or inpainting. Training an SD1.5 LoRA crudely in this manner works, since it would mainly be used at strengths below 0.5. The SDXL LoRA would do the bulk of the work, since SDXL has better prompt comprehension, and the SD1.5 LoRA would be used for final post-processing.

What about using a checkpoint of the same artstyle?

Unfortunately, this did not work out for me and resulted in overtraining and overly saturated colors.

Lycoris Training

I haven't tried Lycoris with SDXL yet, but Lycoris on SD1.5 had very disappointing results, with low likeness and a lot of color saturation.

As a refiner

SD1.5 is still usable as a refiner, although there was some point in the dataset where the LoRA was incompatible. Fooocus unfortunately doesn't save metadata, so I don't know which iteration before the 9th failed. It's swappable at around 0.5~0.667 of the steps (e.g., at 30 sampling steps, handing off around step 15-20) and seems to work better towards the latter end of that range.

Summary of Findings:

AnimagineXL is an incredibly impressive training model and far exceeds what I was able to do with NAI SD1.5 in the past. Some augmentation techniques, such as multiple camera viewpoints and img2img ControlNet style augmentation, still seem to be needed to allow more flexibility and improve LoRA quality. Outfit augmentation is something that I recommend to avoid concept bleed, but it is no longer strictly necessary if you just want to put your OC in a swimsuit. This is still a work in progress, so some underfitting and overfitting is to be expected; after some more optimizations, I may post an update.

Any corrections or tips would be helpful. I'm not the brightest person in the world when it comes to training LoRAs, and this was something I discovered more by chance than anything else. Thanks for reading; hopefully this helps somebody out. Have fun with your OCs!

Special Thanks to:

  • CagliostroLab for AnimagineXL3.0

  • Chenkin, narugo1992 and rerorerorero for related training articles

Changelog:

-Minor typos and grammar mistakes

-3/12/2024: Update to training settings and notes on PonyDiffusion. Very likely that I will need to write a separate article for PonyDiffusion. It is unlikely to be as extensive as this article.
