Anima LoRA training settings for sd-scripts

Introduction

This is not meant to be a full guide to training a LoRA with anima. I just want to share the basic settings I use for anima training, along with some additional information about datasets and tagging. I am not claiming this is the best way to train an anima LoRA or that it will work with every dataset, but these settings have worked reasonably well with every dataset I have used so far (see the training examples at the end of the article). Occasionally there is the usual trial and error, retraining or training further, that you encounter when training any LoRA, but for me it has generally worked better than with illustrious so far. Other than anima I only really have experience training LoRAs with illustrious, so that is all I can compare it with. sd-scripts is a script-based training tool for various models with a lot of configuration options. For installation please refer to the sd-scripts GitHub page. It runs on Windows and Linux and is the engine behind the kohya training tools on civitai.

Datasets

For LoRAs that have both an illustrious and an anima version, I used the same dataset and tagging and had no problems training the anima version.
I often use anime screencaps for training character LoRAs, sometimes mixed with additional artwork for better style diversity. For simple character LoRAs I usually have 20-30 images, or a few more when using a mix of screencaps and other artwork. For multi-outfit LoRAs I normally used 10-20 images (if available) per outfit with illustrious, and I have done some experiments with that using anima.
For style LoRAs I have been using 60-150 images. For concepts it depends on the dataset available, but I usually try to go for 20+ images as well.

Captioning

I use the same danbooru-style tagging for anima training as I do for training illustrious.
For that I run an auto-tagger model for a first pass and make manual adjustments in a second pass. I use WD ViT-Large Tagger v3 for auto-tagging. It's a bit outdated, but it still gives good results for danbooru tagging overall and it runs locally on just about any machine (no GPU required).
After that I add trigger words and outfit tags manually, while removing false tags and pruning redundant tags. For running the auto-tagger model I use TagGUI, and for the manual rework I use BooruDatasetTagManager, since it has better functionality for that than TagGUI. With anima I started to use the tags "anime screenshot" and "anime coloring" for tagging screencaps, since anima can contain the screencap style well within these tags.

Dataset Config

For training a LoRA with sd-scripts you need a dataset config file saved as .toml. A simple config file looks like this:

[general]
flip_aug = false
color_aug = false
resolution = [1024, 1024]

[[datasets]]
batch_size = 1
enable_bucket = false
bucket_reso_steps = 64
max_bucket_reso = 2048
min_bucket_reso = 256
bucket_no_upscale = true
caption_extension = ".txt"
shuffle_caption = true
keep_tokens = 2
caption_tag_dropout_rate = 0.1

[[datasets.subsets]]
image_dir = 'C:\path\to\image\directory\NamiSyrup_v3'
class_tokens = 'nami'

These are just general dataset settings. I'll explain them below:

flip_aug: This artificially doubles the training images by flipping them horizontally. Only set to true if you have very few images and no asymmetrical character traits.

color_aug: Similar to flip_aug, but with changing colors. Should always be disabled unless you have a good reason for it.

resolution: This is the maximum resolution the LoRA is trained with, i.e. with [1024, 1024] images are trained at ~1MP resolution. I recommend this setting, but for character LoRAs you may use [768, 768] without too much damage if you are short on VRAM. Decreasing the training resolution may cause problems when using the LoRA to generate high-resolution images, but it decreases VRAM usage and speeds up training.

batch_size: Number of images that are trained in parallel. I recommend keeping this at 1, since higher values increase VRAM usage by a lot. Increase this only in powers of 2, i.e. 2, 4, 8, 16. Increasing it will speed up training if you have sufficient VRAM, but may degrade output quality. Changing this is only really recommended if you are finetuning a checkpoint on a GPU with a lot of VRAM.

enable_bucket: Should normally be set to true. Creates buckets for images and crops images to the best-fitting bucket resolution. This increases training speed and aligns with the training of the base model for better compatibility. I only set this to false if all images in the dataset have the same resolution.

bucket_reso_steps, max_bucket_reso, min_bucket_reso: These parameters define the bucket dimensions. Should be left at the default values above.

bucket_no_upscale: Only set this to true if some images in the dataset have a lower resolution than the training resolution and you do not want them to be upscaled. If set to true, additional lower-resolution buckets are created for images smaller than the training resolution.
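
To make the bucket parameters a bit more concrete, here is a minimal Python sketch of how a list of bucket resolutions could be derived from resolution, min_bucket_reso, max_bucket_reso and bucket_reso_steps. It is a simplified illustration and not the actual sd-scripts code; the function names make_buckets and closest_bucket are made up for this example.

```python
import math

def make_buckets(resolution=(1024, 1024), min_reso=256, max_reso=2048, step=64):
    """Enumerate (width, height) buckets whose area stays within the training area."""
    max_area = resolution[0] * resolution[1]
    buckets = set()
    # Square bucket at the training resolution, rounded down to a multiple of step.
    side = int(math.sqrt(max_area)) // step * step
    buckets.add((side, side))
    width = min_reso
    while width <= max_reso:
        # Largest height (multiple of step) that keeps width * height <= max_area.
        height = min(max_reso, (max_area // width) // step * step)
        if height >= min_reso:
            buckets.add((width, height))
            buckets.add((height, width))  # same bucket in the other orientation
        width += step
    return sorted(buckets)

def closest_bucket(image_size, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's aspect ratio."""
    aspect = image_size[0] / image_size[1]
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))

buckets = make_buckets()
print(len(buckets), "buckets, e.g.", closest_bucket((1920, 1080), buckets))
```

The point of the sketch is that every bucket has roughly the same pixel count as the training resolution, only with a different aspect ratio, which is why a 16:9 screencap and a portrait artwork can be trained together.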

caption_extension: Extension used for the caption files.

shuffle_caption: Shuffles the tags in the image captions from one epoch to the next. This increases the flexibility of the LoRA.

keep_tokens: Number of tags at the beginning of the image caption that are not shuffled. If you have a trigger word, put it at the beginning of the caption and set this to at least 1 (I usually have this set to 2 for simple character LoRAs and 3 for multi-outfit LoRAs).

caption_tag_dropout_rate: Drops tags in the caption with the given rate, e.g. 0.1 randomly drops 10% of the tags and only trains with the remaining tags. Helps with generalization.

image_dir: Path to the directory containing the training images and caption files.

class_tokens: Used as the caption if there is no caption file for an image. If all images have caption files, this is redundant. If you have a trigger word, you should use it as the class token, just in case.
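
How shuffle_caption, keep_tokens and caption_tag_dropout_rate play together can be sketched in a few lines of Python. This is only a simplified illustration of the idea, not the sd-scripts implementation; the function augment_caption and the example caption are made up.

```python
import random

def augment_caption(caption, keep_tokens=2, dropout_rate=0.1, rng=random):
    """Return a shuffled caption with the first keep_tokens tags left in place."""
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    fixed, flexible = tags[:keep_tokens], tags[keep_tokens:]
    # Drop each flexible tag independently with probability dropout_rate.
    flexible = [tag for tag in flexible if rng.random() >= dropout_rate]
    # Shuffle only the flexible tags; trigger words at the front stay put.
    rng.shuffle(flexible)
    return ", ".join(fixed + flexible)

rng = random.Random(0)
caption = "nami, 1girl, orange hair, short hair, striped shirt, smile"
for epoch in range(3):
    print(epoch, augment_caption(caption, keep_tokens=2, dropout_rate=0.1, rng=rng))
```

With keep_tokens = 2 the caption always starts with "nami, 1girl", while the remaining tags change order and occasionally go missing, so the LoRA cannot rely on a fixed tag position.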

If you have multiple image subsets, a config file will look something like this:

[general]
flip_aug = true
color_aug = false
resolution = [1024, 1024]

[[datasets]]
batch_size = 1
enable_bucket = true
bucket_reso_steps = 64
max_bucket_reso = 2048
min_bucket_reso = 256
bucket_no_upscale = true
caption_extension = ".txt"
shuffle_caption = true
keep_tokens = 3
caption_tag_dropout_rate = 0.1

[[datasets.subsets]]
image_dir = 'C:\path\to\image\directory\MayPinkBikini'
class_tokens = 'may'

[[datasets.subsets]]
image_dir = 'C:\path\to\image\directory\MayGreenBikini'
class_tokens = 'may'

[[datasets.subsets]]
image_dir = 'C:\path\to\image\directory\MayFrilledBikini'
class_tokens = 'may'

[[datasets.subsets]]
image_dir = 'C:\path\to\image\directory\MayRedBikini'
class_tokens = 'may'
num_repeats = 3

[[datasets.subsets]]
image_dir = 'C:\path\to\image\directory\MayYellowBikini'
class_tokens = 'may'
num_repeats = 3

[[datasets.subsets]]
image_dir = 'C:\path\to\image\directory\MayCheckeredBikini'
class_tokens = 'may'
num_repeats = 4

This time I used a second trigger word for the outfit in each caption file, so I set keep_tokens to 3 (a caption looks like may, 1girl, grbk,…). I have now specified 6 dataset subsets for 6 different outfits.
An additional parameter here is num_repeats. It specifies how many times each image is trained on during one training epoch. Since I did not have enough images for each outfit, I set some subsets to values larger than 1 (1 is the default) to get an equal number of repeats for each outfit.
Parameters like num_repeats or keep_tokens can be set for the whole dataset and/or for subsets; if set for both, the subset value takes precedence. For more information about dataset config files check the sd-scripts wiki.
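
The effect of num_repeats on the number of training steps can be estimated with a short calculation. The following Python sketch assumes that one step processes batch_size images and that each image counts num_repeats times per epoch. The image counts and repeat values in the example are hypothetical (four outfits with 12 images, two with 4 and one with 3), and the helper steps_per_epoch is made up for this illustration.

```python
import math

def steps_per_epoch(subsets, batch_size=1):
    """subsets: list of (image_count, num_repeats) pairs."""
    samples = sum(images * repeats for images, repeats in subsets)
    return math.ceil(samples / batch_size)

# Hypothetical split: four outfits with 12 images, two with 4 and one with 3.
subsets = [(12, 1), (12, 1), (12, 1), (12, 1), (4, 3), (4, 3), (3, 4)]
per_epoch = steps_per_epoch(subsets)
print(per_epoch, "steps per epoch,", per_epoch * 75, "steps in 75 epochs")
```

With these values every outfit contributes 12 samples per epoch, so no outfit is under-represented. With bucketing and a batch size above 1 the real step count can come out slightly higher, since batches are formed per bucket.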

Command line arguments (Training Parameters)

A training run with sd-scripts is normally started from the command line. However, since there are a lot of training parameters to pass to the training script, I prepare a batch file and execute it to start the training. This way you can also specify several training runs in sequence, which comes in handy if you want to train several LoRAs without starting every run separately, e.g. when training overnight. A batch file on Windows is a text file saved with the .bat extension; on Linux the equivalent is a shell script with the .sh extension. A training batch file on Windows looks something like this:

call ./venv/Scripts/activate

accelerate launch --num_cpu_threads_per_process 1 anima_train_network.py ^
  --pretrained_model_name_or_path="C:\path\to\base\model\anima_baseV10.safetensors" ^
  --qwen3="C:\path\to\text\encoder\qwen_3_06b_base.safetensors" ^
  --vae="C:\path\to\VAE\qwen_image_vae.safetensors" ^
  --dataset_config="C:\path\to\dataset\config\file\panchybriefs_v1.toml" ^
  --output_dir="./output/Anima/PanchyBriefs_anibase_v1" ^
  --output_name="PanchyBriefs_anibase_v1" ^
  --log_with=tensorboard --logging_dir="./logs/PanchyBriefs_anibase_v1" ^
  --save_model_as=safetensors ^
  --network_module=networks.lora_anima ^
  --network_dim=4 ^
  --network_alpha=4 ^
  --network_dropout=0.1 ^
  --network_args rank_dropout=0.1 module_dropout=0.1 ^
  --learning_rate=1e-4 ^
  --optimizer_type="AdamW8bit" --optimizer_args weight_decay=0.11 betas=(0.9,0.99) ^
  --lr_scheduler="cosine_with_restarts" ^
  --lr_scheduler_num_cycles=3 ^
  --lr_scheduler_power=1 ^
  --lr_warmup=5 ^
  --max_train_epochs=60 ^
  --save_every_n_epochs=60 ^
  --gradient_checkpointing ^
  --cache_latents ^
  --max_data_loader_n_workers 0 ^
  --metadata_author="RisingV" ^
  --metadata_description="trigger words: 1girl, panchy, blonde hair, short hair, curly hair, closed eyes, earrings, tube top, jeans, green belt" ^
  --mixed_precision="fp16" ^
  --min_snr_gamma=1 ^
  --multires_noise_discount=0.1 ^
  --multires_noise_iterations=6 ^
  --noise_offset=0.03 ^
  --ip_noise_gamma=0.1 ^
  --save_precision="fp16" ^
  --scale_weight_norms=1 ^
  --seed=0 ^
  --xformers ^
  --split_attn ^
  --vae_disable_cache


pause

The batch file has to be placed in the sd-scripts install directory. It can be executed like any .exe file on Windows (select it and hit enter, or double-click).
The first line in the batch file, “call ./venv/Scripts/activate” (on Linux replace this with “source venv/bin/activate”), activates the virtual environment that contains what you need for training (e.g. pytorch and other packages). The next command, “accelerate launch --num_cpu_threads_per_process 1 anima_train_network.py”, starts the training with the anima_train_network script.
All the other lines are parameters we pass as flags (indicated by the preceding “--”) to the training script. Normally you would have to write all the parameter flags on the same line as the training command, but since this is hard to read, we use a line break with a “^” symbol at the end of each line (on Linux you have to use “\” instead).
In the following I try to explain what the different parameters (flags) do, as far as I know:
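
If you prefer not to maintain long batch files, the same command line can also be assembled from a small Python script, which makes it easy to queue several runs with different settings. The following is only a sketch: the helper build_command and the run names run_a and run_b are made up, and the model, VAE and dataset paths required for a real run are left out for brevity.

```python
import shlex

def build_command(script, params, flags=()):
    """Build an accelerate launch command as an argument list."""
    cmd = ["accelerate", "launch", "--num_cpu_threads_per_process", "1", script]
    for name, value in params.items():
        cmd.append(f"--{name}={value}")
    cmd.extend(f"--{flag}" for flag in flags)
    return cmd

# Settings shared by all runs, and the per-run differences (hypothetical values).
common = {"network_module": "networks.lora_anima", "learning_rate": "1e-4"}
runs = [
    {"output_name": "run_a", "network_dim": 4, "network_alpha": 4},
    {"output_name": "run_b", "network_dim": 8, "network_alpha": 8},
]
for run in runs:
    cmd = build_command("anima_train_network.py", {**common, **run},
                        flags=("gradient_checkpointing", "cache_latents"))
    print(shlex.join(cmd))
    # To actually start the run: subprocess.run(cmd, check=True)
```

The script only prints the commands. Passing the argument list to subprocess.run from within the activated virtual environment would start the runs one after another, similar to several commands in one batch file.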

--pretrained_model_name_or_path, --qwen3, --vae:
These are for setting the paths to the base model components (transformer, text encoder, vae) you need for training.

--dataset_config: Path to the dataset config file.

--output_dir, --output_name: Directory and name for storing the output (LoRA) files. Directory will be created if it’s not already there.

--log_with=tensorboard, --logging_dir: If specified, training is logged using tensorboard, and logging_dir sets the directory for the log files. Use this if you want to see the loss graph and the learning rate over time.

--save_model_as: File format of the LoRA file. Should be left at safetensors.

--network_module: Module used for training. Since we are training an anima LoRA set to networks.lora_anima.

--network_dim: The network dimension of the LoRA. This determines the number of parameters in the LoRA to be trained and thus the size of the LoRA file.

--network_alpha: The alpha value of the LoRA. A parameter affecting learning rate and accuracy of the LoRA weights. Look here for a more detailed explanation.

--network_dropout, --network_args rank_dropout module_dropout:
Dropout values are used as a regularization technique to prevent overfitting. Additional network parameters can be passed with the --network_args flag.

--learning_rate: The global learning rate. This is the learning rate for the transformer and the text encoder. The learning rate of the llm_adapter is set to zero by default, regardless of the global learning rate (training the llm_adapter is considered bad practice, since it can lead to "catastrophic forgetting"). If you want different learning rates for transformer and text encoder, a separate text encoder learning rate can be specified with “--text_encoder_lr”.

--optimizer_type, --optimizer_args: Specifies the optimizer and passes arguments to it.

--lr_scheduler, --lr_scheduler_num_cycles, --lr_scheduler_power, --lr_warmup:
These parameters control how the learning rate changes over time. With “cosine_with_restarts” the lr starts at the specified learning_rate and drops to zero along a cosine curve. This is repeated the number of times specified by “--lr_scheduler_num_cycles”. With the warmup flag there is a warmup phase at the beginning of the training (the number gives the percentage of total steps used for warmup) to prevent overfitting.
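
The shape of this schedule can be sketched with a few lines of Python. The formula below is the usual one for a cosine schedule with hard restarts and linear warmup, as found in the Hugging Face scheduler utilities; that sd-scripts produces exactly these values is an assumption. The function lr_at is made up for this illustration and takes the warmup length as a number of steps.

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, num_cycles=3, warmup_steps=0):
    """Learning rate for a cosine schedule with hard restarts and linear warmup."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle runs the cosine from 1 down to 0, then restarts at 1.
    cycle_pos = (num_cycles * progress) % 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))

total = 1200
for step in (0, 200, 399, 400, 800, 1199):
    print(step, f"{lr_at(step, total):.2e}")
```

With 3 cycles over 1200 steps, the learning rate falls towards zero within each block of 400 steps and then jumps back to the full value at the start of the next block.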

--max_train_epochs: Number of epochs to train. Alternatively you can specify the total number of steps with “--max_train_steps”; if both are given, “--max_train_epochs” takes precedence.

--save_every_n_epochs: Saves network weights (LoRA weights) in a separate file every n epochs. Alternatively you can specify “--save_every_n_steps”.

--gradient_checkpointing, --cache_latents:
These flags help to save VRAM. Using “--gradient_checkpointing” slows down training a bit, but disabling it will increase VRAM usage significantly, so set this flag unless you have a lot of VRAM.

--max_data_loader_n_workers: Specifies the number of processes for multi-process data loading. Leave at zero to disable it on Windows, since the pytorch implementation for Windows handles this poorly and causes significantly slower training (for me, not setting it to zero doubled the training time). The multi-process data loader is probably only useful if you train on a large number of images (like 10k+).

--metadata_author, --metadata_description:
You can add some metadata to the LoRA file with this. Useful if you want to make the LoRA public. In that case you should add your alias to mark the LoRA as your creation. I usually also add trigger words in the metadata.

--mixed_precision: This improves training speed and reduces VRAM usage by doing mixed-precision computations. Should always be used if supported by your GPU (some older GPUs might not support it). “fp16” and “bf16” are both valid options, though the early anima implementation in sd-scripts had problems with “fp16”, so make sure you have the latest version of sd-scripts installed.

--min_snr_gamma, --multires_noise_discount, --multires_noise_iterations, --noise_offset,
--ip_noise_gamma:
Various parameters using noise to improve stability, learning details, regularization. I never tried to alter the values used above. For more information please refer to the sd-scripts docs.

--save_precision: Number precision the LoRA is saved with. Defaults to training computation precision.

--scale_weight_norms: Helps to avoid overfitting. Default value is 1.

--seed: Sets a specific value for the random training seed. If not set, a random seed is chosen.

--xformers, --split_attn:
Sets the attention mechanism to xformers; “--split_attn” is required for it. This decreases VRAM usage and increases training speed. You can select a different attention implementation with "--attn_mode", please refer to the wiki for that. If you want to use xformers, sage-attention or flash-attention, you need to install the respective packages first.

--vae_disable_cache: Saves some VRAM.

The pause command at the end of the file keeps the terminal window open after training has finished, so the training output stays visible. There are a lot of other settings available for anima training in sd-scripts, e.g. for further reducing VRAM usage or generating sample images; please refer to the docs in the sd-scripts repository for those.

Most of the settings above I do not change. The only settings I vary depending on dataset and purpose are --network_dim, --network_alpha, --learning_rate and --max_train_epochs (steps). So which settings do I use for these?

For character LoRAs I normally use network_dim=network_alpha=8 (though a value of 4 for both may be enough for simple LoRAs) and learning_rate=1e-4. With these values 1000-1500 steps are enough to learn the character sufficiently well (with 20-30 images). I also did some training runs with more steps and the LoRA did not overfit.

For multi-outfit character LoRAs more steps are needed to learn the outfits properly (see example below).

For style LoRAs I have used network_dim=network_alpha=16 with a learning rate of 5e-5 and 6500 steps using a dataset of 65 images. Style seems to benefit from a lower learning rate and more steps to learn details better.

The settings for concept LoRAs depend largely on purpose and dataset. For the concepts I have trained so far, I used dim=alpha=8 or lower and a learning rate of 1e-4.
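
To get a feeling for what dim and alpha do, the following Python sketch computes the number of extra parameters a LoRA adds to a single linear layer and the scaling factor alpha/dim that is applied to the LoRA output. The layer size of 2048 x 2048 is a made-up example (real anima layer sizes may differ), and both helper functions are hypothetical.

```python
def lora_extra_params(in_features, out_features, dim):
    """Parameters added by one LoRA pair: down (dim x in) plus up (out x dim)."""
    return dim * (in_features + out_features)

def lora_scale(alpha, dim):
    """Factor applied to the LoRA output before it is added to the base layer."""
    return alpha / dim

# Hypothetical 2048 x 2048 linear layer; real anima layer sizes may differ.
for dim in (2, 4, 8, 16):
    print(dim, lora_extra_params(2048, 2048, dim), lora_scale(alpha=dim, dim=dim))
```

The parameter count, and with it the file size, grows linearly with dim. With alpha equal to dim, as in all settings used here, the scale stays at 1, so changing dim does not also change the effective strength of the weight updates.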

Training Examples

Since every dataset and LoRA is different, here are a few examples of LoRAs I trained. I have attached the dataset config files and training settings (batch files) for these to the article.

Simple character LoRAs:

Panchy Brief
dataset: 20 images of different style and various outfits (one outfit trained)
base model: anima-base-v1.0
dim: 4
alpha: 4
lr: 1e-4
epochs: 60
total steps: 1200

Nami (Syrup Village Arc)
dataset: 21 images (anime screencaps only)
base model: anima-base-v1.0
dim: 8
alpha: 8
lr: 1e-3
epochs: 60
total steps: 1260

Bibi Blocksberg
dataset: 30 images (17 cartoon screencaps in low resolution and 13 artworks of various style, all images with the same outfit)
base model: anima-preview3
dim: 8
alpha: 8
lr: 1e-4
epochs: 100
total steps: 3000

Multi-outfit character LoRA:

May (7 beach outfits)
dataset: 59 images (12+12+12+12+4+4+3 distribution over different outfits, various styles)
base model: anima-base-v1.0
dim: 8
alpha: 8
first training run:
lr: 1e-4
epochs: 75
steps: 6300
steps per outfit: 900

second training run:
lr: 5e-5
epochs: 15
steps: 1260
steps per outfit: 180

third training run:
lr: 5e-5
epochs: 10
steps: 840
steps per outfit: 120

fourth training run:
lr: 5e-5
epochs: 10
steps: 840
steps per outfit: 120

I trained this with 900 steps per outfit in the first run at lr 1e-4 and then used the LoRA file of the first run to train further (you need to specify “--network_weights” with a path to the file for that in the next training run). In the second and subsequent third and fourth training runs I used a lower lr to train the details of the outfits until I was satisfied with the accuracy.

Style LoRA:

Kyhu Style
dataset: 65 images (drawings and sketches, Korra was present in most of the images, but since anima already knew her it wasn’t much of a problem), used trigger word in the captions
base model: anima-preview3
dim: 16
alpha: 16
lr: 5e-5
epochs: 100
total steps: 6500

Concept LoRAs:

Blues Brothers Outfit
dataset: 18 images (various styles and characters wearing the outfit)
base model: anima-base-v1.0
dim: 8
alpha: 8
lr: 1e-4
epochs: 75
total steps: 1500

Poké Doll
dataset: 60 images (various styles with different settings and sizes of the object including images of the object alone and in interaction with characters)
base model: anima-base-v1.0
dim: 4
alpha: 4
lr: 1e-4
epochs: 30
total steps: 1800

0w0 (w shaped mouth with solid white eyes)
dataset: 30 images (different characters, mostly chibi style)
base model: anima-base-v1.0
dim: 8
alpha: 8
lr: 1e-4
epochs: 40
total steps: 1200

Union Berlin Soccer Jersey
dataset: 20 images (realistic promo photos and renders)
base model: anima-preview3
dim: 8
alpha: 8
lr: 1e-4
epochs: 100
total steps: 2000

Heart hands over mouth gesture
dataset: 7 images (variations of one image)
base model: anima-base-v1.0
dim: 2
alpha: 2
first training run:
lr: 1e-4
epochs: 20
steps: 140

second training run:
lr: 1e-4
epochs: 20
steps: 140

third training run:
lr: 5e-5
epochs: 20
steps: 140

Since I only trained on variations of a single base image here, I did not know how many steps I needed for the concept to bake in (without overfitting), so I had to make three training runs.

Conclusion

In this guide I provided settings for training LoRAs with the Anima base model using sd-scripts. The given examples show that the basic settings can be used for a variety of (small) datasets and concepts, adjusting only the dim (rank), learning rate and training steps to the specific use case. At least I am mostly satisfied with the training results. Of course the LoRAs I trained are still a small sample, and finding good settings for different datasets/LoRAs is a work in progress.