After copy-pasting a guide I wrote on Discord several times, I think it's time I consolidated and expanded on it here on Civit. With so many guides out of date and/or containing incorrect information, I hope this will be helpful to aspiring and current trainers alike.
Keep in mind this is based on my own workflow and experiences! It's by no means perfect, and I don't plan to make this an absolute encyclopedia, either. Just a good ol' crash course to get your feet wet.
I am also going to assume you have a system capable enough for local training. (Though this theoretically should be mostly applicable to training in other environments.)
That being said, this guide will assume you haven't even gathered a dataset yet, so let's dive in!
Part 1 | Datasets: Gathering & Basics
Your dataset is THE MOST IMPORTANT aspect of your LoRA, hands down. A bad dataset will produce a bad LoRA every time, regardless of your settings. Garbage data in gives garbage data out!
Ideally, a good LoRA will be trained on 30 images at a minimum, but you can work with more or fewer. IMO, more than 50 is overkill, and going below 15 can make things overly difficult, but I've trained with as few as 8 before. Staying near that 30-image sweet spot is ideal, though. There is a method to artificially expand your dataset, but we'll touch on that later.
When assembling your images, go for quality over quantity. A well-curated set of 30 images can easily outperform a set of 100 poor-to-mediocre ones. In smaller datasets especially, a single "bad" image can skew the entire model. That being said, bad images CAN be used to pad out a dataset, but they should be tagged properly (such as with "colored sketch", which will be talked about later).
You should also ensure your images are varied. Too many images in the same or similar styles will bake that style into your concept, making changing the style exceptionally difficult and biasing any style changes you do make. Be especially careful when dealing with lots of screenshots and renders. If you do have a significant amount, tagging them with the artist that made them, as a render, etc., can help tie the style to another tag and reduce the impact.
I would also recommend avoiding fetish-themed images when working with characters (unless you want that out of your LoRA), as even when tagged, their often extreme anatomy can skew your model worse than if they weren't present at all. You can of course use them to expand your dataset if you truly need to, but make sure they are tagged thoroughly.
Personally, I gather my data from a variety of sites: e621, FurAffinity, DeviantArt, Pixiv, and the Monster Hunter wiki (and game wikis in general) are my common sources. Again, try to avoid pulling too much from the same artist or similar styles. Google image search is also worth a look if you need more data, as it can often find isolated instances from Reddit, Steam community feeds, and other sites you may not have thought of looking through.
Pixiv is a godsend for Monster Hunter and other eastern franchises as a JP art site, but finding specifics can be difficult at times, as you need to use Japanese text to guarantee your search. Thankfully, a number of wikis also include Japanese names where applicable.
As you gather your images, consider what it is you want to train: A character? A style? An object? Some clothing?
For most of these, you should look for solo images of the subject in question. Duos also work, but you should only grab them if extra individuals can be easily removed or cropped out. Duo/trio/group images are still OK to use if you really need the images, but if they can't be cleaned they should be avoided where possible. If you do include them, make sure they are properly tagged and kept to a minimum.
If you're planning on training a style, know that those are more advanced, and are better suited to LyCORIS than LoRA. This guide will still be largely applicable, but check the later parts for details specific to them.
Once you have your images, place them in a folder for preparation in the next part. (For example, "Training Folders/Concept Folder/Raw")
Part 2 | Datasets: Preparation
Once you have your raw images from part 1, you can begin to preprocess them to get them ready for training. You will need a photo editor program handy, I recommend Photopea as a free web alternative to photoshop.
Personally, I separate my images into two groups: Images that are ok on their own, and images that require some form of editing before use. Those that meet the below criteria are moved to another folder and then edited accordingly.
Take a look at the extension of your images. .webp images (usually pulled off wikis) are incompatible with current trainers, and must be converted to PNG or JPEG. While you do that, note that images with transparent backgrounds also cause issues. These should be brought into your image editor of choice; just giving them a white BG and saving will suffice.
Next, consider your training resolution. Higher resolutions let you get more detail out of an image, but will slow your training time. Most people still train at 512 or 768, but I train on Fluffyrock, which supports higher native resolutions, and train at 960 or 1088. Any image larger than your resolution will be scaled down automatically. Resolution will be detailed further later.
Once you have an idea of your resolution, take a look at your dataset. Keep in mind that non-square images will be scaled to maintain their proportions, so having lots of empty background can be detrimental to getting details. Wide and tall images with lots of empty background can be cropped to focus on the subject.
Additionally, if you have hi-res images with multiple depictions of your subject (like a reference sheet), you can crop the image into multiple parts so it's trained as several images rather than one overly compressed one. Such images can also be trained without cutting and cropping; just be sure to tag them with "multiple views" and "reference sheet" later on.
Images with people other than the subject should be edited to have them removed if possible, be it via cropping or removal.
If possible, watermarks, links, text/speech bubbles, and signatures should be removed. You should remember that AI learns off of repetition, and the same signature in the same corner will be something it tries to hold on to. It's alright if a handful still have them, but ideally you want as few as possible. Since repetition is key, outliers are less likely to stick. The magic eraser tool is very useful for any of these that aren't on a flat color background.
If you have images smaller than your training resolution, consider upscaling them. Upscalers like 4x_Ultrasharp are great for this.
If you have images larger than 3k pixels, downscale them to 3k or less. Apparently, the Kohya trainer has some minor issues handling very large images - I'm not entirely sure why, but downscaling oversized images in my dataset showed some improvement.
Lastly, if you have a subject with asymmetrical details (like a marking, logo, single robot arm, etc), make sure it is facing the same way in each image. Images incorrectly oriented should be flipped for consistency.
Once you've done the above to your entire dataset, place the images in a new folder alongside your "raw" folder, like so: "Training Folders/Concept Folder/X_ConceptName". The 'ConceptName' will be your trigger tag, and the 'X' will be the number of repeats of that folder per epoch, which will be detailed later. It should look something like "1_Hamburger".
Part 2.5 | Datasets: Curing Poison
Since Nightshade especially is getting a lot of traction right now, I figure I'll put a section here covering "poisoned" images. You won't run into these too often, but it's quite possible as they increase in popularity.
The purpose of image poisoning tools like Glaze and Nightshade is to add "adversarial noise" to an image, which disrupts the learning process by effectively adding insane outliers and obscuring the original data it would train on. As such, including a poisoned image in your dataset can result in strange abnormalities, be it color variations, distorted anatomy, etc. The more poison you have, the worse the effects will be. Ideally, you don't have any - but you CAN still use them.
These "poisons" have a hilarious weakness - the very noise they're introducing. By simply taking the desired image, and putting it through an AI upscaler good at denoising (like with jpeg artifacts or the sort), or even just a general upscaler, you can just... strip away the poison. It's that easy, usually. People are still experimenting with the "best" methods for removal, but frankly, especially with Nightshade, pretty much any method can clean the image to a usable state.
"Smoothing" or "Anti-artifact" upscalers work best for the job, used with one of the two following methods:
A: Just upscale it. 2x is usually fine.
B: Downscale to half or 3/4 size, then upscale with AI. Works best with already large images, small resolution images would lose too much detail.
Alternatively, "adverse cleaner" can do a decent job, and exists as an extension for A1111 or as a HF Space. Combined with the upscaling methods above, you can effectively neutralize the "poison" entirely.
"But how do I recognize a poisoned image?"
It depends on how aggressively the work was poisoned. If it looks like shit because a 3yr old seemingly put a silly camera filter on it, it has some pretty obvious artifacting, or the entire image looks covered in JPEG compression artifacts, it's poisoned 9 times out of 10. Less aggressive poisons are harder to detect, but have less of an impact on your training. If you're unsure, take a close look at it in a photo editor, and/or just run it through the cleaning methods above to be safe.
As a general note, the individuals who put hyper-aggressive poisons on their work are usually delusional enough their art isn't even worth using in the first place - Self-respecting artists generally keep the poison minimal to not affect their work visually in any major way, or just don't use it. If you don't feel like dealing with poison, pull your data from older images if you want to be wholly safe, or just learn to identify poisons and skip by them.
Part 3 | Datasets: Tagging
Almost done with the dataset! We're in the final step now, tagging. This will be what sets your trigger tag(s), and will affect how your LoRA is used.
There are a variety of ways to do your tagging, and a multitude of programs to assist with tagging or do auto-tagging. However, in my opinion you shouldn't use auto-tagging (especially on niche designs and subjects), as it creates more work than it saves.
Personally, I use the Booru Dataset Tag Manager and tag all of my images manually. You COULD tag without a program, but just... don't. Manually creating, naming, and filling out a .txt for every image is not what you want to do with your time.
Thankfully, BDTM has a nice option to add a tag to every image in your dataset at once, which makes the beginning of the process much easier.
Before you tag, you need to choose a model to train on! For the sake of compatibility, I suggest you train on a base model, meaning a base release or finetune that is NOT a mix of other models. Training on a mix is still viable, but in my experience makes the outputs less compatible with anything other than that model.
Now, for the tagging itself. Before you do anything, figure out what type of tags you'll be using:
Currently, there are three prompting styles: natural language prompting, booru tag prompting, and e6 tag prompting. Which you should use depends on your model's "ancestry."
Base SD1.5, base SDXL, and (currently) most SDXL models use natural prompting, ex: "A brown dog sleeping at night, they are very fluffy." It sometimes works on other models, but is not recommended. SDXL is also an absolute pain to train.
The vast majority of Anime models you see use Booru prompting, specifically using the tag list from Danbooru, an anime image board. I hear Anything v4.5 is a good choice.
Models with an ancestry based on furry models use e6 prompting, using the tag list from e621. Fluffyrock or BB95 is a good choice here.
Once you know what model and tags you're using, you can start tagging.
Your FIRST tag on EVERY image should be your activation tag, aka what you named X_ConceptName, in this case "conceptname". If your model already has your subject even remotely trained to that tag, consider changing your activator to a string the model wouldn't know. For example, "hamburger" could become "hmbrgrlora". This isn't always required, but if you see wacky results that stem from the model's original interpretation, you might want to do so.
My process works something like:
Add to all images at once: Activator tag (usually species), anthro/feral/human, gender (if applicable, ferals not specified), controllable elements (ie. a character-specific outfit), nude, other common controllables (like most common eye color).
Move to first image; Remove if needed: controllable elements. Change if needed: nude (to general outfit tag(s)), eye color, etc.
Add tags you would consider to be "key" elements to the image: Specific mediums (like watercolor), compositionals (ie three-quarter portrait), etc.
Add tags to describe deviated aspects: huge/hyper breasts, horns/scales/skin of varied color, etc.
Repeat for each image.
That being said, don't go overboard with your tags. If you use too many, you'll "overload" the trainer and get less accurate results, as it's trying to train to too many tags. It's generally best practice to only tag items you would consider a "key element" of the image. Undertagging is better than overtagging, so if in doubt keep it minimal. I usually have ~5-20 tags per image, depending on their complexity.
Backgrounds and poses can often be ignored, but if you have specific kinds of locations/poses/BGs in a significant number of your images, you should tag them to prevent biasing.
For example, if you have a lot of white backgrounds, you should tag "white background". If after a training you see a specific pose being defaulted to, you should find all instances in your dataset using that pose and tag it.
You should also be wary of "implied" tags. These are tags that imply other tags just by their presence. When using such a tag, you shouldn't also include the tag(s) it implies. For example, "spread legs" implies "legs", "german shepherd" implies "dog", and so on. Having tags that are implied by another spreads the training between them, weakening the effect of your training. In large quantities, this can actually be quite harmful to your final results.
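This cleanup is mechanical enough to script if you want to sanity-check a tag list. A sketch with a tiny hypothetical implication map (a real map would come from your booru's implication data):

```python
# Tiny hypothetical implication map: tag -> tags it implies.
IMPLIES = {
    "spread legs": ["legs"],
    "german shepherd": ["dog"],
}

def strip_implied(tags, implies=IMPLIES):
    """Drop any tag already implied by another tag in the same list."""
    redundant = set()
    for t in tags:
        redundant.update(implies.get(t, []))
    return [t for t in tags if t not in redundant]
```

For instance, running this on ["spread legs", "legs", "german shepherd", "dog"] keeps only "spread legs" and "german shepherd".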
Tagging low-quality images: Sometimes, you just don't have a choice but to use poor data. Rough sketches, low-res screenshots, bad anatomy, and others all fall into this category.
Sketches can usually be tagged "colored sketch" or "sketch", which usually is all you need to do. If uncolored, "monochrome" and "greyscale" are usually good to add, as well.
Low resolution images should be upscaled with an appropriate upscaler if possible, such as one of the many made to upscale screenshots from old cartoons, for example. If you can't get a good upscale, use the appropriate tag for your model to denote the resolution quality.
Bad anatomy should be tagged as you see it, or cropped out of frame if possible. Images with significant deviations that can't be cropped or edited, like the neck/head/shoulders being off-center, are usually best left out of the dataset entirely.
Once you've tagged all your images, make sure you've saved everything and you'll be good to go for the next step.
Part 3.5 | Tagging: Examples
Since examples are usually quite helpful, I'll put a handful of examples from my own datasets here for your own reference. Keep in mind: I usually train on fluffyrock, a model that uses e6 tagging. Other models should swap tags to their own variants where required. (ex: side view (e6) > from side (booru))
mizutsune, feral, blue eyes, bubbles, soap, side view, action pose, open mouth, realistic, twisted torso, looking back, white background
White backgrounds were more prevalent in this dataset, so the background was tagged.
arzarmorm, human, male, black hair, brown eyes, dark skin, three-quarter view, full-length portrait, asymmetrical armwear, skirt, pouches, armband, pants
In this case, the model wasn't cooperating with just the trigger tag alone, so the tags "asymmetrical armwear, skirt, pouches, armband, pants" were used as reinforcement, which also detached them from the main concept, allowing them to be controlled individually.
This LoRA also had very few instances of white backgrounds, so leaving it untagged was a non-issue.
Part 4 | Training: Basics
Now that you have your dataset, you need to actually train it, which requires a training script. The most commonly used script, which I also use, is the Kohya-SS GUI. While other options exist, for the sake of compatibility I'll stick with Kohya as a frame of reference.
Once you have it installed and open (Install is actually quite easy.), make sure you navigate to the LoRA tab at the top (it defaults to dreambooth, an older method.)
There are a lot of things that can be tweaked and changed in Kohya, so we'll take it slow. Assume that anything I don't mention here can be left alone.
Yellow text like this denotes alternative, semi-experimental settings I'm testing. Feel free to give feedback if you do use them, but if you're looking for something stable, ignore these. These settings will change frequently as I test and train with them. Once I'm happy with a stable setup incorporating them, they will be adopted into the main settings.
Firstly, you'll find yourself in the "Source Model" tab.
Click on "model quick pick" and select "custom".
"Save trained model as" can stay as "safetensors". "ckpt" is an older, less secure format. Unless you're purposefully using an ancient pre-safetensor version of something, ckpt should never be used.
In "Pretrained model name or path", input the full file path to the model you'll use to train.
Underneath that, there are 3 checkboxes:
v2: Check if you're using an SD 2.X model.
v_parameterization: Check if your model supports V-Prediction (VPred).
SDXL Model: Check if you're using some form of SDXL, obviously.
Next, move to the "Folders" tab.
"Image folder" should be the full file path to your training folder, but not the one with the X_. You should set the path to the folder that folder is inside of. Ex: "C:/Training Folders/Concept Folder/".
"Output folder" is where your models will end up when they are saved. Set this to wherever you like.
"Model output name" will be the name of your output file. Name it however you like.
Next, move to the "parameters" tab, which will put you on the "basic" subtab.
"Lora Type" should be kept as standard.
"Lora Type": LyCORIS/LoCon
Now that a1111 and other UIs have built-in LyCORIS support, there is really no downside to training as a LyCORIS over a standard LoRA (that I know of).
While they have slightly larger file sizes, being able to affect more of the base model more than makes up the difference.
"LyCORIS Preset": Full
"Train Batch Size" is how many images will be trained simultaneously. This can speed up your training, but can cause less accurate/more generalized results, and isn't always beneficial. I usually keep this at 1, but never go higher than 4. This will also increase your vram usage substantially.
"Epoch" is the value you will be changing the most out of everything. Remember the "X_" on your training folder? This is where it's important. A single epoch, in steps, is the number of images you have multiplied by the "X_" number. What you set this value to depends on your dataset, but as a rule of thumb I start with a number that has each image trained 100 times. If your folder is 1_, this would be 100 epochs; if your folder is 10_, it would be 10 epochs. Both total the same number of steps. While not perfect, it makes a good starting point. Some concepts need less, some need more. It will be up to you to test your resulting output LoRA and see where it stands.
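The arithmetic above can be sketched out like so (batch size 1 assumed; the function names are just mine for illustration):

```python
def total_steps(images: int, repeats: int, epochs: int) -> int:
    """Total training steps at batch size 1: images * folder repeats * epochs."""
    return images * repeats * epochs

def epochs_for_passes(passes_per_image: int, repeats: int) -> int:
    """Epochs needed so each image is trained `passes_per_image` times."""
    return passes_per_image // repeats

# 30 images, aiming for each image trained ~100 times:
assert epochs_for_passes(100, 1) == 100   # a "1_" folder needs 100 epochs
assert epochs_for_passes(100, 10) == 10   # a "10_" folder needs only 10
assert total_steps(30, 1, 100) == total_steps(30, 10, 10) == 3000
```

Either folder setup lands on the same 3000-step total, which is why the two are interchangeable.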
"Save every n epochs" saves a copy of your LoRA every N epochs as training progresses. These checkpoints can be useful to go back to and see where your sweet spot might be.
"Mixed precision" and "Save precision" should both be set to the same value. "fp16" holds more data, but is slower to train. "bf16" holds slightly less data, but is faster to train. Choose based on your needs, but I stick with fp16 as it is better for more complex designs.
"Cache latents" and "Cache latents to disk": These affect where your data is loaded during training. If you have a recent graphics card, "cache latents" is the better and faster choice which keeps your data loaded on the card while it trains. If you're lacking VRAM, the "to disk" version is slower but doesn't eat your VRAM to do so.
"Optimizer": There are a number of options to choose from, but the four worth using IMO are Prodigy, DAdaptAdam, AdamW, and AdamW8bit. Prodigy is the newest, easiest to use, and produces exceptional results. The AdamW optimizers are quite old, but with fine tuning can produce results better than prodigy in a faster time. For the purposes of this guide, we'll be using Prodigy. (DAdaptAdam is very similar to Prodigy, and these settings should be largely applicable to it, as well. It has a less aggressive learning method, so if you're having issues with Prodigy try this out.)
"LR Scheduler": When using Prodigy/DAdapt, use only Cosine. When using an Adam opt, Cosine With Restarts is usually best. Other schedulers can work, but affect how the AI learns in some pretty drastic ways, so don't mess with these until your understanding of them is better.
"Optimizer extra arguments": If using Prodigy/DAdapt, set to "decouple=True weight_decay=0.1 betas=0.9,0.99", otherwise, leave empty.
weight_decay=0.15 seems to be a bit better for my Prodigy workflow?
"Learning Rate": When using Prodigy/DAdapt, set this to 1. Prodigy and DAdapt are adaptive and set this automatically as it trains.
"Max resolution": For most models, you'll want this set to 768,768. Models that allow for larger native generation (like SDXL for example) can use larger values like 1024,1024. You should not set this to be larger than your model can generate natively. Less powerful cards can train at 512,512, but will have reduced quality.
"Enable buckets": True. This groups similarly sized images together during training. This is meant for batch training, but doesn't hurt to keep on.
"Max bucket resolution": Should be the same as your training resolution, 768 in this example. Any image larger than this size will be scaled down to fit it.
"Network Rank & Network Alpha": These affect how large your file will be and how much data it can store. What you set this to will be dependent on your subject. If you're training something similar to what your model already knows (like an anime girl on an anime model) a Rank/Alpha of 8/8 will probably work. For most cases though, 32/32 is a good starting point. While you can go up to 128/128, that is absolute overkill that just bloats your file and in some cases can make your training results worse. Generally, you shouldn't need to go higher than 64. Your Alpha should be kept to the same number as your Rank, in most scenarios. Adaptive optimizers like Prodigy and DAdapt should set their alpha to 1.
"Convolution Rank & Alpha": Rank of 16 w/ an alpha of 1. Going higher than 16 seems to give diminishing returns, and may actually harm outputs.
"Scale weight norms": This lets your LoRA work well with other LoRAs in tandem.
If you plan to use your LoRA with other LoRAs, set this value to 1.
If your LoRA will likely only ever be used on its own, leave at 0.
Weights that get too "heavy" are scaled down, reducing their impact. This allows multiple LoRAs to work in tandem by not fighting over values, but depending on your concept, it CAN negatively affect your final outputs.
Setting to values higher than 1 will reduce the impact, but also reduce cross-compatibility.
The scaling seems to have significantly less of a negative impact on LyCORIS training, given the learning is spread over more weights. Can usually be kept at 1 without worry.
The three "dropout" values beside it I keep at 0.
Now that those are set, we can move to the "advanced" subtab. We won't touch much here.
"Additional parameters": If your model supports zSNR, use "--zero_terminal_snr".
Vpred models can also use "--scale_v_pred_loss_like_noise_pred". After a recent update, this parameter is no longer required; new and updated Kohya installs can remove/ignore it.
"Clip skip": Should be set to the clip skip value of your model, most anime models use 2, most others use 1. If you're unsure, most civit models note the used value on their page.
"Gradient Checkpointing": Check if using an Adam opt. I haven't seen much of an improvement when used with Prodigy, but this usually is a slight performance boost. Has no effect on output quality.
"Persistent Data Loader": This option keeps your images loaded in-between epochs. This eats a LOT of your VRAM, but will speed up training. If you can afford to use it, use it.
"Flip Augmentation": This allows you to essentially double your dataset by duplicating your images and mirroring them horizontally during training. This can be especially useful if you have few images, but DO NOT use this if you have asymmetrical details that you want to preserve.
"Min SNR Gamma": 5 is a known good value.
"Noise offset type": Original
"Noise offset": Personally, I keep this at 0, as in my experience anything else has always gotten worse results.
"Debiased Estimation Loss": True
Seems to help with color deviation, supposedly makes training need fewer steps?
And that's everything! Scroll to the top, open the "configuration file" dropdown, and save your settings with whatever name you'd like. Once you've done that, hit "start training" at the bottom and wait! Depending on your card, settings, and image count, this can take quite some time.
Part 5 | Q&A
This section is reserved for tips, tricks, and other things I find handy to know that don't quite fit elsewhere. I'll try and update this periodically.
Q: I see in a lot of guides to train to 2000 steps or something similar, but you go by epochs. Why?
A: Due to the inconsistency in the size and quality of datasets, steps end up being a completely arbitrary means of measurement. Epochs, at least, count full folder repetitions, and make for a better means to measure training time. Many of those guides are also training very easy concepts, which the training will pick up on faster than others. Don't be worried if you massively overshoot that step count.
Q: I see other guides saying to set your Network Alpha to half of the Rank, why don't you?
A: That is a fairly old misconception that still gets thrown around a lot. Alpha functionally acts as a means to scale your learning rate: setting it to half your Rank is effectively half the learning rate. It doesn't hurt to have it at half or even lower, but you will likely need a longer training.
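Put another way, the alpha/rank ratio acts as a multiplier on your effective learning rate. A quick sketch of that relationship (the function is just illustrative):

```python
def effective_lr(base_lr: float, alpha: float, rank: int) -> float:
    """LoRA updates are scaled by alpha/rank, so alpha effectively scales the LR."""
    return base_lr * (alpha / rank)

assert effective_lr(1e-4, 32, 32) == 1e-4  # alpha == rank: no change
assert effective_lr(1e-4, 16, 32) == 5e-5  # alpha == rank/2: half the learning rate
```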
Q: My training script is showing a loss value that keeps changing as training goes, what is it?
A: For most cases, you don't need to worry about loss, nor should you obsess over specific values or ranges. The only time you should pay attention is if it sits around a certain range for most of the training, only to make a massive change later on. That's a sign something may have gone wrong, or it started to overtrain.
Q: How do I tell if my LoRA is under/overtrained?
A: Both should be fairly obvious, even to the untrained eye. If you're undertrained, you'll likely see "mushy" or incomplete details, or a very low adherence to details. If you're overtrained, you may have odd, over-saturated colors, style biasing, pose biasing, etc. These will vary depending on your dataset, so keep an eye out.
Q: You briefly talked about fp16 and bf16, but what are the "full" versions I'm seeing?
A: "Standard" fp/bf16 use mixed precision, while the "full" versions don't. It's misleading, but the full versions hold less data, and in my experience are downright worse. I'm sure they have their uses, but in most cases you're perfectly fine in staying with mixed precision.
Q: I keep seeing mentions of "Vpred", what exactly is it?
A: Vpred, V-Prediction, and V-Parameterization are all the same thing. While I don't fully understand it at a technical level, as far as I'm aware it is an optimization to the noise schedulers that "predicts" outputs during image generation, allowing a final result to be generated in fewer steps.
Q: What is Min SNR? zSNR? Zero Terminal SNR? Are they the same, or different?
A: No, while similar, they do rather different things. To keep it simple, zSNR (Zero Terminal SNR) is a technique that allows for the AI to generate using a wider color space, including perfect blacks. Think of it like the difference between a normal monitor and a HDR OLED monitor. Min SNR is a method of accelerated training convergence, which allows models to train in fewer steps.
Q: Could I train at a resolution higher than what my training model can do?
A: Can you? Yes. Should you? No. While higher resolutions normally trade speed for quality, in this case you would be trading speed for worse results. Without getting technical, training larger than your model can handle is not good for your outputs.
Q: You mentioned not to "overtag" your images, but how many is too many?
A: This will really depend on your dataset and training settings. Longer trainings can help with overtagging, but run a greater risk of overtraining. Generally, try and keep your per-image total to 20 or below on average, but having outliers with more isn't the worst. Try and avoid tags that aren't important to the image (unless you're finding that the results are clinging to something too much, in that case tag it), and tags that your model has little to no knowledge of. Empty tags are seen as training targets, and will try to be filled. If filled with the wrong data, you can end up with seemingly random tags being required to get the intended result.
Q: What's the difference between a LoRA and a LyCORIS? Are they even different?
A: Every LyCORIS is a LoRA, but not every LoRA is a LyCORIS. LyCORIS specifically refers to a subset of newer LoRA training methods (LoCon, LoHa, LoKR, DyLoRA, etc.). All of these are still LoRAs, but their new methodologies make them structurally different enough to have their own designation. Now that most GUIs have built-in support for them, to an end user they functionally make no difference in their usage. LoRA on its own simply refers to the original method.
Q: My LoRA kinda works, but has very strange, distorted anatomy at times. What happened?
A: More often than not, distorted anatomy originates from your dataset. Look it over for images that are similar to the distortions you are seeing. Uncommon poses, strange camera angles, improperly tagged duo/group images, and other outliers can be likely causes. Try tagging what's applicable, but it's usually best to remove the image entirely or crop out the parts causing issues, if possible.
Q: I've heard a bit about single-tag training, what is it?
A: Training with a single tag is a very old method commonly used by beginners who don't want to spend time tagging. When training to a single tag, the AI will "homogenize" everything it learns from an image into the tag, resulting in highly generalized outputs. This will only even begin to work if every image is of a specific subject (like a character), and has a very high likelihood of latching on to specific backgrounds, poses, and other unwanted variables. If used with anything else that isn't repetitive, you'll end up with what is effectively digital mush. I would not recommend this for any application.
Part 6 | Advanced Training: Multi-Concept LoRA
So you've got your feet wet, and want more of a challenge? Or maybe you've got a character with many outfits? Gender-specific armor? That's where multi-concept training comes in.
The actual training settings for these are almost exactly the same compared to normal LoRAs, with a few caveats:
Do not use a batch size higher than 1. If images from multiple concepts get loaded, they'll generalize into mush, or you'll have one overpower the other.
Be careful with using flip augmentation, as it will apply to every image, not just one concept.
Depending on how many concepts you're training, and how complex they are, you may want to increase your Rank and Alpha values. I recommend trying 32 first and seeing how it performs.
Now, gather your images the same way I detailed before, but separate them based on their concepts (outfits, armors, etc). Any editing, too, should be done like before.
Once you've fully prepared your data, figure out which concept has the most data, and in your concept folder, create a 1_conceptname folder for it.
Now, do the same with your other concepts, obviously replacing "conceptname" with their activator tag.
Once you have your folders named and filled, do the following:
Take the number of images in your largest folder and multiply it by its "X_" repeat value; this is the per-epoch step count every other concept should match. (Multiplying that by your intended number of epochs gives your total step count: (images*folder repeats)*epochs = steps.)
Now, divide that per-epoch count by the number of images in your second largest folder. The result, rounded to the nearest whole number, is what that folder's "X_" should be changed to.
Repeat this for every applicable folder.
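The balancing arithmetic above can be sketched in a few lines. Folder names and image counts here are hypothetical examples; the idea is simply that images × repeats comes out roughly equal for every concept, per epoch:

```python
def balanced_repeats(image_counts, base_repeats=1):
    """Given images-per-concept, return the "X_" repeat value for each folder
    so that images * repeats is roughly equal across all concepts."""
    # Per-epoch step count of the largest folder; everything else matches it.
    target = max(image_counts.values()) * base_repeats
    return {name: max(1, round(target / n)) for name, n in image_counts.items()}

# Hypothetical three-outfit dataset:
counts = {"outfit1": 30, "outfit2": 14, "outfit3": 7}
for name, r in balanced_repeats(counts).items():
    print(f"{r}_{name}: {counts[name] * r} steps per epoch")
```

Note that the rounding means the balance is approximate; that's fine in practice, since exact parity matters far less than avoiding a large imbalance.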
Why do this, you ask?
We do this to balance the dataset. If you keep everything the same, the folder with the most images will dominate the training, leaving the other concepts with only a fraction of it. Balancing ensures every concept gets equal training time, so no single concept dominates while the rest undertrain.
Keep in mind, however, that if a concept folder has very few images, that individual concept can overbake even while the rest of the LoRA is fine. This becomes a bigger issue the larger the discrepancy between it and the largest folder.
Now that your folders are balanced, we should look at how you name them, and what your activator tag for each will be.
If you're training a character with multiple outfits, name your folders like "1_charactertag, outfittag". Your first two tags should be those, in that order.
If you're training something not tied to a character, like gendered armor, I usually just create a tag for each version. For example, "armortagm" and "armortagf" for males and females respectively. Just like before, these should be the first tag on their respective images.
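Putting the naming and balancing together, a hypothetical multi-outfit character dataset might end up looking something like this (folder names, repeat counts, and image counts are examples only):

```
img/
├── 1_charactertag, outfit1tag/   <- 30 images, largest folder
├── 2_charactertag, outfit2tag/   <- 14 images (30/14 ≈ 2.1, rounded)
└── 4_charactertag, outfit3tag/   <- 7 images  (30/7 ≈ 4.3, rounded)
```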
Now that your names and activator tags are settled, you can start tagging! This can be done just like a normal LoRA; you've just got a whole lot more images to go through.
And that's it! Once you've tagged, you can train just like before. You'll likely have much longer training times, given the increase in images, but in the end you'll have multiple concepts in a single LoRA to use as you please.
Part 7 | Advanced Training: LyCORIS & Its Many Methods
LyCORIS gets more advanced by the day, and as it becomes more common, I feel it's best to have a section talking about it. This will be slightly more technical than the rest, but I'll try to keep it to the "need-to-know" stuff.
LoCON: A LoRA that also affects the convolution layers of the base model, allowing for more dynamic outputs.
LoHa & LoKR: LoRA variants that decompose the weight update into two smaller sets of matrices, recombined via the Hadamard product and Kronecker product respectively. They take longer to train, and are more oriented towards generalized training.
DyLoRA: Short for Dynamic LoRA, this is a LoRA implementation that allows the Rank to change dynamically, but is otherwise a normal LoRA.
GLoRA: Short for Generalized LoRA, this is an implementation that is made for generalizing diverse datasets in a flexible and capable manner.
iA3: Instead of training low-rank matrices like most LoRA types, iA3 trains learned scaling vectors, resulting in a very efficient training method. Results are similar to (and seemingly sometimes a bit better than) a normal LoRA, in a much smaller package.
Diag-OFT: This implementation "preserves the hyperspherical energy by training orthogonal transformations that apply to outputs of each layer". I genuinely have no clue what this is supposed to mean, but I think it better preserves the base model's original understanding of elements that are coincidental to the training (like backgrounds and poses). I could be completely wrong, though; let me know if I am. It also apparently converges (trains) faster than a standard LoRA.
Native Fine-Tuning: Also known as dreambooth, which we aren't focusing on and will ignore for this guide. The LyCORIS implementation allows it to be used like a LoRA, but it produces very large files.
"So, what should I use?"
I would personally say each has their own uses, so I've categorized them semi-generally. I'm still not super knowledgeable about their intricacies, but I've largely based these on their implementation notes and documentation. What you choose is up to you and entirely based on your needs.
LoCON, DyLoRA, iA3, Diag-OFT
LoCON, LoHa, LoKR
LoCON, LoHa, LoKR, GLoRA
LoCON, GLoRA, iA3
Benefits, Drawbacks & Usage Notes:
LoCON:
Affects More Model Layers
Slightly Larger Files
Basically Just A LoRA, But Better
Dim <= 64 Max, 32 Recommended
Alpha >= 0.01, Half Recommended (When not using an adaptive optimizer)
LoHa & LoKR:
Good With Multi-Concept Training
Good With Generalization
Longer Training Times
Bad With Highly Detailed Concepts
Can Be Hard To Transfer
Dim <= 32
Alpha >= 0.01, Half Recommended (When not using an adaptive optimizer)
Small: Factor = -1
Large: Factor = ~8
DyLoRA:
Automatically Finds Optimal Rank
Longer Training Times
Otherwise Just A LoRA
Use with large (~128) Dim, Half Alpha (When not using an adaptive optimizer)
Use Gradient Accumulation
Batch Size of 1 Max
GLoRA:
Very Good At Generalization (Styles & Concepts)
Shorter Training Times (?, To Test)
Not Very Good At Training Non-Generalized Subjects
iA3:
Very Small File Sizes
Generally Performs Better Than LoRA
Good With Styles
Can Be Hard To Transfer
Use with High LR (When not using an adaptive optimizer), official implementation recommends 5e-3 (0.005) ~ 1e-2 (0.01)
Diag-OFT:
Faster Training Time
Better Preserves Coincidentals (?, Unsure)
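For reference, when training through kohya's sd-scripts, the LyCORIS method is typically selected via `network_args`. This fragment is illustrative only — dataset paths, model paths, and the other required flags are omitted, and the dim/alpha values simply follow the recommendations above:

```shell
# Illustrative fragment: selecting a LyCORIS algorithm in kohya sd-scripts.
# Swap "algo=locon" for loha, lokr, dylora, glora, ia3, or diag-oft as needed.
accelerate launch train_network.py \
  --network_module=lycoris.kohya \
  --network_dim=32 \
  --network_alpha=16 \
  --network_args "algo=locon" "conv_dim=32" "conv_alpha=16"
```

For LoKr you would additionally pass a `"factor=..."` entry in `network_args`, per the Small/Large factor notes above.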
Part 8 | Advanced Training: Styles & Themes
So, you want to train a style of some kind. Regardless of what it is, for broader concepts a LyCORIS is the tool for the job, but unlike a LoRA, there are several kinds of LyCORIS to choose from. If you skipped Part 7, I recommend a LoCON, GLoRA, or iA3.
Once you've chosen your type, make sure your rank is set to 32 or lower. LyCORIS seems to have problems above certain thresholds (though I believe you can go as high as 64); 32 is the generally agreed-upon maximum before issues start.
Now that that's out of the way, you should start building a dataset, just like before. However, style trainings benefit much more from larger datasets, so instead of the 15-50 range from before, look to get around 50-200; in my experience, 125-150 is a good place to be.
Once you've got your images, start tagging. You can generally tag the same way as before, but keep in mind that you want the style, not a character or article of clothing. You should especially be sure to tag backgrounds, clothing, and any other key element.
After tagging, you're good to start training. In my experience, styles need less repetition than a character LoRA: while I recommend ~100 repeats for a LoRA, these are usually okay with ~30-40 repeats. Your mileage may vary, given the size and composition of your dataset.
Changelog:
Updated experimental settings, added more details to part 4, & added a brief section regarding some new findings to part 2.
Added part 2.5, a subsection regarding Nightshade and other AI "poisons".
Moved part 7 to part 8 & removed LyCORIS explanation.
Added (new) part 7, going more in-depth on LyCORIS.
Tweaked some experimental parameters.
Tweaked experimental settings & added some explanations to some values.
Added Q&A questions.
Expanded on the "scale weight norms" value in part 4.
Corrected sections regarding minsnr and zsnr to differentiate them correctly.
Tweaked "additional parameters": Value no longer required.
Added experimental settings to part 4.
Changed title to include LyCORIS.
Added Q&A question.
Correction of more grammar errors.
Slightly expanded Part 1 & 2.
Added section covering implied tags to Part 3.
Added minor elaborations to some areas.
Correction of minor grammar errors in parts 3 & 4.
Added new Q&A questions.
Added parts 6 & 7, covering Multi-concept and Style training respectively.
Added part 3.5 for tagging examples, added two to begin with.