After copy-pasting a guide I wrote on Discord several times, I think it's time I consolidated and expanded on it here on Civit. With so many guides out of date and/or containing incorrect information, I hope this will be helpful to aspiring and current trainers alike.
Keep in mind this is based on my own workflow and experiences! It's by no means perfect, and I don't plan to make this an absolute encyclopedia, either. Just a good ol' crash course to get your feet wet.
I am also going to assume you have a system capable enough for local training. (Though this theoretically should be mostly applicable to training in other environments.)
That being said, this guide will assume you haven't even gathered a dataset yet, so let's dive in!
Part 1 | Datasets: Gathering & Basics
Your dataset is THE MOST IMPORTANT aspect of your LoRA, hands down. A bad dataset will produce a bad LoRA every time, regardless of your settings. Garbage data in gives garbage data out!
Ideally, a good LoRA will be trained on 30 images at a minimum, but you can work with more or fewer. IMO, more than 50 is overkill, and going below 15 can make things overly difficult, but I've trained with as few as 8 before. Staying near that 30-image sweet spot is ideal, though. There is a method to artificially expand your dataset, but we'll touch on that later.
When assembling your images, go for quality over quantity. A well-curated set of 30 images can easily outperform a set of 100 poor-to-mediocre ones. In smaller datasets especially, a single "bad" image can skew the entire model. That being said, bad images CAN be used to pad out a dataset, but they should be tagged properly (such as with "colored sketch", which will be talked about later).
You should also ensure your images are varied. Too many images in the same or similar styles will bake that style into your concept, making changing the style exceptionally difficult and biasing any style changes you do make. Be especially careful when dealing with lots of screenshots and renders. If you do have a significant amount, tagging them with the artist that made them, as a render, etc., can help tie the style to another tag and reduce the impact.
I would also recommend avoiding fetish-themed images when working with characters (unless you want that out of your LoRA), as even when tagged, their often extreme anatomy can skew your model worse than if they weren't present at all. You can of course use them to expand your dataset if you truly need to, but make sure they are tagged thoroughly.
Personally, I gather my data from a variety of sites: e621, FurAffinity, DeviantArt, Pixiv, and the Monster Hunter wiki (and game wikis in general) are my common sources. Again, try to avoid pulling too much from the same artist or similar styles. Google image search is also worth a look if you need more data, as it can often find isolated instances from Reddit, Steam community feeds, and other sites you may not have thought of looking through.
Pixiv is a godsend for Monster Hunter and other eastern franchises as a JP art site, but finding specifics can be difficult at times, as you need to use Japanese text to guarantee your search. Thankfully, a number of wikis also include Japanese names where applicable.
As you gather your images, consider what it is you want to train: A character? A style? An object? Some clothing?
For most of these, you should look for solo images of the subject in question. Duos also work, but you should only grab them if extra individuals can be easily removed or cropped out. Duo/trio/group images are still OK to use if you really need the images, but if they can't be cleaned they should be avoided where possible. If you do include them, make sure they are properly tagged and kept to a minimum.
If you're planning on training a style, know that those are more advanced, and are better suited to LyCORIS than LoRA. This guide will still be largely applicable, but check the later parts for details specific to them.
Once you have your images, place them in a folder for preparation in the next part. (For example, "Training Folders/Concept Folder/Raw")
Part 2 | Datasets: Preparation
Once you have your raw images from part 1, you can begin to preprocess them to get them ready for training. You will need a photo editor program handy, I recommend Photopea as a free web alternative to photoshop.
Personally, I separate my images into two groups: Images that are ok on their own, and images that require some form of editing before use. Those that meet the below criteria are moved to another folder and then edited accordingly.
Take a look at the extension of your images. .webp images (usually pulled off wikis) are incompatible with current trainers, and must be converted to PNG or JPEG. While you do that, note that images with transparent backgrounds also cause issues. These should be brought into your image editor of choice; just giving them a white BG and saving will suffice.
Next, consider your training resolution. Higher resolutions let you get more detail out of an image, but will slow your training time. Most people still train at 512 or 768, but I train on Fluffyrock, which supports higher native resolutions, and train at 960 or 1088. Any image larger than your resolution will be scaled down automatically. Resolution will be detailed further later.
Once you have an idea of your resolution, take a look at your dataset. Keep in mind that non-square images will be scaled to maintain their proportions, so having lots of empty background can be detrimental to getting details. Wide and tall images with lots of empty background can be cropped to focus on the subject.
Additionally, if you have hi-res images with multiple depictions of your subject (like a reference sheet), you can crop the image into multiple parts so it's trained as several images rather than one overly compressed one. Such images can also be trained without cutting and cropping; just be sure to tag them with "multiple views" and "reference sheet" later on.
Images with people other than the subject should be edited to have them removed if possible, be it via cropping or removal.
If possible, watermarks, links, text/speech bubbles, and signatures should be removed. You should remember that AI learns off of repetition, and the same signature in the same corner will be something it tries to hold on to. It's alright if a handful still have them, but ideally you want as few as possible. Since repetition is key, outliers are less likely to stick. The magic eraser tool is very useful for any of these that aren't on a flat color background.
If you have images smaller than your training resolution, consider upscaling them. Upscalers like 4x_Ultrasharp are great for this.
If you have images larger than 3k pixels, downscale them to 3k or less. Apparently, the Kohya trainer has some minor issues handling very large images - I'm not entirely sure why, but downscaling oversized images in my dataset showed some improvement.
Lastly, if you have a subject with asymmetrical details (like a marking, logo, single robot arm, etc), make sure it is facing the same way in each image. Images incorrectly oriented should be flipped for consistency.
Once you've done the above to your entire dataset, place the images in a new folder alongside your "raw" folder, like so: "Training Folders/Concept Folder/X_ConceptName". The 'ConceptName' will be your trigger tag, and the 'X' will be the number of repeats of that folder per epoch, which will be detailed later. It should look something like "1_Hamburger".
Part 2.5 | Datasets: Curing Poison
Since Nightshade especially is getting a lot of traction right now, I figure I'll put a section here covering "poisoned" images. You won't run into these too often, but it's quite possible as they increase in popularity.
The purpose of image poisoning tools like Glaze and Nightshade is to add "adversarial noise" to an image, which disrupts the learning process by effectively adding insane outliers and obscuring the original data it would train on. As such, including a poisoned image in your dataset can result in strange abnormalities, be it color variations, distorted anatomy, etc. The more poison you have, the worse the effects will be. Ideally, you don't have any - but you CAN still use them.
These "poisons" have a hilarious weakness - the very noise they're introducing. By simply taking the desired image, and putting it through an AI upscaler good at denoising (like with jpeg artifacts or the sort), or even just a general upscaler, you can just... strip away the poison. It's that easy, usually. People are still experimenting with the "best" methods for removal, but frankly, especially with Nightshade, pretty much any method can clean the image to a usable state.
"Smoothing" or "Anti-artifact" upscalers work best for the job, used with one of the two following methods:
A: Just upscale it. 2x is usually fine.
B: Downscale to half or 3/4 size, then upscale with AI. Works best with already large images, small resolution images would lose too much detail.
Alternatively, "adverse cleaner" can do a decent job, and exists as an extension for A1111 or as a HF Space. Combined with the upscaling methods above, you can effectively neutralize the "poison" entirely.
"But how do I recognize a poisoned image?"
It depends on how aggressively the work was poisoned. If it looks like shit because a 3yr old seemingly put a silly camera filter on it, it has some pretty obvious artifacting, or the entire image looks covered in JPEG compression artifacts, it's poisoned 9 times out of 10. Less aggressive poisons are harder to detect, but have less of an impact on your training. If you're unsure, take a close look at it in a photo editor, and/or just run it through the cleaning methods above to be safe.
As a general note, the individuals who put hyper-aggressive poisons on their work are usually delusional enough their art isn't even worth using in the first place - Self-respecting artists generally keep the poison minimal to not affect their work visually in any major way, or just don't use it. If you don't feel like dealing with poison, pull your data from older images if you want to be wholly safe, or just learn to identify poisons and skip by them.
Part 3 | Datasets: Tagging
Almost done with the dataset! We're in the final step now, tagging. This will be what sets your trigger tag(s), and will affect how your LoRA is used.
There are a variety of ways to do your tagging, and a multitude of programs to assist with tagging or do auto-tagging. However, in my opinion you shouldn't use auto-tagging (especially on niche designs and subjects), as it creates more work than it saves.
Personally, I use the Booru Dataset Tag Manager and tag all of my images manually. You COULD tag without a program, but just... don't. Manually creating, naming, and filling out a .txt for every image is not what you want to do with your time.
Thankfully, BDTM has a nice option to add a tag to every image in your dataset at once, which makes the beginning of the process much easier.
Before you tag, you need to choose a model to train on! For the sake of compatibility, I suggest you train on a base model, meaning a base release or finetune that is NOT a mix of other models. Training on a mix is still viable, but in my experience makes the outputs less compatible with anything other than that model.
Now, for the tagging itself. Before you do anything, figure out what type of tags you'll be using:
Currently, there are three prompting styles: natural language prompting, booru tag prompting, and e6 tag prompting. Which you should use depends on your model's "ancestry."
Base SD1.5, base SDXL, and (currently) most SDXL models use natural prompting, ex: "A brown dog sleeping at night, they are very fluffy." It sometimes works on other models, but is not recommended. SDXL is also an absolute pain to train.
The vast majority of Anime models you see use Booru prompting, specifically using the tag list from Danbooru, an anime image board. I hear Anything v4.5 is a good choice.
Models with an ancestry based on furry models use e6 prompting, using the tag list from e621. Fluffyrock or BB95 is a good choice here.
Once you know what model and tags you're using, you can start tagging.
Your FIRST tag on EVERY image should be your activation tag, aka what you named X_ConceptName, in this case "conceptname". If your model already has your subject even remotely trained to that tag, consider changing your activator to a string the model wouldn't know. For example, "hamburger" could become "hmbrgrlora". This isn't always required, but if you see wacky results that stem from the model's original interpretation, you might want to do so.
My process works something like:
Add to all images at once: Activator tag (usually species), anthro/feral/human, gender (if applicable, ferals not specified), controllable elements (ie. a character-specific outfit), nude, other common controllables (like most common eye color).
Move to first image; Remove if needed: controllable elements. Change if needed: nude (to general outfit tag(s)), eye color, etc.
Add tags you would consider to be "key" elements to the image: Specific mediums (like watercolor), compositionals (ie three-quarter portrait), etc.
Add tags to describe deviated aspects: huge/hyper breasts, horns/scales/skin of varied color, etc.
Repeat for each image.
That being said, don't go overboard with your tags. If you use too many, you'll "overload" the trainer and get less accurate results, as it's trying to train to too many tags. It's generally best practice to only tag items you would consider a "key element" of the image. Undertagging is better than overtagging, so if in doubt keep it minimal. I usually have ~5-20 tags per image, depending on their complexity.
Backgrounds and poses can often be ignored, but if you have specific kinds of locations/poses/BGs in a significant number of your images, you should tag them to prevent biasing.
For example, if you have a lot of white backgrounds, you should tag "white background". If after a training you see a specific pose being defaulted to, you should find all instances in your dataset using that pose and tag it.
You should also be wary of "implied" tags. These are tags that imply other tags just by their presence. When using such a tag, you shouldn't also include the tag(s) it implies. For example, "spread legs" implies "legs", "german shepherd" implies "dog", and so on. Having tags that are implied by another spreads the training between them, weakening the effect of your training. In large quantities, this can actually be quite harmful to your final results.
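This cleanup is mechanical enough to script if you want to sanity-check a tag list. A sketch with a tiny hypothetical implication map (a real map would come from your booru's implication data):

```python
# Tiny hypothetical implication map: tag -> tags it implies.
IMPLIES = {
    "spread legs": ["legs"],
    "german shepherd": ["dog"],
}

def strip_implied(tags, implies=IMPLIES):
    """Drop any tag already implied by another tag in the same list."""
    redundant = set()
    for t in tags:
        redundant.update(implies.get(t, []))
    return [t for t in tags if t not in redundant]
```

For instance, running this on ["spread legs", "legs", "german shepherd", "dog"] keeps only "spread legs" and "german shepherd".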
Tagging low-quality images: Sometimes, you just don't have a choice but to use poor data. Rough sketches, low-res screenshots, bad anatomy, and others all fall into this category.
Sketches can usually be tagged "colored sketch" or "sketch", which usually is all you need to do. If uncolored, "monochrome" and "greyscale" are usually good to add, as well.
Low resolution images should be upscaled with an appropriate upscaler if possible, such as one of the many made to upscale screenshots from old cartoons, for example. If you can't get a good upscale, use the appropriate tag for your model to denote the resolution quality.
Bad anatomy should be tagged as you see it, or cropped out of frame if possible. Images with significant deviations that can't be cropped or edited, like the neck/head/shoulders being off-center, are usually best left out of the dataset entirely.
Once you've tagged all your images, make sure you've saved everything and you'll be good to go for the next step.
Part 3.5 | Tagging: Examples
Since examples are usually quite helpful, I'll put a handful of examples from my own datasets here for your own reference. Keep in mind: I usually train on fluffyrock, a model that uses e6 tagging. Other models should swap tags to their own variants where required. (ex: side view (e6) > from side (booru))
mizutsune, feral, blue eyes, bubbles, soap, side view, action pose, open mouth, realistic, twisted torso, looking back, white background
White backgrounds were more prevalent in this dataset, so the background was tagged.
arzarmorm, human, male, black hair, brown eyes, dark skin, three-quarter view, full-length portrait, asymmetrical armwear, skirt, pouches, armband, pants
In this case, the model wasn't cooperating with just the trigger tag alone, so the tags "asymmetrical armwear, skirt, pouches, armband, pants" were used as reinforcement, which also detached them from the main concept, allowing them to be controlled individually.
This LoRA also had very few instances of white backgrounds, so leaving it untagged was a non-issue.
Part 4 | Training: Basics
Now that you have your dataset, you need to actually train it, which requires a training script. The most commonly used script, which I also use, is the Kohya-SS GUI. While other options exist, for the sake of compatibility I'll stick with Kohya as a frame of reference.
Once you have it installed and open (Install is actually quite easy.), make sure you navigate to the LoRA tab at the top (it defaults to dreambooth, an older method.)
There are a lot of things that can be tweaked and changed in Kohya, so we'll take it slow. Assume that anything I don't mention here can be left alone.
Yellow text like this denotes alternative, semi-experimental settings I'm testing. Feel free to give feedback if you do use them, but if you're looking for something stable, ignore these. These settings will change frequently as I test and train with them. Once I'm happy with a stable setup incorporating them, they will be adopted into the main settings.
Firstly, you'll find yourself in the "Source Model" tab.
Click on "model quick pick" and select "custom".
"Save trained model as" can stay as "safetensors". "ckpt" is an older, less secure format. Unless you're purposefully using an ancient pre-safetensor version of something, ckpt should never be used.
In "Pretrained model name or path", input the full file path to the model you'll use to train.
Underneath that, there are 3 checkboxes:
v2: Check if you're using an SD 2.X model.
v_parameterization: Check if your model supports V-Prediction (VPred).
SDXL Model: Check if you're using some form of SDXL, obviously.
Next, move to the "Folders" tab.
"Image folder" should be the full file path to your training folder, but not the one with the X_. You should set the path to the folder that folder is inside of. Ex: "C:/Training Folders/Concept Folder/".
"Output folder" is where your models will end up when they are saved. Set this to wherever you like.
"Model output name" will be the name of your output file. Name it however you like.
Next, move to the "parameters" tab, which will put you on the "basic" subtab.
"Lora Type" should be kept as standard.
"Lora Type": LyCORIS/LoCon
Now that a1111 and other UIs have built-in LyCORIS support, there is really no downside to training as a LyCORIS over a standard LoRA (that I know of).
While they have slightly larger file sizes, being able to affect more of the base model more than makes up the difference.
"LyCORIS Preset": Full
"Train Batch Size" is how many images will be trained simultaneously. This can speed up your training, but can cause less accurate/more generalized results, and isn't always beneficial. I usually keep this at 1, but never go higher than 4. This will also increase your vram usage substantially.
"Epoch" is the value you will be changing the most out of everything. Remember the "X_" on your training folder? This is where it's important. A single epoch, in steps, is the number of images you have multiplied by the "X_" number. What you set this value to depends on your dataset, but as a rule of thumb I start with a number that has each image trained 100 times. If your folder is 1_, this would be 100 epochs; if your folder is 10_, it would be 10 epochs. Both total the same number of steps. While not perfect, it makes a good starting point. Some concepts need less, some need more. It will be up to you to test your resulting output LoRA and see where it stands.
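The arithmetic above can be sketched out like so (batch size 1 assumed; the function names are just mine for illustration):

```python
def total_steps(images: int, repeats: int, epochs: int) -> int:
    """Total training steps at batch size 1: images * folder repeats * epochs."""
    return images * repeats * epochs

def epochs_for_passes(passes_per_image: int, repeats: int) -> int:
    """Epochs needed so each image is trained `passes_per_image` times."""
    return passes_per_image // repeats

# 30 images, aiming for each image trained ~100 times:
assert epochs_for_passes(100, 1) == 100   # a "1_" folder needs 100 epochs
assert epochs_for_passes(100, 10) == 10   # a "10_" folder needs only 10
assert total_steps(30, 1, 100) == total_steps(30, 10, 10) == 3000
```

Either folder setup lands on the same 3000-step total, which is why the two are interchangeable.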
"Save every n epochs" saves a copy of your LoRA every N epochs as training progresses. These checkpoints can be useful to go back to and see where your sweet spot might be.
"Mixed precision" and "Save precision" should both be set to the same value. "fp16" holds more data, but is slower to train. "bf16" holds slightly less data, but is faster to train. Choose based on your needs, but I stick with fp16 as it is better for more complex designs.
"Cache latents" and "Cache latents to disk": These affect where your data is loaded during training. If you have a recent graphics card, "cache latents" is the better and faster choice which keeps your data loaded on the card while it trains. If you're lacking VRAM, the "to disk" version is slower but doesn't eat your VRAM to do so.
"Optimizer": There are a number of options to choose from, but the four worth using IMO are Prodigy, DAdaptAdam, AdamW, and AdamW8bit. Prodigy is the newest, easiest to use, and produces exceptional results. The AdamW optimizers are quite old, but with fine tuning can produce results better than prodigy in a faster time. For the purposes of this guide, we'll be using Prodigy. (DAdaptAdam is very similar to Prodigy, and these settings should be largely applicable to it, as well. It has a less aggressive learning method, so if you're having issues with Prodigy try this out.)
"LR Scheduler": When using Prodigy/DAdapt, use only Cosine. When using an Adam opt, Cosine With Restarts is usually best. Other schedulers can work, but affect how the AI learns in some pretty drastic ways, so don't mess with these until your understanding of them is better.
"Optimizer extra arguments": If using Prodigy/DAdapt, set to "decouple=True weight_decay=0.1 betas=0.9,0.99", otherwise, leave empty.
weight_decay=0.15 seems to be a bit better for my Prodigy workflow?
"Learning Rate": When using Prodigy/DAdapt, set this to 1. Prodigy and DAdapt are adaptive and set this automatically as it trains.
"Max resolution": For most models, you'll want this set to 768,768. Models that allow for larger native generation (like SDXL for example) can use larger values like 1024,1024. You should not set this to be larger than your model can generate natively. Less powerful cards can train at 512,512, but will have reduced quality.
"Enable buckets": True. This groups similarly sized images together during training. This is meant for batch training, but doesn't hurt to keep on.
"Max bucket resolution": Should be the same as your training resolution, 768 in this example. Any image larger than this size will be scaled down to fit it.
"Network Rank & Network Alpha": These affect how large your file will be and how much data it can store. What you set this to will be dependent on your subject. If you're training something similar to what your model already knows (like an anime girl on an anime model) a Rank/Alpha of 8/8 will probably work. For most cases though, 32/32 is a good starting point. While you can go up to 128/128, that is absolute overkill that just bloats your file and in some cases can make your training results worse. Generally, you shouldn't need to go higher than 64. Your Alpha should be kept to the same number as your Rank, in most scenarios. Adaptive optimizers like Prodigy and DAdapt should set their alpha to 1.
"Convolution Rank & Alpha": Rank of 16 w/ an alpha of 1. Going higher than 16 seems to give diminishing returns, and may actually harm outputs.
"Scale weight norms": This lets your LoRA work well with other LoRAs in tandem.
If you plan to use your LoRA with other LoRAs, set this value to 1.
If your LoRA will likely only ever be used on its own, leave at 0.
Weights that get too "heavy" are scaled down, reducing their impact. This allows multiple LoRAs to work in tandem by not fighting over values, but depending on your concept, it CAN negatively affect your final outputs.
Setting to values higher than 1 will reduce the impact, but also reduce cross-compatibility.
The scaling seems to have significantly less of a negative impact on LyCORIS training, given the learning is spread over more weights. Can usually be kept at 1 without worry.
The three "dropout" values beside it I keep at 0.
Now that those are set, we can move to the "advanced" subtab. We won't touch much here.
"Additional parameters": If your model supports zSNR, use "--zero_terminal_snr".
Vpred models can also use "--scale_v_pred_loss_like_noise_pred". After a recent update, this parameter is no longer required; new and updated Kohya installs can remove/ignore it.
"Clip skip": Should be set to the clip skip value of your model, most anime models use 2, most others use 1. If you're unsure, most civit models note the used value on their page.
"Gradient Checkpointing": Check if using an Adam opt. I haven't seen much of an improvement when used with Prodigy, but this usually is a slight performance boost. Has no effect on output quality.
"Persistent Data Loader": This option keeps your images loaded in-between epochs. This eats a LOT of your VRAM, but will speed up training. If you can afford to use it, use it.
"Flip Augmentation": This allows you to essentially double your dataset by duplicating your images and mirroring them horizontally during training. This can be especially useful if you have few images, but DO NOT use this if you have asymmetrical details that you want to preserve.
"Min SNR Gamma": 5 is a known good value.
"Noise offset type": Original
"Noise offset": Personally, I keep this at 0, as in my experience anything else has always gotten worse results.
"Debiased Estimation Loss": True
Seems to help with color deviation, supposedly makes training need fewer steps?
And that's everything! Scroll to the top, open the "configuration file" dropdown, and save your settings with whatever name you'd like. Once you've done that, hit "start training" at the bottom and wait! Depending on your card, settings, and image count, this can take quite some time.
Part 5 | Q&A
This section is reserved for tips, tricks, and other things I find handy to know that don't quite fit elsewhere. I'll try and update this periodically.
Q: I see in a lot of guides to train to 2000 steps or something similar, but you go by epochs. Why?
A: Due to the inconsistency in the size and quality of datasets, steps end up being a completely arbitrary means of measurement. Epochs, at least, count full folder repetitions, and make for a better means to measure training time. Many of those guides are also training very easy concepts, which the training will pick up on faster than others. Don't be worried if you massively overshoot that step count.
Q: I see other guides saying to set your Network Alpha to half of the Rank, why don't you?
A: That is a fairly old misconception that still gets thrown around a lot. Alpha functionally acts as a means to scale your learning rate: setting it to half your Rank is effectively half the learning rate. It doesn't hurt to have it at half or even lower, but you will likely need a longer training.
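Put another way, the alpha/rank ratio acts as a multiplier on your effective learning rate. A quick sketch of that relationship (the function is just illustrative):

```python
def effective_lr(base_lr: float, alpha: float, rank: int) -> float:
    """LoRA updates are scaled by alpha/rank, so alpha effectively scales the LR."""
    return base_lr * (alpha / rank)

assert effective_lr(1e-4, 32, 32) == 1e-4  # alpha == rank: no change
assert effective_lr(1e-4, 16, 32) == 5e-5  # alpha == rank/2: half the learning rate
```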
Q: My training script is showing a loss value that keeps changing as training goes, what is it?
A: For most cases, you don't need to worry about loss, nor should you obsess over specific values or ranges. The only time you should pay attention is if it sits around a certain range for most of the training, only to make a massive change later on. That's a sign something may have gone wrong, or it started to overtrain.
Q: How do I tell if my LoRA is under/overtrained?
A: Both should be fairly obvious, even to the untrained eye. If you're undertrained, you'll likely see "mushy" or incomplete details, or a very low adherence to details. If you're overtrained, you may have odd, over-saturated colors, style biasing, pose biasing, etc. These will vary depending on your dataset, so keep an eye out.
Q: You briefly talked about fp16 and bf16, but what are the "full" versions I'm seeing?
A: "Standard" fp/bf16 use mixed precision, while the "full" versions don't. It's misleading, but the full versions hold less data, and in my experience are downright worse. I'm sure they have their uses, but in most cases you're perfectly fine in staying with mixed precision.
Q: I keep seeing mentions of "Vpred", what exactly is it?
A: Vpred, V-Prediction, and V-Parameterization are all the same thing. While I don't fully understand it at a technical level, as far as I'm aware it is an optimization to the noise schedulers that "predicts" outputs during image generation, allowing a final result to be generated in fewer steps.
Q: What is Min SNR? zSNR? Zero Terminal SNR? Are they the same, or different?
A: No, while similar, they do rather different things. To keep it simple, zSNR (Zero Terminal SNR) is a technique that allows for the AI to generate using a wider color space, including perfect blacks. Think of it like the difference between a normal monitor and a HDR OLED monitor. Min SNR is a method of accelerated training convergence, which allows models to train in fewer steps.
Q: Could I train at a resolution higher than what my training model can do?
A: Can you? Yes. Should you? No. While higher resolutions normally trade speed for quality, in this case you would be trading speed for worse results. Without getting technical, training larger than your model can handle is not good for your outputs.
Q: You mentioned not to "overtag" your images, but how many is too many?
A: This will really depend on your dataset and training settings. Longer trainings can help with overtagging, but run a greater risk of overtraining. Generally, try and keep your per-image total to 20 or below on average, but having outliers with more isn't the worst. Try and avoid tags that aren't important to the image (unless you're finding that the results are clinging to something too much, in that case tag it), and tags that your model has little to no knowledge of. Empty tags are seen as training targets, and will try to be filled. If filled with the wrong data, you can end up with seemingly random tags being required to get the intended result.
Q: What's the difference between a LoRA and a LyCORIS? Are they even different?
A: Every LyCORIS is a LoRA, but not every LoRA is a LyCORIS. LyCORIS specifically refers to a subset of newer LoRA training methods (LoCon, LoHa, LoKR, DyLoRA, etc.). All of these are still LoRAs, but their new methodologies make them structurally different enough to have their own designation. Now that most GUIs have built-in support for them, to an end user they functionally make no difference in their usage. LoRA on its own simply refers to the original method.
Q: My LoRA kinda works, but has very strange, distorted anatomy at times. What happened?
A: More often than not, distorted anatomy originates from your dataset. Look it over for images that are similar to the distortions you are seeing. Uncommon poses, strange camera angles, improperly tagged duo/group images, and other outliers can be likely causes. Try tagging what's applicable, but it's usually best to remove the image entirely or crop out the parts causing issues, if possible.
Q: I've heard a bit about single-tag training, what is it?
A: Training with a single tag is a very old method commonly used by beginners who don't want to spend time tagging. When training to a single tag, the AI will "homogenize" everything it learns from an image into the tag, resulting in highly generalized outputs. This will only even begin to work if every image is of a specific subject (like a character), and has a very high likelihood of latching on to specific backgrounds, poses, and other unwanted variables. If used with anything else that isn't repetitive, you'll end up with what is effectively digital mush. I would not recommend this for any application.
Part 6 | Advanced Training: Multi-Concept LoRA
So you've got your feet wet, and want more of a challenge? Or maybe you've got a character with many outfits? Gender-specific armor? That's where multi-concept training comes in.
The actual training settings for these are almost exactly the same compared to normal LoRAs, with a few caveats:
Do not use a batch size higher than 1. If images from multiple concepts get loaded, they'll generalize into mush, or you'll have one overpower the other.
Be careful with using flip augmentation, as it will apply to every image, not just one concept.
Depending on how many concepts you're training, and how complex they are, you may want to increase your Rank and Alpha values. I recommend trying 32 first and seeing how it performs.
Now, gather your images the same way I detailed before, but separate them based on their concepts (outfits, armors, etc). Any editing, too, should be done like before.
Once you've fully prepared your data, figure out which concept has the most data, and in your concept folder, create a 1_conceptname folder for it.
Now, do the same with your other concepts, obviously replacing "conceptname" with their activator tag.
Once you have your folders named and filled, do the following:
Take the number of images in your largest folder and multiply it by its "X_" repeat value; this is the per-epoch step count every other concept should match. (Multiplying that by your intended number of epochs gives your total step count: (images*folder repeats)*epochs = steps.)
Now, divide that per-epoch count by the number of images in your second largest folder. The result, rounded to the nearest whole number, is what that folder's "X_" should be changed to.
Repeat this for every applicable folder.
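The balancing arithmetic above can be sketched in a few lines. Folder names and image counts here are hypothetical examples; the idea is simply that images × repeats comes out roughly equal for every concept, per epoch:

```python
def balanced_repeats(image_counts, base_repeats=1):
    """Given images-per-concept, return the "X_" repeat value for each folder
    so that images * repeats is roughly equal across all concepts."""
    # Per-epoch step count of the largest folder; everything else matches it.
    target = max(image_counts.values()) * base_repeats
    return {name: max(1, round(target / n)) for name, n in image_counts.items()}

# Hypothetical three-outfit dataset:
counts = {"outfit1": 30, "outfit2": 14, "outfit3": 7}
for name, r in balanced_repeats(counts).items():
    print(f"{r}_{name}: {counts[name] * r} steps per epoch")
```

Note that the rounding means the balance is approximate; that's fine in practice, since exact parity matters far less than avoiding a large imbalance.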
Why do this, you ask?
We do this to balance the dataset. If you keep everything the same, the folder with the most images will dominate the training, leaving the other concepts with only a fraction of it. Balancing ensures every concept gets equal training time, so no single concept dominates while the rest undertrain.
Keep in mind, however, that if a concept folder has very few images, that individual concept can overbake even while the rest of the LoRA is fine. This becomes a bigger issue the larger the discrepancy between it and the largest folder.
Now that your folders are balanced, we should look at how you name them, and what your activator tag for each will be.
If you're training a character with multiple outfits, name your folders like "1_charactertag, outfittag". Your first two tags should be those, in that order.
If you're training something not tied to a character, like gendered armor, I usually just create a tag for each version. For example, "armortagm" and "armortagf" for males and females respectively. Just like before, these should be the first tag on their respective images.
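Putting the naming and balancing together, a hypothetical multi-outfit character dataset might end up looking something like this (folder names, repeat counts, and image counts are examples only):

```
img/
├── 1_charactertag, outfit1tag/   <- 30 images, largest folder
├── 2_charactertag, outfit2tag/   <- 14 images (30/14 ≈ 2.1, rounded)
└── 4_charactertag, outfit3tag/   <- 7 images  (30/7 ≈ 4.3, rounded)
```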
Now that your names and activator tags are settled, you can start tagging! This can be done just like a normal LoRA; you've just got a whole lot more images to go through.
And that's it! Once you've tagged, you can train just like before. You'll likely have much longer training times, given the increase in images, but in the end you'll have multiple concepts in a single LoRA to use as you please.
Part 7 | Advanced Training: LyCORIS & Its Many Methods
LyCORIS gets more advanced by the day, and as it becomes more common, I feel it's best to have a section talking about it. This will be slightly more technical than the rest, but I'll try to keep it to the "need-to-know" stuff.
LoCON: A LoRA that also affects the convolution layers of the base model, allowing for more dynamic outputs.
LoHa & LoKR: LoRA variants that decompose the weight update into two smaller sets of matrices, recombined via the Hadamard product and Kronecker product respectively. They take longer to train, and are more oriented towards generalized training.
DyLoRA: Short for Dynamic LoRA, this is a LoRA implementation that allows the Rank to change dynamically, but is otherwise a normal LoRA.
GLoRA: Short for Generalized LoRA, this is an implementation that is made for generalizing diverse datasets in a flexible and capable manner.
iA3: Instead of training low-rank matrices like most LoRA types, iA3 trains learned scaling vectors, resulting in a very efficient training method. Results are similar to (and seemingly sometimes a bit better than) a normal LoRA, in a much smaller package.
Diag-OFT: This implementation "preserves the hyperspherical energy by training orthogonal transformations that apply to outputs of each layer". I genuinely have no clue what this is supposed to mean, but I think it better preserves the base model's original understanding of elements that are coincidental to the training (like backgrounds and poses). I could be completely wrong, though; let me know if I am. It also apparently converges (trains) faster than a standard LoRA.
Native Fine-Tuning: Also known as dreambooth, which we aren't focusing on and will ignore for this guide. The LyCORIS implementation allows it to be used like a LoRA, but it produces very large files.
"So, what should I use?"
I would personally say each has their own uses, so I've categorized them semi-generally. I'm still not super knowledgeable about their intricacies, but I've largely based these on their implementation notes and documentation. What you choose is up to you and entirely based on your needs.
LoCON, DyLoRA, iA3, Diag-OFT
LoCON, LoHa, LoKR
LoCON, LoHa, LoKR, GLoRA
LoCON, GLoRA, iA3
Benefits, Drawbacks & Usage Notes:
LoCON:
Affects More Model Layers
Slightly Larger Files
Basically Just A LoRA, But Better
Dim <= 64 Max, 32 Recommended
Alpha >= 0.01, Half Recommended (When not using an adaptive optimizer)
LoHa & LoKR:
Good With Multi-Concept Training
Good With Generalization
Longer Training Times
Bad With Highly Detailed Concepts
Can Be Hard To Transfer
Dim <= 32
Alpha >= 0.01, Half Recommended (When not using an adaptive optimizer)
Small: Factor = -1
Large: Factor = ~8
DyLoRA:
Automatically Finds Optimal Rank
Longer Training Times
Otherwise Just A LoRA
Use with large (~128) Dim, Half Alpha (When not using an adaptive optimizer)
Use Gradient Accumulation
Batch Size of 1 Max
GLoRA:
Very Good At Generalization (Styles & Concepts)
Shorter Training Times (?, To Test)
Not Very Good At Training Non-Generalized Subjects
iA3:
Very Small File Sizes
Generally Performs Better Than LoRA
Good With Styles
Can Be Hard To Transfer
Use with High LR (When not using an adaptive optimizer), official implementation recommends 5e-3 (0.005) ~ 1e-2 (0.01)
Diag-OFT:
Faster Training Time
Better Preserves Coincidentals (?, Unsure)
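For reference, when training through kohya's sd-scripts, the LyCORIS method is typically selected via `network_args`. This fragment is illustrative only — dataset paths, model paths, and the other required flags are omitted, and the dim/alpha values simply follow the recommendations above:

```shell
# Illustrative fragment: selecting a LyCORIS algorithm in kohya sd-scripts.
# Swap "algo=locon" for loha, lokr, dylora, glora, ia3, or diag-oft as needed.
accelerate launch train_network.py \
  --network_module=lycoris.kohya \
  --network_dim=32 \
  --network_alpha=16 \
  --network_args "algo=locon" "conv_dim=32" "conv_alpha=16"
```

For LoKr you would additionally pass a `"factor=..."` entry in `network_args`, per the Small/Large factor notes above.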
Part 8 | Advanced Training: Styles & Themes
So, you want to train a style of some kind. Regardless of what it is, for broader concepts a LyCORIS is the tool for the job, but unlike a LoRA, there are several kinds of LyCORIS to choose from. If you skipped Part 7, I recommend a LoCON, GLoRA, or iA3.
Once you've chosen your type, make sure your rank is set to 32 or lower. LyCORIS seems to have problems above certain thresholds (though I believe you can go as high as 64); 32 is the generally agreed-upon maximum before issues start.
Now that that's out of the way, you should start building a dataset, just like before. However, style trainings benefit much more from larger datasets, so instead of the 15-50 range from before, look to get around 50-200; in my experience, 125-150 is a good place to be.
Once you've got your images, start tagging. You can generally tag the same way as before, but keep in mind that you want the style, not a character or article of clothing. You should especially be sure to tag backgrounds, clothing, and any other key element.
After tagging, you're good to start training. In my experience, styles need less repetition than a character LoRA: while I recommend ~100 repeats for a LoRA, these are usually okay with ~30-40 repeats. Your mileage may vary, given the size and composition of your dataset.
Changelog:
Updated experimental settings, added more details to part 4, & added a brief section regarding some new findings to part 2.
Added part 2.5, a subsection regarding Nightshade and other AI "poisons".
Moved part 7 to part 8 & removed LyCORIS explanation.
Added (new) part 7, going more in-depth on LyCORIS.
Tweaked some experimental parameters.
Tweaked experimental settings & added some explanations to some values.
Added Q&A questions.
Expanded on the "scale weight norms" value in part 4.
Corrected sections regarding minsnr and zsnr to differentiate them correctly.
Tweaked "additional parameters": Value no longer required.
Added experimental settings to part 4.
Changed title to include LyCORIS.
Added Q&A question.
Correction of more grammar errors.
Slightly expanded Part 1 & 2.
Added section covering implied tags to Part 3.
Added minor elaborations to some areas.
Correction of minor grammar errors in parts 3 & 4.
Added new Q&A questions.
Added parts 6 & 7, covering Multi-concept and Style training respectively.
Added part 3.5 for tagging examples, added two to begin with.