First of all, this was originally a guide I created at the request of a Discord channel (it has changed quite a bit since), and it is also posted on Rentry (that version is out of date). This guide is relatively low granularity and is mostly focused on character LoRA/LyCORIS creation; it is also geared towards training at 512x512 on SD 1.5, as my video card is too crappy for SDXL training. Ask in the comments if any part needs clarification. I will try to respond (if I know the answer :P) and add it to the guide. LyCORIS creation follows the same steps as LoRA creation but differs in the selection of training parameters, so the differences will be covered in the baking section.
Edit (20231003): Did some tests on ia3 or (IA)^3. It is not as good as some claim nor as bad as some fear. All in all I think this lora type is good for fast prototyping and dataset debugging. Here is the trained IA3(It is less than 300kb!), sample images and my conclusions: https://civitai.com/models/155849/iori-yoshizuki-ia3-experiment-version-is
Also added a bit more on bucketing.
Making a Lora is like baking a cake.
Lots of preparation and then letting it bake. If you didn't properly make the preparations, it will probably be inedible.
I am making this guide for offline creation; you can also use Google Colab, but I have no experience with that.
===============================================================
Preparations
First you need some indispensable things:
An Nvidia video card with at least 6 GB, but realistically 8 GB, of VRAM. Solutions for AMD/ATI cards exist but are not mature yet.
Enough disk space to hold the installations
A working installation of Automatic1111 https://github.com/AUTOMATIC1111/stable-diffusion-webui or another UI, and some models (for anime it is recommended to use the NAI family: NAI, AnythingV3.0, AnythingV4.0, AnythingV4.5). I normally use AnythingV4.5 https://huggingface.co/andite/anything-v4.0/blob/main/anything-v4.5-pruned.safetensors
A collection of images for your character. More is always better, and the more varied the poses and zoom levels the better. If you are training outfits, you'll get better results if you have some back and side shots even if the character's face is not clearly visible; in the worst case some pics of only the outfit might do, just remember not to tag your character in those images if she is not visible.
Kohya's scripts. A good version for Windows can be found at https://github.com/derrian-distro/LoRA_Easy_Training_Scripts. The install method has changed: now you must clone the repository and click the install.bat file. I still prefer this distribution to the original command-line one or the full webui ones, as it is a fine mixture of lightweight and easy to use.
===============================================================
Dataset gathering
That's enough for a start. Next begins the tedious part: dataset gathering and cleanup.
First of all, gather a dataset. You may borrow, steal or beg for images. More than likely you'll have to scrape a booru either manually or using a script; for rarer things you might end up re-watching that old anime you loved when you were a kid and going frame by frame doing screencaps. MPC-HC has a handy feature to save screencaps to PNG with right click->File->Save Image, and you can move forwards and backwards one frame with ctrl+left or right arrow. For anime, this guide lists a good amount of places to dig for images: Useful online tools for Datasets, and where to find data.
Get all your images into a useful format, meaning preferably PNG, though JPG might suffice. You can use the PowerShell scripts I uploaded to Civitai, or do it yourself using the steps in the next entry or the script at the bottom.
For GIFs I use a crappy open-source splitter program or ffmpeg. For WebP I use dwebp straight from the Google libraries: dump dwebp from the downloaded zip into your images folder, open cmd there and run
for %f in (*.webp) do dwebp.exe "%f" -o "%~nf.png"
which will convert all the WebP images into PNGs. For AVIF files get the latest build of libavif (check the last successful build and grab avifdec.exe from the artifacts tab), then dump it in the folder and run it the same way:
for %f in (*.avif) do avifdec.exe "%f" "%~nf.png"
For GIFs with ffmpeg use:
for %f in (*.gif) do ffmpeg.exe -i "%f" "%~nf%04d.png"
Unless you are making a huge LoRA that also accounts for style, remove from your dataset any images that might clash with the others, for example chibi or super-deformed versions of the characters. This can be accounted for with specific tagging, but that can hugely inflate the time required to prepare the LoRA.
Exclude any images that have too many elements or are cluttered, for example group photos or gangbang scenes where too many people appear.
Exclude images with watermarks or text in awkward places where it can’t be photoshopped out.
Ok, so you have some clean-ish images, or not so clean ones that you can't get yourself to scrap. The next "fun" part is manual cleaning. Do a scrub image by image, trying to crop, paint over or delete any extra elements. The objective is that only the target character remains in the image (if your character is interacting with another, for example having sex, it is best to crop the other character mostly out of the image). Try to delete or fix watermarks, speech bubbles and SFX text. Resize small images and pass low-res images through waifu2x (see the upscaling part of the next section) or img2img to upscale them. I have noticed that blurring other characters' faces, in the faceless_male or faceless_female tag style, works wonders to reduce contamination. Random anecdote: in my Ranma-chan V1 LoRA, if you invoke 1boy you will basically get a perfect male Ranma with reddish hair, as all males in the dataset are faceless.
That is just an example of faceless; realistically, if you were trying to train Misato, what you want is this (though I don't really like that image too much and wouldn't add it to my dataset, but for demonstration purposes it works):
===============================================================
Dataset Regularization
The next fun step is image regularization. Common training images for SD are 512x512 pixels. That means all your images must be resized, or at least that was the common wisdom when I started training. I still do so and get good results, but most people resort to bucketing, which lets the LoRA training script auto-sort the images into size buckets. The common consensus is also that too many buckets can cause poor quality in the training. My suggestion? Either resize everything to your desired training resolution, or choose a couple of bucket sizes and resize everything to its closest appropriate bucket, either manually or by allowing upscaling in the training script.
One technique is to pass everything through a script which will simply resize the images, either by cropping or by filling the empty space with white. If you have more than enough images (probably more than 250), this is the way to go and not an issue: simply review the results and dump any that didn't make the cut. This can be done in A1111 in the Train->Preprocess images tab.
If, on the other hand, you are on a limited image budget, I would recommend doing this manually. Windows Paint 3D is an adequate, if not good, option: just go to the canvas tab and move the limits of your image. Why do this? Because you can commonly get several sub-images from a single image as long as it is high-res. For example, suppose you have a high-res full-body image of a character. You can resize and fill the blanks to get one image, do a cut at the waist for a portrait shot as a second image, then cut at the neck for a third image with a mugshot only.
To simplify things a bit I have created a script (it's at the bottom of the guide) which makes images square by padding them. It is useful to preprocess crappy (sub-512x512 resolution) images before feeding them to the upscaler. Just copy the code into a text file, rename it to something.ps1, then right click it and click Run with PowerShell.
For low-res upscaling, my current preferred anime scaler is https://github.com/xinntao/Real-ESRGAN/blob/master/docs/anime_model.md (RealESRGAN_x4plus_anime_6B). I have had good results going from low-res shit to 512. Just drop the model inside the A1111 models\ESRGAN folder and use it from the Extras tab. Alternatively, I have found a good anime scaler that is a Windows-ready application: https://github.com/lltcggie/waifu2x-caffe/releases. Just download the zip file and run waifu2x-caffe.exe; you can then select multiple images and upscale them to 512x512. For low-res screencaps or old images I recommend the "Photography, Anime" model. You can apply the denoise before or after depending on how crappy your original image is.
Some extra scalers and filters can be found in the wiki: https://upscale.wiki/wiki/Model_Database. 1x_DitherDeleterV3-Smooth-[32]_115000_G is good for doujins with heavy dithering (or, in our case, to cover up errors when cleaning SFX). 1x_NoiseToner-Poisson-Detailed_108000_G works fine to reduce some graininess and artifacts on low-quality images. As the 1x indicates, these are not scalers and should probably be used as a secondary filter in the Extras tab, or just by themselves without upscaling.
Img2img: don't be afraid to pass an extremely crappy image through img2img to try to make it less blurry or crappy. Synthetic datasets are a thing, so don't feel any shame about using a partially or fully synthetic one; that's the way some original-character LoRAs are made. Keep this in mind: if you already made a LoRA and it ended up crappy... use it! Using a LoRA of a character to fix poor images of said character is a thing and it gives great results. Just make sure the resulting image is what you want, and remember any defects will be trained into your next LoRA (that means hands, so use images where they are hidden or you'll have to fix them manually). When you are doing this, make sure you interrogate the image with a good tagger; DeepDanbooru is awful when images are blurry (it always detects them as mosaic censorship and penises). Try to add any missing tags by hand and remember to add blurry and whatever else you don't want into the negatives. I would recommend keeping denoise low (.1~.3) and iterating on the image until you feel comfortable with it. The objective is for it to be clear, not for it to become pretty.
To remove backgrounds you can use stable-diffusion-webui-rembg; install it from the Extensions tab and it will appear at the bottom of the Extras tab. I don't like it, I haven't had a single good success with it. Instead I recommend transparent-background, which is a lot less user friendly but seems to give me better results. I recommend you reuse your A1111 installation and put it there, as it already has all the requirements. Just open a PowerShell or cmd window in stable-diffusion-webui\venv\Scripts and execute either Activate.ps1 or activate.bat depending on which shell you used. Then install it using: pip install transparent-background. After it is installed, run it by typing: transparent-background --source "D:\inputDir\sourceimage.png" --dest "D:\outputDir\" --type white
This particular command will fill the background with white; you can also use rgb [255, 0, 0] and a couple of extra options, just check the wiki part of the GitHub page https://github.com/plemeri/transparent-background
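If you want to run a whole folder through it, a quick PowerShell loop around that same command works (paths are placeholders, and this assumes you already activated the venv as described above):
#process every png in the input folder with transparent-background
Get-ChildItem "D:\inputDir\*.png" | ForEach-Object{
transparent-background --source $_.FullName --dest "D:\outputDir\" --type white
}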
===============================================================
Folders
So all your images are neat, 512x512 or a couple of buckets. The next step is the folder structure: images must be inside a folder with the format X_Name, where X is the number of times each image will be processed. You'll end up with a path like train\10_DBZStart, where the train folder contains the folder holding the images. Regularization images use the same structure. You can have many folders; all are treated the same, and they let you keep things tidy if you are training multiple concepts like different outfits for a character. It also allows you to give more repetitions to high-quality images, or to a tagged outfit with very few images. For now just set everything to 10 repetitions; you will need to tweak these numbers after you finish sorting your images into the folders.
In the example below I tagged 6 outfits; the misc folder has, well, misc outfits without enough images to be viable. I adjusted the repetitions depending on the number of images inside each folder to try to keep them balanced. Check the Repetitions and Epochs section to adjust them.
So after you finish the structure it is time to sort your images into their corresponding folders. I recommend that if the shot is above the heart, you dump it into misc, as it won't provide much info for the outfit; those partial outfits in misc should be tagged with their partially visible parts rather than the outfit they belong to. That is unless they show a special part not visible in the lower half of the outfit; in that case leave them as-is, treat them as a normal part of the outfit, and just be careful not to overload the outfit with mugshots. Random suggestion: for outfits, adding a couple of full-body headless shots (to make the character unrecognizable) tagged with "1girl, outfitname, etc" does wonders to separate the concept of the outfit from the character.
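As a minimal sketch (folder names are made up; the numbers get tuned later, as per the Repetitions and Epochs section), the final layout tends to look something like this:
train\10_CharacterOutfit1
train\10_CharacterOutfit2
train\10_CharacterMisc
reg\1_1girl (optional regularization folder, same X_Name format)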
===============================================================
Repetitions and Epochs
Setting repetitions and epochs can be a chicken-and-egg problem. The most important factors are the dim, the optimizer, the learning rate, the number of epochs and the repetitions. Everyone has their own recipe for fine tuning; some are better, some are worse.
Mine is as generic as it can be and it normally gives good results when generating at around .7 weight.
I use Dim 32, AdamW with a learning rate of .0001. I strive to do 8 epochs of 1000 steps per epoch per sub concept. Alternatively use Dim 32, Prodigy with a learning rate of 1 with 500 steps per epoch per sub concept.
WARNING: using screencaps, while not bad per se, is more prone to overbaking due to the homogeneity of the dataset, so 1000 is likely way too high. Either pad your dataset with fanart or tweak down the repetitions. Try 500; if that is still too high, try 200. If it is still overbaking, or your character became unrecognizable due to poor quality, it is probably better to set it to 500 and try Prodigy.
Now, what does steps per epoch actually mean?
Suppose I am making a character LORA with 3 outfits. I have 100 outfit1 images, 50 outfit2, 10 outfit3.
I would set the folders repetitions to be:
10_outfit1 = 10 reps * 100 img =1000 steps
20_outfit2 = 20 reps * 50 img =1000 steps
100_outfit3 = 100 reps * 10 img =1000 steps
Remember you are also training the main concept when doing this; in the case above this results in the character being trained for 3000 steps per epoch. So be careful not to overcook it. The more overlapping concepts you add, the higher the risk of overcooking the LoRA.
This can be mitigated by removing the relation between the character and the outfit. Take for example outfit1 from above: I could take 50 of the images and replace the character tag with the original description tags (hair color, eye color, etc). That way, when outfit1 is being trained, the character is not. Another alternative that somewhat works is scale normalization, which "flattens" values that shoot too far beyond the rest, limiting overcooking a bit. The final method to keep overcooking under control is the Prodigy optimizer, which should make things less prone to overcooking, but I am still testing it.
Warning: Don't use Scale normalization with prodigy as they are not compatible.
For the main concept (the character in the example) I would recommend keeping it below 6000 steps per epoch; beyond that you need to start tweaking the dataset to keep it down and prevent it from burning.
Remember these are all approximations; if you have 150 images for one outfit you can leave it be at 10 repetitions. If your LoRA is a bit overcooked, most of the time it can be compensated by lowering the weight when generating. If your LoRA starts deep-frying images at less than .5 weight I would definitely retrain; it will still be usable, but the usability range becomes too narrow. There's also a rebase script around to change the weight, so you could theoretically set .5 to become 1.1, thus increasing the usability range.
TLDR? Choose dim 32, LR .0001, optimizer AdamW, 8 epochs, and make sure the number in the folder name = 1000/#images_in_the_folder, rounded up to the nearest integer (no decimals). It got overcooked? Lower the repetitions/learning rate, enable normalization, or try Prodigy with 500 steps per sub-concept per epoch.
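If you want to sanity-check that math, here is a small PowerShell sketch (run it from inside the train folder; it only assumes your concept folders contain pngs/jpgs) that prints the suggested repetition count per folder for the ~1000 steps per epoch target:
$targetSteps = 1000
Get-ChildItem -Directory | ForEach-Object{
#count the images inside this concept folder
$imgCount = (Get-ChildItem -Path $_.FullName -Recurse -Include *.png, *.jpg).Count
if ($imgCount -gt 0) {
#round up so small folders still reach the target
$reps = [math]::Ceiling($targetSteps / $imgCount)
"$($_.Name): $imgCount images -> use $reps repetitions (${reps}_$($_.Name))"
}
}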
===============================================================
Captioning
So all your images are clean and in nice 512x512 size; the next step is captioning. Captioning can be as deep as a puddle or as the Mariana Trench. Captioning (adding tags) and pruning (deleting tags) are how we create triggers; a trigger is a custom tag which has absorbed (for lack of a better word) the concepts of the pruned tags. For anime characters it is recommended to use the WD1.4 ViT-v2 tagger, which uses danbooru-style tagging. The best way I have found is to use stable-diffusion-webui-dataset-tag-editor (look for it in Extensions) for A1111, which includes a tag manager and the waifu diffusion tagger.
Go to the stable-diffusion-webui-dataset-tag-editor tab in A1111, select a tagger in the dataset load settings and tick "use tagger if empty". Then simply load the directory of your images and, after everything finishes tagging, click save. Alternatively, go to A1111's Train->Preprocess images tab, tick "use deepbooru for captions" and process them.
Caption cleaning: before starting trigger selection it is best to do some tag cleaning (make sure to ignore tags that will be folded into triggers, as those will likely be pruned). Superfluous tags are best served in the following ways:
Delete: for useless tags like meme, parody, sailor_moon_redraw_challenge_(meme), shiny_skin, uncensored (I make an effort to always prune uncensored as I want my LoRAs to remember that is the default state). Also for mistaken identities, in case your character is identified as another character or an object is misidentified.
Consolidate: for generic tags. For example "bow" is best dealt with by replacing it with its color + part equivalent, like black_back_bow or red_bowtie, and deleting the associated individual tags. This mostly applies to hair, clothes, backgrounds and the "holding" tag.
Split: Also for generic tags, for example "armor" is best split into pauldron, gorget, breastplate, etc. Jewelry, makeup and underwear are common offenders.
Synonyms: one version should be chosen and the other consolidated into it, for example "circlet" and "tiara"; most taggers will pick up both.
Evaluate: tags the model already knows can boost or corrupt the training concept. For example, suppose you are training a character and the tagger recognizes it: if the model's response is mild when generating an image with that tag, it can be used as the character trigger to boost it; if, on the other hand, it is already strongly trained, it will likely cause your LoRA to overcook. So either use such tags as triggers or delete them. You don't have to worry if they just happen to exist in your dataset and you are not training for them (for example, if you are training Gotham City architecture and it recognizes Batman).
Now you must decide which tags you are going to use as a trigger for your LoRA. There are three types of "contamination" your trigger can get from the model:
Negative contamination: say you wish to make a LoRA for Bulma in her DBZ costume, so you choose the tag "Bulma_DBZ". Wrong! If your character is unknown there is no issue, but if you choose a famous character like Bulma you will get style contamination from the word "Bulma" and the word "DBZ". In the case of Bulma, her style is so deeply trained into most anime models that it will likely overcook your LoRA simply by being associated with it. Remember that underscores, dashes and hyphens are equivalent to spaces in danbooru notation, and even with a partial match you might get some bleed-over from tangentially invoking those concepts.
Positive contamination: on the other hand, this contamination can be beneficial, especially for outfits. Take for example the trigger Green_Turtleneck_Shirt_Blue_Skirt: as it is not completely concatenated, it will get a bit of contamination from each of the words forming it. This can be very useful to boost triggers for outfits for which you only have a few images. Just make sure to pass it through your model and check that it produces something similar to what you are trying to train.
Noise: if, when you pass your trigger through your model, it produces something different every time, that means the trigger is "free" or untrained in the model, and it is perfect for your LoRA if you want to minimize outside interference.
In summary: before you assign a tag as a trigger, run it through A1111 and check that it returns noise or lightly boosts what you need. In the negative contamination example I could concatenate it to BulmaDBZ, or do what I did, which was to use the romaji spelling Buruma. An alternative way to reduce this problem is the use of regularization images, but I will speak about them later.
The next part is tag pruning. Either use the tag editor or manually go to the folder in which A1111 tagged your images. You must remove any tags in which your character was recognized. I do this using either the tagger replace function, bulk remove, or manually with Notepad++'s Search->Find in Files option, doing a replace of, for example, "Bulma, " with "". You may also want to clean up erroneous or superfluous tags; a good way to determine which caption files need cleaning is to switch the folder to the detailed view and note any caption file bigger than 1 KB (when the taggers fail they often produce a lot of tags). With the tag editor this task is easier: just glance at the tag list, check the most outlandish ones and click on them; the tab will filter the images and show the offending ones, and then you can delete the tag or change it to an appropriate one.
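For reference, that same bulk prune can be done with a couple of lines of PowerShell from inside the captions folder (the tag here is just an example):
#remove the tag "Bulma, " from every caption file in the current folder
Get-ChildItem *.txt | ForEach-Object{
(Get-Content $_.FullName) -replace 'Bulma, ', '' | Set-Content $_.FullName
}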
If you are obsessive or want to do something "fancy" like tagging specific wardrobe combinations, a specific hairstyle or a weapon, you will want to remove all tags for the individual parts of the costume or item in question. For example, if a character uses a red dress, red high heels and a yellow choker, you must delete these individual tags and replace them all with a customized "OutfitName" tag. Or maybe you are a sensible person (noob) and just want your character to appear and let the other tags do their job by themselves. Anyway, after you have deleted any problematic tags it is time to insert your character tag. What I do is select the folder with the captions, do ctrl+shift+right click and select "Open PowerShell window here". There you can run the following command: foreach ($File in Get-ChildItem *.txt) {"Tohsaka_Rin_Alt, " + (Get-Content $File.fullname) | Set-Content $File.fullname}
In the command you need to change "Tohsaka_Rin_Alt, " to your trigger tag or tags. The command will insert the new tags at the beginning of every caption file. I prefer this approach even when using the tag editor, as it will undoubtedly insert my trigger as the first tag in all files; it might be superstition (or maybe not) but I like it that way. For trigger tagging you can go three routes: the lazy route, the "proper" route and the custom route.
In the lazy route you don't prune any of the tags and just add the trigger tag to all the images. The benefit of the lazy route is that the user will be able to change pretty much anything about the appearance of the character. The downside is that if the user only uses the trigger, the character will only vaguely resemble itself (as the trigger only absorbed a small part of all the other concepts) and will require extra support tags like eye color and hair style to fully match its correct self.
The proper way is to add the trigger tag to all images and then prune all intrinsic characteristics of the character, like eye color, hair style (ponytail, bangs, etc), skin color, notable physical characteristics, and maybe some hair ornaments or tattoos. The benefit is that the character will appear pretty much as expected when using the trigger. On the other hand, it will fight users who want to change hair or eye color.
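As a made-up example of the proper route, a raw caption like
1girl, solo, blue_eyes, long_hair, ponytail, school_uniform, smile, outdoors
would end up, after adding the trigger and pruning the intrinsic traits, as
MyCharTrigger, 1girl, solo, school_uniform, smile, outdoors
so eye color, hair color and hairstyle get absorbed into the trigger, while pose, outfit and background tags keep doing their job.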
The custom ("fancy") way relies on knowing what to prune. For example, if a character can change her hair color, then don't prune it. In my case I never prune breast size, as people will begin to complain: "Why are Misato's boobs so large!", to which you will inevitably have to reply "just tag her small_breasts!" or "Big boobs are big because they are filled with the dreams and hopes of mankind!" This last method is obviously the best; the only limitation is the dataset, especially if you wish to add triggers for outfits. If your character only uses one outfit (being nude counts as an extra outfit for this), forget about tagging for it; it is nearly impossible, as the outfit trigger will mix with the character trigger. If you have at least 2 outfits appearing in your dataset, just make sure to account for overlap: for example, if both outfits have a ribbon, it is likely the character trigger will absorb the concept of ribbon instead of the individual outfits, so it would be best not to prune ribbon. Anyway, it is a game of math. If A is the character trigger and B and C are outfits, and there's only A and B appearing in the same images, both triggers will end up the same and will split the pruned concepts equally. If you have all three, A will eat the pruned concepts shared between all images, B will eat the ones pruned in the B group that don't appear (or are un-pruned) in the C group, and C will eat the ones pruned in the C group that were not in the B group or are un-pruned in the B group.
Example of the custom way: You have 3 groups of images of a character. In each group the character uses a different colored dress, two of the dresses variants also contain a sword.
Suppose you want triggers for CharacterDress and character. As all the images have a dress, you prune the character characteristics and delete the dress tag, then you add both triggers to all images. The result is that both triggers do exactly the same thing. Look at the first example in the image above.
Suppose you want triggers for Outfit1, Outfit2 and character. As all the images have a dress, you prune the character characteristics and delete the dress tags in groups 1 and 2, then you add the character trigger to all images and the outfit tags to their respective groups. The result is that, since group 3 doesn't have its dress tag pruned, the LoRA knows dress is not part of the character trigger; as for the outfit triggers, there is no problem with them as they have no overlap.
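To make that concrete (all tags are made up): a group 1 caption might read "character, Outfit1, 1girl, smile, outdoors", while a group 3 caption keeps its dress tags, for example "character, 1girl, red_dress, standing, indoors". Because the dress tag survives in group 3, the dress concept stays attached to its own tag instead of being absorbed into the character trigger, while Outfit1 absorbs the dress tags pruned from its own group.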
Suppose you want triggers for ChDress1, ChDress2, ChDress3, CharacterSword and character. All the images have a dress, so you prune the character characteristics and delete all dress tags, then you add the character trigger to all images and the ChDress tags to their respective groups. You also add the sword trigger to the images it applies to. The result is that, since all groups had their dress tag pruned, the LoRA thinks the dress is part of the character trigger; the ChDress triggers will be diluted and might not trigger as strongly as they should. Thankfully, color and other hidden-ish tags will also make a difference, but the result will be a bit more watered down than if you had perfect concept separation. The character will also have a tendency to appear in a dress regardless of the color. As for the CharacterSword trigger, since it appears in a well-defined subset of images it should trigger properly. A fix for this situation is commonly a misc group of images with unpruned tags to clearly teach the LoRA that those ribbons, dresses, accessories, etc are not part of the character trigger.
The custom way can quickly escalate in complexity depending on the number of triggers you are creating; it also requires a larger dataset, hopefully with clearly delimited boundaries. It might also require some repetition tweaking (see the Folders section) to boost the training of triggers with fewer images. For example, if you have 20 images of OutfitA, 10 of OutfitB and 5 of OutfitC, it would be best to sort them into folders as 10_OutfitA, 20_OutfitB and 40_OutfitC; that way all will get approximately the same weight in the training (20*10 = 10*20 = 5*40).
This step is only for when you are training a style. If you are training a style you will need as many varied pictures as possible. For styles, captions are treated differently: either delete all tags from the caption files (optionally adding only the trigger) and let the LoRA take over when it is invoked, or add a tag to act as the trigger and eliminate only the tags that are associated with the style (for example retro style, a specific background, perspective and tags like that), leaving all particular tags alone (1girl, blue_hair, etc). It is simpler to just eliminate all tags.
This is only if you are training a concept. First of all, you are shit out of luck: training a concept is 50% art and 50% luck. First make sure to clean up your images as well as possible to remove most extraneous elements. Try to pick images that are simple and obvious about what is happening. Try to pick images that are as different as possible and only share the concept you want. Tagging is similar to the trigger variant for style: you need to add your trigger and eliminate all tags which touch your concept, leaving the others alone.
===============================================================
Regularization images
So you now have all your images neatly sorted and you wonder what the heck regularization images are. Well, regularization images are like negative prompts, but not really; they pull your model back to a "neutral" state. Either way, unless you really need them, ignore them. Normally they are not used in LoRAs as there is no need to restore the model; you can simply lower the LoRA weight or deactivate it. There are some theoretical uses; below are a couple of examples.
To me they only have two realistic uses:
Mitigate the bleed-over from your trigger. Suppose I want to train a character called Mona_Taco. The result will be contaminated with images of the Mona Lisa and tacos. So you can go to A1111, generate a bunch of images with the prompts Taco and Mona, and dump them into your regularization folder with their appropriate captioning. Now your LoRA will know that Mona_Taco has nothing to do with the Mona Lisa nor tacos. Alternatively, simply use a different tag or concatenate it; MonaTaco will probably work fine by itself without the extra steps. I would still recommend simply using a meaningless word that returns noise.
Another use is style neutralization. Suppose you trained a character LoRA with a thousand images, all including the tag 1girl. Now, whenever you run your LoRA and prompt 1girl, it will always display your character. To prevent this you would use a thousand different images of different girls, all tagged as 1girl, as regularization to balance out your training and remove your LoRA's influence from the 1girl concept. Of course, you would have to do this for as many affected tags as you can.
As far as I understand it: consider the model as M and the LoRA as L. L consists of NC and MC, where NC are new concepts and MC are modified concepts from the model. Finally we have R from the regularization images; R is part of the model, as it was created from images inherent to the model. If done right, R is also, hopefully, the part of M that is being overwritten by the MC part of the LoRA.
Without regularization images
L = M + NC + MC - M = NC + MC
With regularization images
L = M + (NC + MC) - M - R
but we tried to make MC equivalent to R, thus
L = M + NC - M = NC
Now a more concrete example. We have a dataset that teaches 1girl, red_hair and character1. The model already knows red_hair and 1girl, but the dataset's versions are different from the model's. The regularization images contain 1girl and red_hair as the model knows them.
M= 1girl, red_hair, etc
NC= character1
MC=1girl, red_hair
Without Regularization images
L =1girl, red_hair, etc + character1 + 1girl, red_hair - 1girl, red_hair, etc
L= character1 + 1girl, red_hair
With regularization images
L =1girl, red_hair, etc + character1 + 1girl, red_hair - 1girl, red_hair, etc - 1girl, red_hair
L=character1 + 1girl, red_hair - 1girl, red_hair
L=character1 + remains of (1girl, red_hair - 1girl, red_hair)
I am unsure to what point "1girl, red_hair - 1girl, red_hair" cancels out or mixes. There is obviously a true mathematical way to describe this; this is just an attempt to dumb it down.
===============================================================
Baking The LORA/LyCORIS
Now the step you've been waiting for: the baking. Honestly, the default options will work fine 99% of the time; you might wish to lower the learning rate for styles. Anyway, for a LoRA you must open run.bat from LoRA_Easy_Training_Scripts. I recommend never starting training immediately, but saving the training TOML file and reviewing it first. The pop-up version of the script seems to have been replaced by a proper UI, so give it a pull if you still have run_popup.bat.
General Args:
Select a base model: put the model you wish to train with. I recommend one of the NAI family (the original, or one of the "Any" or "AOM" originals or mixes) for anime. I like AnythingV4.5 (no relationship with AnythingV3 or V5), the pruned safetensors version with the checksum 6E430EB51421CE5BF18F04E2DBE90B2CAD437311948BE4EF8C33658A73C86B2A. There was a lot of drama because the author used the naming scheme of the other Anything models. Let me be honest, I simply like its quasi-2.5D style (closer to 2D than to 2.5D); I find it better than V3 or V5 and it has better NSFW support.
SD2 Model: No. The NAI family is based on 1.5
Gradient: I think it is a VRAM-saving measure, but I am not sure exactly how it works.
Seed: Just put your lucky number.
Clip skip: has to do with which text encoder layer is used, if I remember correctly. Most anime models use 2; some people actually call it "the NAI curse" as it originated from their model. Most photorealistic models use 1.
Prior loss weight: no idea just leave it at 1
Training precision: choose fp16, it should be the most compatible.
Batch size: the number of images per batch; it depends on your VRAM. At 8 GB you can do 2, or 1 if you are using image flipping, so just select 1 unless your character is asymmetric (single pigtail, eyepatch, side ponytail, etc.) and you therefore can't flip. The Prodigy optimizer uses more VRAM than AdamW, so beware, you might need to lower the batch size.
Token length: the maximum caption length in tokens. I have seen people using CLIP-style strings ("A giant burrito eating a human in an urban environment") with danbooru-style tagging; don't be that person. I recommend long triggers only when you need to tap into positive contamination from the base model, especially for complex clothing where the model struggles with the colors or parts of it, like this: red_skirt_blue_sweater_gray_thighhighs_green_highheels. (Doing this should help stabilize the output; if you had used a generic trigger, it is a coin toss whether the model will choose the correct colors for the outfit.)
Max training time: depends on your dataset. I normally use 8 epochs of 10 repetitions (X_Name becomes 10_Name) for 100 to 400 images, or between 8000 and 32000 total steps (e.g. 100 images * 10 repetitions * 8 epochs = 8000). This is for AdamW; for Prodigy just cut it in half.
Xformers: uses the xformers library to boost training speed. I actually got some training speed increase with the latest version of xformers; to update it, enter the venv and do "pip install -U --pre xformers".
Cache latents: caches the processed image latents to keep VRAM usage stable.
To disk: caches the processed image latents to disk to save VRAM (might slow things down).
Comments: remember to put your triggers in the comment field. If someone finds your LoRA in the wild they will be able to check the metadata and use it. Don't be that person who leaves their orphaned LoRAs around with no one able to use them.
Data subsets:
Images folder dir: select your images folder, the one with the X_Name format folders; the number of repeats should auto-populate. To add more folders, click "add data subset" at the top.
flip augment: If your character is symmetric remember to enable flip augment.
Keep token: keeps the first N tags in the caption file from being shuffled, so your triggers stay at the front. Unneeded if you sorted or put the triggers at the beginning and aren't shuffling; you will want to set it if you toggled shuffle captions. Remember the first tags in the file are processed first and absorb concepts first.
shuffle captions: not necessary unless you have an extremely homogeneous dataset.
Caption extension: the default is to store it as common txt files. I have yet to see a different one.
Regularization images: I explained these above; the common answer is don't use them, but if you do, toggle the folder as a regularization folder here.
Random crop: this one is old; I think it is an alternative to bucketing that takes crops of a bigger image to process it. Not sure if it applies the caption equally to every crop. Mutually exclusive with cache latents.
Color augment: I think this one tweaks the saturation values to better sample the images, don't quote me on that. Mutually exclusive with cache latents.
Face crop: as far as I know it acts as an augment, making a crop focused on the character's face. It used to be mutually exclusive with cache latents (not sure now). Not sure about its reliability.
Caption dropout: randomly drops the caption (or individual tags, depending on which dropout option you use) for a fraction of the steps so the model doesn't lean too hard on them.
Token warmup: the opposite of caption dropout; it trains on more and more tags as training progresses, I think in the order they appear in the caption files.
Network args:
Type: here you can choose the type of LoRA. LyCORIS types require some extra scripts and take longer to train; pick LoRA unless your card is fast and you have the scripts needed for LyCORIS. Here are some details I know about LoRA and the LyCORIS variants (might be wrong):
Lora: The normal one we all know and love.
LoCon (LyCORIS): it picks up more detail, which can be a good thing for intricate objects, but keep in mind the quality of your dataset, as it will also pick up noise and artifacts more strongly. Has a slight edge on multi-outfit LoRAs, as the extra detail helps it differentiate the outfits, limiting bleed-over a bit (a very slight improvement).
LoCon (Kohya): older implementation of LoCon. I would expect it to be a bit worse, but I haven't tried it.
Loha: smaller file sizes, seems to produce some variability in style(sharper? gothic?) that some people like.
Lokr: similar to Loha in smaller sizes uses a different algorithm.
IA3: smaller sizes, faster training and extreme style pick-up. As it trains only a subset of the values a LoRA does, it is small and fast to train. It does fine with half of a LoRA's steps, making it 200~300 steps per epoch for Prodigy and 500-ish for AdamW (I have only tested Prodigy). I tested the compatibility with other models and it is not as bad as claimed. All in all I don't like it as a final product, but for prototyping and debugging the dataset it seems to be a great option due to how quick it is to train. Here are my results using Prodigy: https://civitai.com/models/155849/iori-yoshizuki-ia3-experiment-version-is As can be seen, it strongly picked up the dataset style, the dataset being mostly monochrome and colored doujins. I honestly like the LoCon and LoRA results better, as they absorb more of the base model, filling any gaps.
DyLoRA: dynamic LoRA is the same as a normal LoRA, but it trains several levels of dim and alpha. It is slower to train, but the resulting file lets you use the LoRA as if you had trained it multiple times with different dim/alpha values instead of a single pair, so you can pick the perfect combination.
Current Recommendations:
LoRA: Dim 32 alpha 16.
LoCON: either Dim 32 alpha 16 conv dim 32 and conv alpha 16 OR Dim 32 alpha 16 conv dim 16 and conv alpha 8. Don't go over conv dim 64 and conv alpha 32
LoHA: Dim 32 alpha 16 should work? Don't go higher than dim 32
LoKr: very similar to LoHa. Dim 32 alpha 16 should work? Don't go higher than dim 32. According to the repos it might need some tweaking of the learning rate, so try between 5e-5 and 8e-5 (.00005 to .00008).
IA3: Dim 32 alpha 16 should work. Needs a higher learning rate; currently recommended is 5e-3 to 1e-2 (.005 to .01) with AdamW. Prodigy works fine at LR=1 (tested).
DyLoRA: for DyLoRA the higher the better (dim and alpha should always be divisible by 4), but higher values also increase the training time, so dim 64, alpha 32 seems like a good compromise speed-wise. The steps are configurable in the DyLoRA unit value; the common value is 4, so after training you could generate 64/32, 60/32, 64/28... down to 4/4. Obviously DyLoRAs take a lot longer to train, or everyone would be using them for the extra flexibility.
Network dimension: has to do with the amount of information the LoRA can store. As far as I know 32 is the current standard; I normally go up to a max of 128 depending on the number of characters and outfit triggers in my LoRAs. For a single-character LoRA, 32 should be OK.
Network Alpha: Should have something to do with variability(not quite sure), rule of thumb use half the Dimension value.
Train on: both, almost always choose both. Unet only trains just on the images, while text encoder only trains just on the text tags.
Conv settings: the conv values are for LoCon. I would recommend the same values as dim and alpha as long as dim is 64 or under; if you are using higher values, the max you should set conv dim and conv alpha to is 64 and 32.
DyLoRA unit: the division of the available dim and alpha. If you use dim 16 and unit 4, you get a LoRA that can produce images as if you had dim 16, dim 12, dim 8 and dim 4 LoRAs.
Dropout: Just to drop parameters randomly to increase variability.
Block weights: for if you want to add more granularity to the training phases. I haven't the foggiest what an optimum configuration for best quality would be (if there even is one), as the dataset has a huge impact. There's a guide from bdsqlsz; it is the only one I know of for block training.
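For reference, here is a rough sketch of how these network choices usually land in a kohya-style TOML config. The key names follow kohya's sd-scripts, and your version of the easy scripts may arrange them differently, so treat it as illustrative only:
network_module = "networks.lora" #LyCORIS types are trained through "lycoris.kohya" instead
network_dim = 32
network_alpha = 16
#for a LoCon the conv settings usually travel as network_args, something like:
#network_args = [ "algo=locon", "conv_dim=32", "conv_alpha=16" ]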
Optimizer settings:
Optimizer: I currently recommend either Prodigy, AdamW or AdamW8bit. If your LoRA is at no risk of burning, I recommend sticking with AdamW. If, on the other hand, you are borderline due to dataset issues, Prodigy is the way to go to limit overbaking. For Prodigy I would recommend keeping total outfit steps to a maximum of 500 per epoch (i.e. 50 images at 10 repetitions, or 100 images at 5 repetitions), as it uses a more aggressive learning rate. The quality between AdamW and Prodigy seems about equal; in the linked image I compare AdamW LoRA vs AdamW LoCon vs Prodigy LoRA vs Prodigy LoCon, and I have a hard time discerning whether one of them is objectively better. Thus I have pretty much switched full time to Prodigy: even though it takes some 25% longer per step, this is offset by only requiring half the steps per epoch of AdamW, which actually produces considerable gains in training time. You may want to check the Prodigy scheduler example TOML if you plan to use it.
AdamW: Training bread and butter and golden standard. It works fine at about 1000 steps per epoch per concept trigger.
Prodigy: the best adaptive optimizer; slower than AdamW per step, but it only requires half the steps per epoch (500), so it actually saves some training time. It is like DAdapt but seems to actually deliver. Being an adaptive optimizer, it makes finetuning the learning rate unnecessary. I originally tried it for a LoRA that was getting too much contamination from the model, making it overcook. I tried it vs normal vs normal with lowered learning rate vs normal with reduced repetitions vs using normalization; I got the best result with Prodigy, followed by AdamW using normalization, with plain AdamW at the very bottom. So I guess the hype is real. Prodigy requires adding extra optimizer args in tab 2; remember to first click "add optimizer arg". These are the recommended args (they end up in the saved TOML as shown in the sketch at the end of this section):
weight_decay = "0.01"
decouple = "True"
use_bias_correction = "True"
safeguard_warmup = "True"
d_coef = "2"
Scheduler: annealed cosine
It should look as below:
DAdapt: these optimizers try to calculate the optimal learning rate. I did some "unsuccessful" tests with DAdaptAdam; it worked, but I just didn't like the results. This might change in the future when more tests are done. These optimizers use very high learning rates that are scaled down on the fly. For my tests I used the repo recommendations: learning rate = 1 with constant scheduler and weight decay = 1. Some other people recommend LR = .5 and WD = .1 to .8. These optimizers also took 25% longer training time, so I don't recommend them... yet. The idea of not needing to finetune the learning rate is alluring, so hopefully they will work better in the future with some tweaking. DAdaptAdam also requires adding an extra optimizer arg in tab 2; remember to first click "add optimizer arg". The first input should be "decouple" and the value should be set to "True".
Learning rate: for AdamW it is OK at the default (.0001); lower it for styles (.00001-.00009). For Prodigy it should be 1.
Text encoder and Unet learning rates: these are for if you don't want a global one. I think if the Unet rate is too high you get deep-fried images, and if the TE rate is too high you get gibberish (deformed) ones.
Learning rate modifiers: technically important, as they manipulate the learning rate over time. In practice? Just select "cosine with restarts" for AdamW; for Prodigy, "annealed cosine with warmup restarts" gave me good results. I have seen some comparisons and those produce fine results.
Num restarts: Set it to 1 restart for cosine with restarts. Some people recommend 3. YMMV.
Warm up ratio: for if you want a learning rate boost at the beginning. I don't use it, I suppose you could shave some training time at the risk of some unpredictability.
Minimum SNR gamma: seems to filter some noise/ tokens. I do seem to get a bit less noisy images when using it. If you use it, set it at 5.
Scale weight: as far as I know it tries to level the values of the new weights introduced by the LoRA towards their average, reducing peaks and valleys. It might be good to reduce style pickup, but it might also kill special traits. It should probably be set to 1. This one works; I don't like it, but it works, and it is best used in combination with a lowered learning rate or fewer repetitions. DON'T USE IT WITH PRODIGY.
Weight decay and beta: as far as I understand, weight decay dampens the strength of a concept to normalize it, while beta is the expected normalization value. Some stuff I read mentions decreasing weight decay on big datasets and increasing it on small ones, but I always leave it as is. Weight decay for Prodigy should be lower; .01 works fine.
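To close the section, here is a rough sketch of how the Prodigy recipe above might look in the saved TOML. Again, the key names follow kohya's sd-scripts, and the "annealed cosine with warmup restarts" scheduler is a custom one from the easy scripts whose identifier depends on the version, so double-check against the file your own UI saves:
optimizer_type = "Prodigy"
learning_rate = 1.0
optimizer_args = [ "weight_decay=0.01", "decouple=True", "use_bias_correction=True", "safeguard_warmup=True", "d_coef=2" ]
#lr_scheduler = ... #whatever identifier your version saves for the annealed cosine variant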
Saving args: Just make a new folder and put your stuff there
output folder: where you want to save your stuff
Output Name: Currently crashes if you don't enable it and give it a name.
Save precision: set it to fp16 as it is the most standard
Save as: safetensors, really this option shouldn't even be available by now.
Save frequency: depends on the number of epochs you are training. I normally train 8 epochs at medium repetitions, so every epoch is fine and I will get 8 files. If, on the other hand, you train many epochs at low repetitions, then you should change it to every 2 or 10 or whatever you need.
Save ratio: I think it is the maximum number of allowed saved epochs.
save only last: same as last one just in case you fear your training will be interrupted and want to keep just a couple of earlier epochs.
Save state: literally saves a memory dump of the training process so it can be resumed later; useful in case of disaster like a power outage or hardware/software failure, or a naughty cat typing ctrl+c or alt+f4.
Resume state: the path of the save state you are resuming training from.
Save tag occurrence: YES! That stuff is useful for getting an idea of the available tags for your character in that particular LoRA when generating.
Save TOML file: Yes, I always recommend to give it a glance before training to see you didn't fuck up.
Bucketing: as far as I know, the fewer buckets the better. For example, if you have a minimum of 256, a maximum of 1024 and 64-pixel steps in between, you can have a maximum of 12 buckets ((1024-256)/64 = 12) per side, combined with the complementary side sizes that do not exceed the max total pixel count of the training resolution, 262144 (512*512), resulting in 47 potential buckets in total. In the image below are the valid combinations for 512 training; your image will be slotted into the biggest bucket it fits after being downsized. For example, a 1920x1080 image will be reduced until it fits the biggest bucket with a 16:9-ish aspect ratio, so it will likely be resized to 640x360 (1920/3 and 1080/3) and slotted into the 640x384 bucket, as that is a good fit.
It is highly recommended to choose 4 or 5 buckets and resize your images to those resolutions, as having too many buckets has been linked to blurry images. As you can see, the training accepts images that exceed the stated max resolution on one side; the bucketing algo seems to do some matrix magic to process all the pixels as long as the total count is below the max of 262144 (for 512 training), instead of, say, making the biggest side 512 and shrinking the smaller side further.
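If you are curious which buckets are even possible, this small PowerShell sketch enumerates the width/height pairs using the defaults mentioned above (min 256, max 1024, 64-pixel steps, 512*512 pixel budget); the trainer's actual list may differ slightly:
$minRes = 256; $maxRes = 1024; $step = 64; $budget = 512 * 512
for ($w = $minRes; $w -le $maxRes; $w += $step) {
for ($h = $minRes; $h -le $maxRes; $h += $step) {
#keep only the combinations whose total pixel count fits the training budget
if ($w * $h -le $budget) { "{0}x{1}" -f $w, $h }
}
}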
Minimum Bucket resolution: minimum side resolution allowed for an image
Maximum bucket resolution: not sure if maximum resolution or if maximum resolution of the smaller side. It is likely the former.
bucket resolution steps: size increases between buckets.
Don't upscale images: does what it says on the tin; it won't upscale the image to its nearest bucket and will instead pad it with white or alpha (transparency).
Noise Offset: Literally, it just adds noise in case the images of the dataset are too similar. It can either increase training quality or... add noise.
Type: Normal homogeneous noise or pyramidal(starts low, ramps up and goes down)
Noise offset value: Amount of noise to add. The default seems to be .1. I don't normally add noise
Pyramid iterations: I guess a sawblade pattern of x iterations.
Pyramid discount: Think this is the slope of the pyramid.
Sampler Args: parameters for test images that will be produced every epoch. I don't normally enable it as this will slow things somewhat.
Sample, steps and prompt: if you don't know what these are, why did you read to this point of the guide? Go check a basic SD generation guide.
Logging args: for analyzing the training with some tools. Honestly, at this point I suspect that if you screwed up, you would be better served by checking your dataset and tags than by spending time researching to learn that you need to lower your alpha by .000001. Useful for trying to find better parameter combinations, not so much for troubleshooting that LoRA that keeps turning out ugly.
Settings: pretty much logging style and folders where to save.
Tensorboard is installed by default with the easy training scripts so you can run it from the venv in there.
There's Jiweji's guide in Civitai for a deeper explanation.
Batch training: you need to save the individual TOML files, then load them one by one, give each a name and click "add to queue". When you have added all the trainings, just click start training.
I attached a TOML file of one of my trainings, you can load it and just edit the folders if you want. Remember to turn on or off the flip augment as needed.
Finally, let it cook. It is like a cake: if you peek, it will deflate. If you use the computer too much it might mysteriously lower its speed and take twice as long. So just step away, go touch grass, stare directly at the sun, scream at the neighbor kids to get off your lawn.
Finally, your LoRA finished baking. Try it at 1 weight or do an XYZ plot with several weights. If it craps out too early, go back to a previous epoch. Congratulations, you either finished or you screwed up.
===============================================================
Utility script
Below are some PowerShell scripts (I also uploaded them to Civitai in their .ps1 form) useful for changing the file extensions to PNG and for squaring the images and filling the empty space with white. The script can be edited to only change the format. Remember to paste it into a file with the .ps1 extension, then right click it and run it with PowerShell.
If you have downloaded and put ffmpeg.exe, dwebp.exe and avifdec.exe in the images folder you can add the following lines at the beginning of the script below to also support those file types.
#change gif to png
Get-ChildItem -Recurse -Include *.gif | Foreach-Object{
$newName=($_.FullName -replace '.gif',"%04d_from_gif.png")
.\ffmpeg -i $_.FullName $newName 2>&1 | out-null
}
#change webp to png
Get-ChildItem -Recurse -Include *.webp | Foreach-Object{
$newName=($_.FullName -replace '.webp',"_from_webp.png")
.\dwebp.exe $_.FullName -o $newName #dwebp needs -o to specify the output file
}
#change avif to png
Get-ChildItem -Recurse -Include *.avif | Foreach-Object{
$newName=($_.FullName -replace '.avif',"_from_avif.png")
.\avifdec.exe $_.FullName $newName
}
The PowerShell script below converts images into PNG files and makes them square by adding white padding. They can then be fed to an upscaler or other resizer to make them the correct resolution.
Note: you can delete everything below "#From here it is to square and fill the images" and use the script to only change the format of the image files.
#change jpg to png
Get-ChildItem -Recurse -Include *.jpg | Foreach-Object{
$newName=($_.FullName -replace '\.jpg$',"_from_jpg.png")
[void][System.Reflection.Assembly]::LoadWithPartialName("System.Drawing")
$bmp = new-object System.Drawing.Bitmap($_.FullName)
$bmp.Save($newName, "png")
$bmp.Dispose() #release the handle so the source file isn't left locked
}
#change jpeg to png
Get-ChildItem -Recurse -Include *.jpeg | Foreach-Object{
$newName=($_.FullName -replace '\.jpeg$',"_from_jpeg.png")
[void][System.Reflection.Assembly]::LoadWithPartialName("System.Drawing")
$bmp = new-object System.Drawing.Bitmap($_.FullName)
$bmp.Save($newName, "png")
$bmp.Dispose() #release the handle so the source file isn't left locked
}
#change bmp to png
Get-ChildItem -Recurse -Include *.bmp | Foreach-Object{
$newName=($_.FullName -replace '\.bmp$',"_from_bmp.png")
[void][System.Reflection.Assembly]::LoadWithPartialName("System.Drawing")
$bmp = new-object System.Drawing.Bitmap($_.FullName)
$bmp.Save($newName, "png")
$bmp.Dispose() #release the handle so the source file isn't left locked
}
#From here it is to square and fill the images.
$cnt=0
Get-ChildItem -Recurse -Include *.png | Foreach-Object{
$newName=$PSScriptRoot+"\resized"+$cnt.ToString().PadLeft(6,'0')+".png"
[void][System.Reflection.Assembly]::LoadWithPartialName("System.Drawing")
$bmp = [System.Drawing.Image]::FromFile($_.FullName)
if($bmp.Width -le $bmp.Height)
{
$canvasWidth = $bmp.Height
$canvasHeight = $bmp.Height
$OffsetX= [int] ($canvasWidth/2 - $bmp.Width/2)
$OffsetY=0
}
else
{
$canvasWidth = $bmp.Width
$canvasHeight = $bmp.Width
$OffsetX=0
$OffsetY=[int] ($canvasWidth/2 - $bmp.Height/2)
}
#Encoder parameter for image quality
$myEncoder = [System.Drawing.Imaging.Encoder]::Quality
$encoderParams = New-Object System.Drawing.Imaging.EncoderParameters(1)
$encoderParams.Param[0] = New-Object System.Drawing.Imaging.EncoderParameter($myEncoder, 100)
# get codec
$myImageCodecInfo = [System.Drawing.Imaging.ImageCodecInfo]::GetImageEncoders()|where {$_.MimeType -eq 'image/png'} #use the png encoder so the output actually matches the .png extension
#create resized bitmap
$bmpResized = New-Object System.Drawing.Bitmap($canvasWidth, $canvasHeight)
$graph = [System.Drawing.Graphics]::FromImage($bmpResized)
$graph.Clear([System.Drawing.Color]::White)
$graph.DrawImage($bmp,$OffsetX,$OffsetY , $bmp.Width, $bmp.Height)
#save to file
$bmpResized.Save($newName,$myImageCodecInfo, $($encoderParams))
$graph.Dispose()
$bmpResized.Dispose()
$bmp.Dispose()
$cnt++
}