[Update: 08.12.23] Create perfect 100mb SDXL models for all concepts using 80gb VRAM - Vast.ai guide

Sorry guys, currently got no money to continue the testing and training. Moving out next month. I'm just gonna post the current state of my new config here for you guys to play around with. I find that the config exactly as it is already produces better results than my old config, with less training time, using only around 20-60 images per concept (depending on the concept). But it is still not perfect. Some concepts still don't train fully, and I find that increasing repeats/epochs/steps/whatever doesn't work out as well as I hoped. Anyway, feel free to test it out and play around with it. I hope that I can continue working on this in March or something.

Note: This config uses a new optimizer called "AdaCoor" developed by GoldKoron (partially using ChatGPT), based on work previously done by the late SargeZT. You need to download 3 files and replace the ones in your current library folder inside Kohya. You just have to put 1 as the LR and epsilon=1e-8 as the optimizer args, but this is already set in the config. Also, I have been told that if you use batch size values higher than 1 you need to change the LR value, but I haven't tested it.

Alternatively, you can use the same config but with Lion instead, which I find also works very well, just a bit differently. Here I am using a learning rate of 1e-7 (you cannot go higher or distortion happens). That's equivalent to AdamW at 1e-6. And of course no additional optimizer args are needed.

I find that some concepts and styles work better with AdaCoor, some with Lion. Just try it out.

The new config(s) need "only" 48gb of VRAM (so an A6000), which is a massive improvement. And SDXL finetuning isn't really possible with 24gb without huge losses in quality anyway. So for renting prices (and availability) that means no longer having to rent an A100 80gb for 2€ an hour, but just an A6000 48gb for 0.4€ an hour.

Also, I no longer extract and then resize. I have tested it, and resizing results in the same model as just extracting at the lower dim directly. I have also found that 128 dim fp16 extraction results in the least amount of quality loss, with higher values not being worth it and lower values resulting in too much quality loss imo. 128 dim fp16 extraction results in a 1gb LoRa file, which is massive of course, but I strive for quality, not efficiency. You can go lower if you want.

Again it is all experimental still.

I am posting the links in the beginning of this article as an edit.

Links:

Changelog:

  • 08.12.2023 - haven't had time to update the guide yet, but just be aware that I was wrong on Min SNR Gamma (5). It does indeed noticeably increase likeness and I recommend using it from now on.

    • As the guide strives to create the highest quality model, it requires 80gb VRAM aka renting an A100, since I turn off all the optimizations which I found to result in slightly lower model quality.

      • I will work on a guide for cost-effective model training with just 24gb and hopefully not that much worse quality in the near future.

  • 05.12.2023 - replaced previous 80gb config with a new config titled "Skynet" (from now on each version will be named after a fictional AI, and my model versions will be named after the config I used as well)

    • major differences are no longer using EMA and instead turning off all other optimizations, and going back to 1 repeat and 1 batch size, resulting in almost the same likeness as before, but much more diversity and flexibility

    • I removed the 48gb config as I felt its quality was not good enough and it was still too harsh of a VRAM requirement for most people; I will in the future attempt to figure out a 24gb VRAM config that most people can use and that provides good quality (though not as good as the 80gb config), but for now this guide will only be about the 80gb config - after all, this guide was always about attaining the best results possible, not the most cost- or VRAM-effective ones

    • I have not yet changed the actual article text to be in line with those changes, nor added some additional tips regarding outfits, but will do so sometime this week (I simply haven't had time for more yet)

  • 22.11.2023 - fixed a tiny but crucial error: I wrote that one should use a 1e-4 learning rate, but of course I meant a 1e-6 learning rate as per my configs!


Disclaimer

Financial support is very much appreciated, but please do so by donating to my Ko-Fi instead of sending me Buzz, as I currently have no uses for CivitAI Buzz!

My workflow depicted here requires a GPU with at least 48gb VRAM, but this guide still has a lot of knowledge which is relevant to all training, e.g. captioning and dataset creation and what each training setting does, so it is still useful to all! The guide also includes a vast.ai tutorial on how to rent and set up the required A6000 GPU.


Introduction

Throughout my model training 'career' I have been getting many questions on how I train my models. So one day I wrote this guide. But since then I have trained so many more models and learned so much more about creating even better models that it was time for a completely new guide (also, I never finished that Vast.ai section). My approach to model training has radically shifted over the past 2 months and is reflected in the much higher quality of my new models compared to the old ones.

I no longer do straight training of LoRa's, and instead first do a full-finetune and then extract and resize a LoRa, as I find this to not only result in higher quality but also fewer parameters that could worsen the training results (e.g. no more DIM and Alpha values to play around with).

After having put this project on hold for so long because I have ADD because I am lazy because I was busy, it is finally here (including a better Vast.ai section!)!

I share this knowledge completely for free, because as a leftist I believe in the free sharing of such knowledge and because I have myself gathered a lot of this knowledge for free by asking many people many questions (see also the 'Shoutouts' section), so it would be morally wrong of me to put all that behind a paywall now.

Still, donations are very appreciated and very helpful as I'm broke again.


What this guide won't teach you

This guide is not a beginner's guide. It will not teach you the basics of what model or LoRa training even is or why we do it in the first place. It will assume you already know some basic stuff.

It will not teach you the basic differences between various model types like LoRa's vs. full finetunes vs. Hypernetworks, etc... and why people choose one over the other.

You will not learn how to do straight LoRa training.

It will not teach you how to train on 1.5 or 2.1 or other SD model architectures.

It will not teach you any theory, because I found that theory and practice diverge quite a bit in this space (also, I am not a machine learning engineer). There are so many settings where theory says they should do X, but my long-time testing showed that they actually do Y. There are many people who swear by the theory and will train their models according to it, but I am not one of them.

This guide will not include any tests or comparisons between settings, as creating those tests and comparisons would cost me far too much in time and money to do and would bloat this guide up to the moon.

It will not teach you the fastest and most cost- and time-effective way of model training. My way of training requires either an A6000 48gb or an A100 80gb (depending on which config you choose, with minimal differences between them, but more on that later) and definitely cannot be done on a GPU with 24gb VRAM.

If you seek guides that will teach you any of the above, this is not one of them. There are many other guides out there, be that on CivitAI, Reddit, or YouTube, that will teach you that stuff.

However, much of the stuff taught in this guide will probably still be relevant to you and help you in your endeavours.


What this guide will teach you

You will learn how to train high-quality SDXL 1.0 LoRa models with a small file size of just 100mb, great flexibility, low amount of bias, and high image quality - on either an A6000 GPU with 48gb VRAM or an A100 GPU with 80gb VRAM.

You will learn how to first do a full-finetune and then extract a LoRa from it and resize it.

You will learn how to build good datasets and where to get the data from in the first place.

You will learn how to caption it well.

You will learn how to set up a GPU on Vast.ai for training.

You will learn what effect most of the training settings have on training.

You will learn how to evaluate your final model and how to best set up your model page on CivitAI (in my subjective opinion).


My 'credentials' and history

As I have already stated above, I am not a machine learning engineer. I am not a computer scientist either. Hell I don't even have a job in the IT industry.

I am just a 26-year-old German low-level bureaucrat who is addicted very dedicated in his pursuit of the optimal model training workflow.

I started with AI image generation as a hobby in April 2022, when I first joined the Midjourney discord. Then in July 2022 I joined the DALL-E 2 beta. Even back then I was already addicted very dedicated and had spent up to 50€ in credits on just one image, inpainting over and over again to achieve a (for the standards back then) perfect result. Remnants of that time can still be found on my Instagram account, like this post here.

I don't remember how or why I got into StableDiffusion, but it happened sometime in August 2022 I think. Back then I swore I would not use one of those custom UI's as that was too complicated for me. Lol.

Then a month or so later the first GitHub repositories for training SD became popular such as JoePenna's and Shivam's. I joined the JoePenna discord and started training my first models (of course my first model was a Korra character model). Oh how bad these early models were and how we were not aware of that fact. Lol.

Since then I have been switching between many different model types (full-finetunes, LoRa's, LoCon's, etc...), architectures (1.4, 1.5, 2.1, XL), training repositories (JoePenna, Shivam, Everdream (Freon), Kohya) and whatnot, always in pursuit of the perfect model. I have probably trained a thousand test models up to this point. All my spare money each month would (and still does) go towards renting GPU's on Vast.ai to further my testing. At the end of every month I would essentially be broke, and I still am.

So I may not have the theoretical knowledge base for training models (albeit at this point I know the basics) or writing my own scripts (thanks ChatGPT) or code. But I have a ton of practical experience training many different models and testing many different settings and caption types and datasets, repeatedly.

But ultimately this is all kinda meaningless since I can just say whatever here and just lie. So in the end I encourage you to check out my most recently uploaded models and judge their quality and the trustworthiness of my words for yourself.


Shoutouts

As I have written above, I came into this space with absolutely no knowledge about anything. Not even basic Linux usage, as I am a Windows user. So naturally I had to ask a lot of questions, to a lot of people. So without them I would have never come this far. Not to mention all the people providing the tools I use for free. So it is just fair to give them a shoutout here. This will not be a comprehensive list, that would take far too long and I don't even remember every person on my journey.

  • SargeZT - few people knew him, which is all the more reason to put him at the top here. He was never that interested in model training itself, but he was very dedicated to doing novel experiments such as introducing v-prediction to 1.5, building the "DoWG" optimizer, and some other stuff I don't remember. Unfortunately he unexpectedly passed away in September at a far too young age, and I have to hold back the tears writing this.

  • Kohya - for providing his repo and answering a lot of questions (and most recently introducing his own superior version of high-res fix!)

  • JoePenna and everyone else in the JoePenna discord - for starting me on this journey with his repo - one of the first in this space - and answering a lot of questions

  • Freon and everyone else in the Everdream discord - for providing his repo and answering a lot of questions

  • Comfy - for providing a great resource-saving UI and answering questions

  • bmaltais - for providing the much used GUI fork to Kohya and answering questions (and even implementing some suggestions/fixes by me)

  • Everyone else I have ever asked a question or who has shared their knowledge or resources with this space for free


Dataset

Every good model begins with a good dataset and issues here will trickle down to your final model output. If your dataset is big enough and bad images make up only a small portion of your dataset and you caption them accordingly then this isn't as much of an issue, but the fewer images you have the more the quality of each image matters.

I generally use ArtStation, DeviantArt, Pinterest, Reddit, Twitter, Instagram, and Fancaps.net to find images for my datasets. ArtStation, DeviantArt and Pinterest are good for fanart, concept art, and just generally normal art, while Reddit sometimes has a few pieces unique to it too. Twitter also has a lot of art posted to it. Instagram has art posted to it as well, but I find that it is most useful for finding high-quality cosplay photos which often do not find their way to other websites. Fancaps.net is a good website to find high-quality screencaps of various shows and movies, but it doesn't have screencaps of every show.

For Pinterest you need an account for it to be usable, and for Instagram you need some extension or website that allows you to download the images posted on there. I just use this website.

I don't ever use autoscrapers as I find it important to search for and download images manually to make sure you have exactly what you want at the best quality possible.

You need at least 20 images for a concept (character, style, doesn't matter) to achieve adequate results, but going up to around 150 images can improve results further. Technically you can train on just 7 or even 1 image, but results will not be as good and there are other guides for that. Past 150 images I find that there is almost no difference in results.

How many images you need depends a lot on the concept you want to train. Some styles and characters are much easier to train than others. Introducing truly novel concepts to SD, like my (not-yet-updated to my new standards) morphing LoRa, takes a ton of images for good and consistent results. But that is unique to them, since SD has no real starting point with them. Also, generally the more realistic a style is, the easier I find it to train. So anime-adjacent styles are at the harder end of the spectrum and typically need more images. It also depends on your training images of course. For example my How To Train Your Dragon LoRa was hard to train and needed more images because there are very few shots of just one person or one building in it. Most images I have are chaotic shots of groups of people (and dragons). SD has a harder time working with images where a lot is going on in them than with concept art of a single character. So take that into account when choosing which images to use for training.

I would generally advise against adding more images just because. That just increases training time for no reason and can actually make the model worse, as every image you add makes the model more likely to output something related to your dataset, as opposed to taking the knowledge from SD and just applying the concept to it. E.g. if you include an image of a car in the Ghibli style in your dataset and that car is a vintage car (as is often the case in Ghibli movies), the model will now be more likely to output vintage cars. Whereas if you hadn't done that, the model would pick what SD knows as a car and just apply the Ghibli style to it. This isn't completely true however, as the model seems to create some associations of its own sometimes. E.g. Nausicaä does not have cars in it (obviously) and neither does the dataset because of that, and yet the model skews a bit towards vintage cars. Probably because SD picks up on it being a Ghibli-looking style and SD already has some associations with Ghibli in its model, including said vintage cars.

I recommend including as many different styles of your concepts as possible (obviously not relevant for style models lol), so for characters include cosplay photos, fanart, screencaps, etc... This will greatly increase the flexibility of the model in portraying your concept in a different style. I find that the more a concept leans towards an anime or cartoonish style, the more inflexible it will be without such images. For some reason it is extremely easy to portray a photographic celebrity in anime form, but not an anime character in photo form. Keep the ratios between the different styles roughly equal, as even if you include cosplay photos, if you have 20 times as many anime screencaps in your dataset it may still fail at style flexibility.

Also, I recommend including as many different POV's as possible, especially for characters, but styles too! E.g. facial closeups, medium-shots, and full-length body-shots! If you use only facial closeups for example, producing a full-length bodyshot will be hard.

Your dataset's composition should be consistent. This is something that will come up in the caption section as well, but I noticed that when it comes to datasets and captions, model training is like working with venn diagrams. The closer your images are in looks, the easier it is to train (and overtrain!), and the further apart they are, the harder it is to train (and overtrain!). SD is very good at picking up subtleties and seeing images as being closer together or further apart in concept based on small changes, like a closeup vs. a full-length shot. This isn't so relevant to styles, but you notice it quite a bit when training characters, and this is why it is hard to portray anime characters in a different style even when including cosplay photos: SD likes to think those cosplay photos are some other concept entirely, even if you captioned them appropriately. It is also why training a character on 20 anime screencaps will probably yield better likeness results than training it on 5 fanart pieces, 5 cosplay photos, and 10 screencaps. But the latter will be more flexible regarding style.

Generally I recommend having images of the following in your dataset if you are training styles, with the ratios between them mostly balanced:

  • animals, insects, monsters, creatures

  • landscapes (different day times)

  • exterior scenes (different day times)

  • indoor scenes (different day times)

  • objects

  • men - young (closeup, medium-shot, full-length)

  • women - young (closeup, medium-shot, full-length)

  • men - middle-age (closeup, medium-shot, full-length)

  • women - middle-age (closeup, medium-shot, full-length)

  • men - old (closeup, medium-shot, full-length)

  • women - old (closeup, medium-shot, full-length)

  • land vehicle, water vehicle, air vehicle

  • fire, smoke

  • water, underwater

  • plants, flowers

  • ground

  • sand, dirt, stone

  • eyes

  • sky

  • ruins, ruined, abandoned, overgrown, rusted

  • building

  • windows, transparent, glass, reflective

  • electronic screens

  • letters, paper, signs

  • + anything else that might be useful and unique and isn’t included above yet

Of course, you will almost never have a dataset that has images of each, with perfect ratios between them. This is just the ideal dataset you should strive towards to make your model as good as possible. You will never actually achieve it.

When you have accumulated the dataset, there is no need to crop the images to a particular resolution, as all modern trainers nowadays support aspect ratio bucketing, which will take care of this issue for you with no cropping whatsoever.

You should however make sure they are all at least 1024x1024 or an equivalent aspect-ratio resolution, since SDXL works at that resolution. If they are smaller, I recommend manually upscaling them (a small script to flag undersized images is sketched at the end of this section). Kohya can upscale automatically, but I find doing it manually results in better quality. I tested quite a few upscalers and found the best results were to be had with:

Also, as a finishing note:
DON'T USE REGULARIZATION IMAGES!

(aka images that depict a random concept, often generated from the model itself, that act to "preserve" the knowledge of the model so that it isn't overtrained)

So many people recommend them, but all they do is act as a learning dampener on your training, in which case you could also just decrease the LR, at no benefit. None of my models have been trained with regularization images and as you can see they work just fine. Regularization images are a waste of time.
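As mentioned above, images smaller than 1024x1024-equivalent should be upscaled manually. Here is a minimal sketch (my own illustrative helper, not one of my workflow files) that flags undersized images so you know which ones to upscale; it assumes Pillow is installed, and the folder path is a placeholder:

    from pathlib import Path
    from PIL import Image

    # Flag images below SDXL's 1024x1024-equivalent pixel area so you know
    # which ones to upscale manually before uploading the dataset.
    TARGET_PIXELS = 1024 * 1024
    dataset_folder = Path("path/to/your/dataset")  # placeholder path

    for image_path in dataset_folder.iterdir():
        if image_path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
            continue
        with Image.open(image_path) as img:
            width, height = img.size
        if width * height < TARGET_PIXELS:
            print(f"{image_path.name}: {width}x{height} is too small")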


Captions

One might think captions cannot be that important and that you can just caption the image in some general way that makes sense, but captions can make or break your training, and SD picks up on subtle differences in your captions very strongly.

I used to subscribe to the idea that you want your captions as detailed as possible. But I have learned now that the best way is actually to describe as little as possible.

For a long time I had the problem that SD was too good at picking up the captions, and would often during inference give me back images that were similar to my training images. Now you will scream that this is overtraining, but dampening training would just result in worse likeness, while still having said issue.

And I do not recommend training without captions.

So instead I am now describing only the most important things. As much as is needed to get the likeness I want, and that's it. So for example I no longer describe POVs (headshot, medium-shot, full-body), or styles (unless it's a style LoRa ofc), or backgrounds. I found that if I do that, prompting those things will pull up something similar to the training images. However, if I do not describe them, SD will fill in the blanks, and if your dataset is diverse enough it won't be an issue anyway. Say you have a closeup, a half-body, and a full-body shot of your character and you don't caption that: SD will still pick up on the character being portrayable in all 3 of those shots, since the dataset includes them.
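To make the difference concrete, here is a purely hypothetical example (the tokens are reused from elsewhere in this guide and the image itself is made up). In my old, detailed style, a caption for a character image might have looked like:

    8k closeup photo of femme ewpp smiling, standing in a library in front of a bookshelf, wearing xxcc outfit

In my new, minimal style, the same image would just be captioned as:

    8k femme ewpp wearing xxcc outfit

dropping the POV, the style, the expression, and the background, and keeping only the quality tag, the trigger tokens, and the outfit token.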

I found that for styles, going as simple as just "anime screencap in lkst artstyle of a scene" is already enough to get likeness, if you have enough images. And calling back to the venn diagram theory, you now have all those diverse images inside the same venn diagram, "a scene", making training easier while also decreasing bias, since "a scene" has a lot of diverse images behind it. Meanwhile, with my old caption method each image would sorta become its own venn diagram, thus being biased towards that image.

Also, the more abstract or surreal a concept is the less you should describe about it, or even just straight up caption it as "abstract" or "surreal".

However you still want to describe everything that you:

  • Want to be able to specifically prompt for later

  • Do not want to be an inherent part of a generation unless specifically prompted for

So that includes mostly watermarks, text, outfits and hairstyles. I also add '8k' and 'poorly drawn/shot' for various quality levels of the images, but that has little effect on training I feel like and is more of an internal division for me. I use those terms instead of high-quality or low-quality, because during inference I found that those terms actually result in higher or lower quality images more consistently than the latter terms (which is also why they are in the recommended positive and negative prompts on my model pages).

I also added specific words to each style caption, like "viking" for HTTYD, or "asian" for Korra, to make sure the model is not getting too biased towards viking looking output for HTTYD or asian looking output for Korra. I never really tested it, and I don't think it actually works, so I don't recommend doing it, but I am just writing it here in case you are wondering why I have that in my captions.

I caption my concepts using a combination of a 'rare' token and a 'known' token. These are the trigger words which should appear in every image and are used to trigger your LoRa. This is a good way to contain the concept you are training to those words, and if they aren't being used the LoRa has essentially no effect.

Just think of tokens as words and letters that SD uses internally to understand your prompt. Rare in this case means a letter or word or combination thereof that does not already have an associated meaning in SD. E.g. if you prompt 'car' in SD, you will get images of cars. If you prompt 'painting', you will get paintings. A rare token like 'ewpp' meanwhile will just spew out any random incoherent mess of an image. I use rare tokens because I found that using tokens with already established meanings in SD negatively affects my training. E.g. when I captioned my (no longer available) Nausicaä character LoRa with 'nausicaa' I was completely unable to portray her in a photographic style, because the nausicaa token inside SD is heavily overtrained on a cartoonish style. Meanwhile, using the 'ncpp' token helped a lot with combatting that. However, as a trade-off you start off with no knowledge base to draw from during training when using rare tokens, and I found that during inference the model can often still struggle to correctly identify your concept. E.g. when I used only "ewpp" for my Emma Watson model, it would at times output a random car in her place, or a man.

This is why I now pair my rare tokens with a known token. E.g. "femme ewpp" in the case of Emma Watson, or "anime screencap in lkst artstyle" for the Legend of Korra artstyle. This seems to work well and covers both positives. I found that using femme instead of woman for my female celebrity and character models resulted in better likeness. My theory here is that woman already has such a strong and diverse meaning in SD that your character's likeness has more to fight against, while femme, while still being associated with women, is much more malleable. Keep this in mind when trying to find optimal tokens to use during training.

The position of the rare token inside your caption matters a lot. Putting the rare token "kncr" at the beginning of the caption, like "kncr, a woman with purple hair wearing futuristic armor", will mean that the token is associated with the entire image and will be diluted to the point of not working correctly. Putting it next to the outfit instead, like "a woman wearing kncr outfit" will instead associate that token with the entire look of the character, including the hair, and work much better. If you instead caption it as "a woman with purple hair wearing kncr outfit" it will associate the rare token with the outfit - minus the hairstyle. This will enable you to prompt the outfit without the hairstyle alongside it - if you can also provide some counter examples of the outfit without the hairstyle. Likewise, captioning it as "a woman with kncr hairstyle wearing futuristic armor" will have the opposite effect and associate the token with the hairstyle, not the outfit.

I also do not recommend tag style captions like Danbooru uses. SD was trained on natural language captions, and I have myself found results to be better with those.

Lastly, you want to caption all irregularities within an outfit or person. Say you have an outfit that generally has no helmet, but in this shot the character is wearing a helmet, so you simply caption it as “wearing xxcc outfit with a helmet”. Also, I do not assign tokens to generic one-time outfits, like say a piece of fanart of an anime character wearing a tshirt.

I use filenames as my captions, as I find them easier to edit using "Bulk File Rename" (a paid tool for Windows 10). But the Kohya trainer wants captions in the .txt file format. So I use a simple Python script to convert the filenames to .txt files (I had ChatGPT generate that script for me). You can find it here.
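For reference, here is a minimal sketch of what such a script boils down to (my actual script, linked above, may differ in details; this sketch assumes it is run inside the dataset folder and simply writes each image's filename, minus the extension, into a matching .txt caption file):

    import os

    # For every image in the current folder, create a .txt caption file
    # containing the filename (without its extension) as the caption text.
    IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

    for name in os.listdir("."):
        base, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTENSIONS:
            with open(base + ".txt", "w", encoding="utf-8") as caption_file:
                caption_file.write(base)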

Here are some screenshots of some of my caption folders for you to use as a guideline:

https://imgur.com/a/H2XJhjB


Vast.ai

I always rent GPU's on Vast.ai (I am not being paid by them to advertise them, god I wish) for training. This is because I only have a 3070 8gb at home, and because I want to use my PC for things other than training. If I were to buy a 4090 now and use my PC for training, well, I couldn't do much else while training is going on. And since I train almost non-stop, well...

Renting GPUs also allows me more flexibility. E.g. I recently switched from straight LoRa training, which I could do on just a 3090, to full-finetuning, which requires 48gb VRAM at the very least.

This imgur link contains numbered images showcasing the Vast.ai and Jupyter-notebook interfaces and various steps to take to set the GPU up.

Keep in mind that you can also use Runpod.io as an alternative to Vast.ai, and much of the same principles described here apply there as well.

I will explain each step here:

  1. This is what you see when you go to Vast.ai. Click on the 'Console' button in the top right corner.

  2. You will be greeted by this screen. First go to "Billing" on the left and then add a credit card and load up some credits. If you have no credit card that is no problem. I am myself using a virtual credit card called 'Revolut'. It basically works like Paypal. You connect your bank account to it and verify yourself using some online service, and then you do a bank transfer from your bank account to Revolut. There are also tutorials about it on YouTube. Once you have credits, go back to the 'Search' screen (on the left).

  3. This bar is how you sort for available GPU's. #GPU's is self-explanatory, On-Demand you don't want to change or else your instance can be interrupted at any point, A100 SMX4 is the GPU currently selected, Planet Earth means it searches for GPU's around the globe, Price means you are sorting by price (increasing).

  4. Before you select a GPU to rent, you must first set up the instance configuration. You do that in the top left corner. Select how much disk space to rent, then click on "Edit Image and Configuration".

  5. Go to the recommended tab, and click on the "edit" button of the "Kohya_ss" template at the bottom of the page.

  6. Select the most up-to-date version in the dropdown menu, and leave everything else as default. This template will load a docker image that contains the GUI version of Kohya. Only add -p 6006:6006 if you later want to use tensorboard, but this guide won't use it.

  7. This should already be set by default, but in case it isn't, set it up like that. Once you are done, click on "select and save" at the bottom of the page. You can later just use this saved template from the "recent" tab.

  8. Now rent your GPU in the "search" screen and wait until it is set up and looks like this. Keep in mind that there are two A100 versions. The cheaper ones have only 40gb VRAM.

  9. Click on the small orange button at the bottom right corner to connect to the Jupyter interface.

  10. This is the screen you will be greeted by.

  11. Go into the Kohya_ss folder.

  12. Go into the dataset folder and upload your training images. Once that is done, click on "New" in the top right and create a new folder called "pretrained". This is where we will download the model that we will train on. For the training itself you could also just load the diffusers version from Huggingface, but we will need the model locally later anyway because we need it to extract the LoRa. So make sure you have the base SDXL model uploaded somewhere (faster) or alternatively just drag and drop it into this folder (much slower). If you do it the latter way you can just ignore steps 13-15.

  13. Once inside the pretrained folder, right-click and create a new "Notebook" file or...

  14. ... do so by the method previously described.

  15. Insert the following code into a cell and run it:

    import urllib.request

    # Download the base model into the current ("pretrained") folder
    url = 'YOURLINKHERE'
    filename = 'sd_xl_base_1.0_0.9vae.safetensors'
    urllib.request.urlretrieve(url, filename)

Replace YOURLINKHERE with the link to your model that you have uploaded somewhere. And replace filename with the filename of the file. Or alternatively just use my link. I have uploaded the SDXL 1.0 with 0.9 vae model to my dropbox.

  1. Now open a terminal window using the method already described.

  2. As described in the caption section, the filenames still need to be turned into .txt files for the trainer to work with. So we use my script, also linked in the caption section, to do so. Upload that script into the same folder where you uploaded your dataset to. Then run these commands:

cd kohya_ss/dataset/yourdatasetfolderhere

python create_txt_from_images.py

(cd ... is the command to go into a folder)

(python ... is the command to execute a python file and the stuff behind that is the filename)

  1. Open another terminal window, and execute the following command:

cd kohya_ss

source ./venv/bin/activate

pip install scipy

./gui.sh --share --headless

This will activate the venv, install a missing dependency (scipy), and start the GUI. Once the command has fully executed there will be two links in the terminal window. One is the link for the local GUI (not needed), the second is the one to access the GUI remotely, which is the one we need. Just click on it to open the GUI in a new tab.

DON'T EVER CLOSE THIS TERMINAL WINDOW

  1. This is how you download your model file once it's done. Just right-click and click on download.

  1. (I fucked the numbering up) If you ever need to delete something, do it this way AND NO OTHER WAY. Just right-clicking a file or folder and clicking on delete will not actually delete anything. To truly delete something you need to open a terminal window and execute this command:

cd your path (must be 1 subfolder above the folder you want to delete)

rm -rf insertfileorfoldertodeletehere

I like to just create a new folder called 1, put everything I want to delete in there, and delete that folder (so 'rm -rf 1').

  1. Once you have opened the GUI you will be greeted by this screen. Go to the finetuning tab.

  2. In the source model tab, set it up as shown in the screenshot to load the locally uploaded SD model for training, or...

  3. Just select the huggingface model. It will be downloaded once training starts.

  4. The folder tab should be self-explanatory. Config, output, and logging folders are where the model related stuff is saved to (I have it all be the same folder), training images is the folder you load your images from, and model output name is the filename your final model file will have.

  5. In the dataset preparation tab just set it up like I have (this should mostly be the default settings anyway). This just sets up the thing to turn your training images into latent files for training, as well as load the captions. Just have the resolution be 1024x1024 as that is SDXL's resolution, have the min and max bucket resolutions be as shown to cover every possible range, then the rest should just be left at default.

  6. In the parameters tab, if you want to use one of my two configs, you can just load it here as a preset. But to do so you first need to upload my config file to /workspace/kohya_ss/presets/finetune/user_presets/ BEFORE starting up the GUI. Otherwise just set the parameters manually.

The two config files can be found here and here.

  1. This will be explained in detail in the next section of the guide.

  2. This will be explained in detail in the next section of the guide. If you want to use samples, you can set them up in the next tab. How to do that is explained there.

  3. Self-explanatory.

  4. Once the full-finetune training is finished we need to extract the LoRa. My given full-finetune parameters should be good enough for almost all datasets and concepts, so you do not need to change the dim values here (nor in the resizing step). If you used other finetuning parameters you may need to adjust these values. Otherwise just set the folder paths (with the base model path being our uploaded model from step 15), leave dim at 512/512, check SDXL obviously, leave clamp quantile at 1, save precision I leave at fp16 (during my testing there was no functional difference between the three values), and add four 0's to the minimum difference to make certain that the text encoder is also being extracted, as the default value here is almost always too high and will result in no text encoder being extracted. Then just click on the button at the bottom and wait (usually takes around 5-10min).

  5. Set up the folder paths with the source LoRa being the LoRa we just extracted, the desired LoRa rank at 16 (again, using my full-finetune parameters this should work splendidly, otherwise you may need to experiment with this value, with higher values being more "trained" but a higher filesize), dynamic method being none (I found no difference between the methods, so I just stick with none, as more variables can introduce more problems), save precision again doesn't matter (just use fp16), and the dynamic parameter left at the default.
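As a quick reference, the extraction and resize settings from the two steps above boil down to the following (written as plain Python dictionaries with descriptive labels, not the exact field names in the Kohya GUI; follow the screenshots if anything differs):

    # Step 4: extract a LoRa from the full finetune
    extract_settings = {
        "sdxl": True,
        "dim": 512,                # leave at 512/512
        "clamp_quantile": 1,
        "save_precision": "fp16",  # no functional difference between the options in my testing
        "minimum_difference": "default with four extra 0's appended",  # so the text encoder gets extracted
    }

    # Step 5: resize the extracted LoRa down to the final ~100mb file
    resize_settings = {
        "desired_rank": 16,
        "dynamic_method": "none",
        "save_precision": "fp16",
        "dynamic_parameter": "default",
    }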

IMPORTANT: If you use my A100 80gb config, it will use EMA. EMA increases VRAM consumption by a ton (hence the A100) while providing a minor benefit to backgrounds (so it's only for perfectionists like me). But in order to use it you need to replace three files in your folder structure, as the default Kohya repository does not support it yet. Be sure to make a backup of each file beforehand in case you get errors or want to switch back to the non-EMA version of training!

Put this into the main kohya_ss folder.

Put this and this into the library subfolder.

Then just add --enable_ema --ema_decay=0.9995 to the additional parameters section as per my example config.


Training parameters

This section will explain most of the training parameters found in image 28 and 29 of this imgur link.

I will not explain the theory behind them or write large paragraphs about each, but rather just simply describe what effect I found they have on training and what you should set them to.

If you are not using my provided config, use this section to guide yourself towards creating your own config.

The two config files can be found here and here. The A100 80gb config has slightly better quality results than the A6000 config file, but the difference is only noticeable to perfectionists. Stick to the A6000 config for much more cost-effective training (also, A6000s are much more readily available than A100s) with almost the same quality.
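For orientation before the item-by-item explanations, these are the core values the section below arrives at (descriptive labels in a plain Python dictionary, not the exact Kohya field names; my linked config files are the authoritative source, and note the changelog at the top for the most recent changes regarding repeats, batch size, and Min SNR gamma):

    # Condensed quick reference of the settings discussed item by item below
    core_settings = {
        "optimizer": "AdamW",
        "learning_rate": 1e-6,
        "lr_scheduler": "constant",   # no warmup
        "max_train_epoch": 100,
        "mixed_precision": "fp16",
        "save_precision": "fp16",
        "train_text_encoder": True,
        "no_half_vae": True,
        "cache_latents": False,       # keeps random crop (crop jitter) available
        "max_token_length": 75,
    }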

  • Train batch size

If set higher than 1, this trains multiple images at once, which increases VRAM consumption considerably. It decreases the it/s, but since each step covers more images it ultimately results in a net training speed increase. However, apart from the cost in VRAM, I found that this also dampens learning by quite a bit, so if you set this too high you will undertrain and need to increase other parameters to compensate (such as the learning rate). I find that any values higher than 3 result in only very, very small gains in training speed. 3 is an optimum between speed increase, VRAM consumption, and learning dampening.

  • Epoch

This is bugged. Don't use it. Use the next one, Max Train Epoch, instead.

  • Max Train Epoch

For how many epochs you will train. Use this instead of steps, as an epoch is always a full run through the dataset while steps aren't (usually) and several other parameters use epochs instead of steps.

I always train to full 100 epochs and use other parameters to affect how the model behaves. There is also a weird bug where past 100 epochs the model barely trains no matter what. Very weird. Hence I just stick to 100.

  • Max Train Steps

Don't use it.

  • Save Every N Epochs

How often a model file should be saved. Be sure you have enough disk space for this. You can leave it at 100 to only save the last model file (if you train for 100 epochs that is).

  • Caption extension

.txt obviously.

  • Mixed precision and save precision

This is how "accurate" the training and saving is being done. Theoretically higher values should be better here in terms of quality, but I found FP16 to be better than both BF16 and FP32/float. The save precision value used also matters for how large the model file is at the end. Also I think FP32 used more VRAM but I don't quite remember anymore. I find that FP16 produces the best results for both here.

  • Number of CPU threads

Leave it at default. It doesn't have any noticeable effect.

  • Seed

Input a fixed seed if you want to compare different model training runs, otherwise leave it at random.

  • Cache latents (to disk)

Means it will save the latents to be reused for another training run so that they don't have to be regenerated. Leave it off, as with it on you cannot use random cropping (more on that later).

  • LR scheduler

Cosine means the LR will follow a curve and slowly decrease. Constant means it's just a straight line. Polynomial and such are basically just more advanced versions of the above. I find that cosine results are equivalent to just using a lower LR at constant, so just stick with constant. If you feel the need to use cosine, you can also just decrease the LR altogether and results will be the same. The same holds true for any of the restart schedulers.

  • Optimizer

This is gonna be a controversial one, but I feel like my model results speak for themselves.

Don't use any of the adaptive optimizers. Contrary to what is often claimed they don't work. Per theory they should adapt during the run and increase and decrease whenever needed. However if you just do the most minimal amount of testing with a tensorboard graph you will notice that the only thing these optimizers ever do is increase the LR constantly until the training run is finished. They never decrease it. Only if you use a cosine scheduler, but that decrease then just naturally comes from said cosine scheduler, not from any adapting done on the part of the adaptive optimizers.

So basically, the adaptive optimizers are equivalent to using a non-adaptive optimizer and just constantly increasing the LR every few epochs. And this is bad. Because it leads to model degradation and overtraining as the LR becomes too high.

Also adaptive optimizers are very VRAM hungry.

As for the standard non-adaptive optimizers, AdamW produces slightly better results than AdamW8bit but uses more VRAM. And AdamW8bit produces slightly better results than PagedAdamW8bit but uses more VRAM. But more importantly, PagedAdamW8bit is slow as fuck, so don't use it. Like twice as slow.

Lion trains extremely intensely and will very quickly overtrain. I never found it useful unfortunately.

As for the other optimizers I haven't tried them but nobody seems to be using them anyway (I think they are all just more versions of adaptive optimizers).

  • Learning rate

I found that any values (yes, even just slightly) higher than 1e-6 (0.000001) result in severe model degradation. Like worse than overtraining. Not sure why that is. But you shouldn't ever need a value higher than 1e-6 anyway, so just stick to that. Similarly, lower values just undertrain.

Generally, 1e-6 with AdamW(8bit) on a constant scheduler works best imho.

  • LR scheduler extra arguments and Optimizer extra arguments

This is just for those schedulers and optimizers that need it. Since I only use AdamW on constant it isn't relevant to me. If you use any other optimizer or scheduler you have to do your own research on what to input here.

  • LR warmup

Just like cosine this one is just the equivalent of using a constant LR at a lower value. So just don't use it.

  • Cache text encoder outputs

I don't recommend turning this on unless you seriously want to save VRAM and don't care what impacts it might have.

  • No half VAE

This fixes a bug and needs to be turned on for all model training.

  • Dataset repeats

How often your training is being repeated, essentially. So a 2 here means I basically train for 200 epochs instead of 100. However, I have found that 2 repeats at 100 epochs produces different results than 200 epochs. Probably because of the aforementioned bug with training past 100 epochs. So it is not a 1 to 1 comparison. I found that putting it at 2 and using a batch size of 3 (gradient accumulation technically, but we will get to that soon) and 100 epochs produces good results. Hence it's at 2 here.

  • Train text encoder

I don't recommend ever turning this off. Results are just so much worse without it.

  • Gradient accumulation

This basically works like batch size, but uses slightly less VRAM. It also seems to dampen training a bit more than batch size. I prefer using this over batch size.

  • Block LR

Complicated advanced stuff I don't use but other people are experimenting with, so I cannot comment on it and unless you know what you are doing, leave it empty.

  • Additional parameters

Any additional parameters the model may know but aren't listed here, such as the EMA ones I introduced by changing the files. For more info on EMA see the end of the Vast.ai section.

  • Save ... steps

Like Save every N epochs but for steps.

  • Keep n tokens

I think this is for when you use more than 75 tokens in your captions? Not sure. I leave it off.

  • Clip skip

Only needed when training on a model with a non-default clip skip value, such as the leaked NovelAI model or derivatives thereof in 1.5 SD. Leave it at 1 unless you know what you are doing.

  • Max token length

I leave it at 75 as my captions are short. If you have .txt files with a bazillion tags, you may want to increase this. But I cannot comment on the efficiency of this as I don't train that way.

  • Full fp16 training and full bf16 training

The former needs the optimizer to be set to AdamW8bit or else you get NaN errors, the latter can use AdamW and maybe any? Not sure. Didn't test that extensively. Either way both reduce VRAM usage considerably and make training a bit faster, but both have horrible quality downsides so I don't recommend using them at all. If you use them, make sure your mixed and save precisions are set at fp16 or bf16 respectively.

  • Gradient checkpointing

Saves a ton of VRAM, but makes training slightly slower and makes quality slightly worse. I try to avoid using it unless I have to.

  • Shuffle caption

If you use captions with comma separated tags this will shuffle the tags around. Haven't noticed a huge difference there when I once tried tag style captions.

  • Persistent data loader and Memory Efficient Attention

No idea what that does. Just leave it off.

  • CrossAttention

Basically works like gradient checkpointing: it saves VRAM at a slight cost in quality. I'd rather use this than gradient checkpointing, even though it saves less VRAM, because it has no speed decrease. Xformers is for Nvidia GPU's, the other one for AMD I think.

  • Color and flip augmentation

Slightly changes the color of your training images or flips them around, resulting in essentially new images (for the model). This is basically an artificial means of increasing the dataset without overtraining, as the images are slightly changed and thus "new". Generally I don't recommend such artificial means of increasing dataset sizes.

  • Min SNR gamma

I tested this a lot and it seemed to make results slightly worse, so I used to just leave it off. However, as noted in the changelog at the top, I have since found that it does noticeably increase likeness and now recommend using it (at 5).

  • Don't upscale bucket resolution

If left unchecked will upscale images that are smaller than the training resolution (in this case 1024x1024). But as previously described you should upscale them manually.

  • Bucket resolution steps

Leave it at this default.

  • Random crop instead of center crop

If latents caching is unchecked and aspect ratio bucketing is used (which it should always) this will add "crop jitter" which is another artificial dataset augmentation where images that are bigger than the training resolution are randomly cropped to the training resolution. This is randomized every step. Unlike the other augmentations this one seems fine. I can't say I have experienced any different results from using it, but it certainly didn't make things worse.

  • V pred like loss

Some experimental stuff where bmaltais told me he tested it and it produced bad results. So don't use it.

  • min timestep and max timestep

I tried experimenting with this but anything other than the default values results in worse results so leave it at the defaults.

  • Noise offset, Noise offset type, Adaptive noise scale

When enabled will make the lighting much more accurate, but I found that it makes overall results much worse. Stability AI seems to have come to the same conclusion as after testing it they decided to train the final SDXL version without (hence the noise offset LoRa released by them).

  • dropout caption every n epochs

Will train without any captions every n epochs. As I find training without them worse, I don't use it.

  • Rate of caption dropout

Same as above, but a percentage value applied every epoch. E.g. 0.05 means 5% of images (randomized) will be trained without a caption each epoch.

  • VAE Batch size

No idea, never tested it.

  • Save training state

For better resumption of training. But I find that resuming training doesn't ever work well anyway, so I don't use it (it also costs more in storage space I think).

  • Max num workers for dataloader

Same as number of CPU threads, leave it be.

  • WANDB API key and logging

If you want to use the website Weights & Biases during training. Basically a better tensorboard.

  • Scale v predicition loss

Ignore. Only relevant for 2.0 models.


Evaluating your model and setting up your CivitAI model page correctly

I generally recommend prompting one or multiple of your captions 1 to 1. If it outputs images similar to your training images in pose, style, likeness, etc… it very likely indicates overtraining. In addition to that, I recommend to also prompt for generic stuff like people, objects, landscapes, etc… and to do so in batches of 4. Also prompt for stuff that is similar to the stuff depicted in your training images.

Many people seem to misunderstand what undertraining looks like. E.g. when they are unable to portray a character in a different style, they assume it is overtraining on the main style. However, this can actually also be a sign of undertraining: the unet has already learned the style likeness, but the text encoder has not yet associated the style with your style token, so it just assumes the style to be part of your character, as most of the training images are associated with that style. I found that later epochs or trainings with higher learning rates can actually be more flexible then. This is of course only true up until the point where it starts overtraining.

Regarding overtraining I find that it can also actually reduce likeness as it wildly jumps around different points of the training instead of finding a sweetspot.

Regarding your CivitAI page I find that it is very important to always mention the generation parameters you used for the sample generation and which you find best to use with your model. You should also always say what the trigger word is, even if it is already mentioned in the sidebar, as many people don't look there.

I like to give a short model description and put it at the top so that if the model link is posted somewhere the thumbnail (like on Discord) shows the model description first.

I have a section "further usage notes" in case a model has specific problems that should be addressed. Like a bias towards a certain thing.

I restate the usage permissions, as I find that the small buttons that explain them are easily overlooked and not clear enough.

I also like to give a TL;DR of my training workflow and a changelog of what has changed in the latest update to the model. I do however prune this whenever there is a major update.


Additional questions

  • Why I don't train on 1.5

I find SDXL to always train much closer to my training images than 1.5. It also has a bunch of other improvements over 1.5. Hence I abandoned 1.5 completely now and stick to SDXL.

  • Why I don't train on finetunes

I find base SDXL to be perfectly adequate for training. I do not want the potential inherent biases or issues of another finetune to negatively impact my training.

  • Loss is useless

I find the loss graphs in the tensorboard to not correlate with how training is going at all. Ever. Test your epochs. Ignore the loss.

  • Where are the different parameters for styles and outfits and...

You do not need different parameters for different concepts. Styles, outfits, characters, etc... all train the same. The differences come from the individual concepts themselves (e.g. training Emma Watson vs. training Joe Biden), not from what "type" they belong to.


Ending notes

Congratulations! Assuming you used my config and followed my caption and dataset advice, you should now have a model that is only 100mb big, has great flexibility, little bias, and great likeness and can be attached to any other full-finetune!

At the very least you should have the knowledge now to develop your own training config.

