This is an overview of my process for training LoRA models for furry characters. Some things to note:
This guide is a couple months old at the time of writing, so some parts may be out of date.
I haven’t really tried training concepts or art styles, so this may not be applicable for those.
I don’t intend for this to be a comprehensive training guide, so for the technical stuff you’ll be on your own to figure out what to install and how to use it. I have no experience with Google Colab; I do everything on my local machine.
I use more or less the same tools that were available around the time I first started training models. I’m sure newer/better/easier tooling has emerged since then, like the Kohya GUI, but I don’t have any experience with it so you may need to adapt these steps to your particular setup.
So far the following steps have been “good enough” for me, but I’m sure others can point out ways I could improve the process or my training settings.
Step 1: Character and material research
The first thing I do is simply learn more about a character with some quick research, even if I think I’m already familiar with the character. What is their canonical appearance? Are they commonly depicted in different forms? What are the individual components of their typical outfit? Are there weapons/accessories/etc associated with them? In what kinds of environments are they typically found? Do they have any unique markings or special anatomy that I need to pay attention to? I write down the most important findings in a text editor and save it as info.txt, which eventually becomes the info file I include with my completed models.
I then jump onto e621.net and search for the character to see what I’ll be working with in terms of training material. This helps me figure out the tags and settings I’ll use to download the dataset in the next step. It also allows me to cross-reference my earlier findings and get a feel for how the artistic interpretations may differ from canon. Because we’re working with NSFW material, and because most mainstream characters have no official NSFW depiction, the nude appearance often varies quite a bit from image to image. I’ll make note of any important deviations in my info file.
When searching I typically start with a query that sorts by score and excludes animated and monochrome posts, e.g. <character tag> order:score -animated -monochrome. If possible I also include the solo tag since working with solo images makes the whole process easier and cleaner. For characters that don’t have enough quality solo pics I will try using the tags ~solo and ~solo_focus to get more results. Ultimately what I’m trying to do is build a search query that primarily returns images where the character is visually isolated, unobscured, and has simple poses.
Depending on the character and how flexible I want the model to be, I like to see my query return 3 or more pages of results (200+ posts). I also look at the last few pages of results to determine an appropriate minimum score threshold that I can use in the next step to avoid downloading low-quality images that I would end up pruning from the dataset anyway.
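Putting those pieces together, a finished search for a character like Krystal might look something like this (the tags and score cutoff here are only an example and will differ per character):

    krystal ~solo ~solo_focus order:score -animated -monochrome score:>=30

The score:>=30 part lets me preview on the site roughly what will survive the minimum score threshold I’ll set in the downloader during the next step.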
Step 2: Downloading the dataset
Using the query I formulated in the previous step, I use pika__’s e621 downloader to obtain the dataset. You’ll need to read through the documentation there to understand how to install, configure, and run the downloader. Alternatively you can try <>(CK)<>’s dataset curation tool webui which is built on top of the downloader and includes some nice features like tag editing and auto-tagging. Either way, the resulting dataset will be a few directories of images along with caption files containing the comma-separated e621 tags for each image.
A non-comprehensive list of my typical downloader settings:
required_tags
This is the ‘positive’ part of the search query. Tags should be separated by commas (e.g. <character name>, solo) to form a single tag group. Multiple tag groups are separated by | and can be used to fetch images that match any group - e.g. krystal, solo | krystal, solo_focus would match images that have the krystal tag and either solo or solo_focus. This is the equivalent of searching krystal ~solo ~solo_focus on e621.net.
blacklist
This is the ‘negative’ part of the search query. I usually keep all the default blacklist tags in place and just add monochrome and any other ‘negative’ tags from my query.
min_score
This is the minimum score threshold I determined in the last step. It will vary depending on the character and how much pruning I want to do, but for me it’s usually between 30 and 100.
min_area
The minimum image pixel count; images smaller than this will not be downloaded. The default is 262144, which is 512x512. Recently I’ve been training at 768x768, so I have this set to 589824.
sort_by
How posts should be sorted; I leave this as score_desc.
top_n
The maximum number of images to download, as sorted by sort_by. Again, this may vary depending on the character and how much pruning I want to do; I commonly set this between 150 and 300.
prepend_tags and append_tags
Any tags I want to add to the start or end of the caption files go here. I usually leave these blank since I will be editing tags in a later step.
replace_underscores
This replaces underscores with spaces in the downloaded tags. I set this to true since I train on base models that don’t use underscores.
max_short_side
The maximum size of an image’s shortest side. Anything larger than this will be resized down to the value specified here. Resizing saves space and might improve training times depending on the training settings. The default is 768, but because I do a lot of cropping and editing of images I set this to 2048 to help retain detail.
delete_original
Deletes the original image after resizing. I set this to true because it saves space and it’s fairly trivial to download the original(s) again if necessary.
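To tie these together, here is a rough sketch of what those settings might look like for the Krystal example from earlier. The exact file format and option names depend on the downloader version you’re using, so treat this as an illustration rather than a drop-in config (the <defaults> placeholder stands in for the downloader’s default blacklist):

    required_tags       = krystal, solo | krystal, solo_focus
    blacklist           = <defaults>, monochrome
    min_score           = 30
    min_area            = 589824
    sort_by             = score_desc
    top_n               = 200
    prepend_tags        =
    append_tags         =
    replace_underscores = true
    max_short_side      = 2048
    delete_original     = true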
Step 3: Pruning the dataset
Now that I have the images and caption files downloaded, it’s time to remove the ones that aren’t a good fit. This can be approached in different ways, but I just consolidate all the files into a single folder and use a file explorer to preview the images one at a time, deleting the image along with its caption file if it doesn’t meet my criteria.
Here is what I look for:
Character likeness - The most beautifully crafted masterpiece is worthless if it doesn’t properly portray what I want to train. I’ve thrown away many highly-scored images due to the character losing its…well…character…within an artist’s heavy-handed style.
Clothing and accessories - I look for images that cleanly portray any part of the character’s typical outfit, instruments, or accessories, as well as any other visual aspects I had previously noted in my info file. Each of these should be present in at least one quarter of the final dataset, and preferably half. Conversely, if I want the model to be adept at generating the character without the outfit/accessories, then they should be missing from at least one quarter of the images (again, preferably half).
Visual isolation - As mentioned in a previous step, I want the character to be visually isolated and unobscured. I avoid objects or other characters that are interacting with or overlapping the character I am trying to train, but in some cases it may be worth the effort to manually crop, blur, or otherwise edit out those elements. Some images portray the character in multiple poses, and these can often be cropped out into separate images.
Simple poses - Similar to visual isolation, I avoid images where the character is significantly obscured by its own body, or is twisted into a convoluted mess. Things like foot-fetish shots where the character’s foot covers most of the frame and only half of their face is visible, or autofellatio where the character is folded in half and the legs are behind the arms. Simpler poses such as standing, sitting, squatting, all fours, lying on back/side/front, etc are all generally acceptable. Even crossed/folded arms or legs are fine as long as they are kept to a minimum.
Variety - I want some variety in art style, backgrounds, poses, facial expressions, etc. If all I train on is 3d expressionless frontal t-poses on a pink background, it’ll be hard to convince the model to generate anything other than 3d expressionless frontal t-poses on a pink background.
Headshots - The face is usually the most recognizable part of a character and is therefore critical for training, so I want a good number of headshots or one-quarter portraits in the dataset. Typically only a small handful, if any, of the original images will be headshots, so I keep an eye out for high-quality images that can be duplicated and cropped to create these headshots. Sometimes images that don’t meet my other criteria can instead be cropped and used as headshots.
Visual quality - All else being equal, I want images with a high degree of visual fidelity. Blurry brush strokes and sloppy sketches make for blurry and sloppy renders. Crisp lines and refined shading are king. 3D artwork is also great for adding a boost of sharpness and consistency as long as it fits the rest of the criteria and I don’t include too many.
At the end of this process I aim to have 50-100 images remaining, possibly more if it’s a character I really care about. As long as they are all decent quality and there is sufficient variety, more images will yield a more flexible model, but it also means there are more images that need editing and tagging. Note that it is possible to train a model with fewer images, perhaps as few as a couple dozen, but flexibility will take a hit.
Step 4: Image prep
Now that I’ve pruned my dataset, it’s time to prep the images. This involves editing the images with three objectives: composition, clarity, and consistency. Using an image editor (I use Affinity Photo, but most anything should work) I go through each image and make selective edits for each objective. Sometimes I perform all the edits on a single image in one go before moving on to the next image. Other times I perform the edits in phases, editing all the images for one objective before moving on to the next objective.
Composition
This primarily involves cropping the image to leave only a small margin around the character, effectively increasing the percentage of pixels that represent the character. This improves fidelity when training. Cropping is usually the easiest part and arguably the most important. I don’t always make edits for clarity and consistency, but I do always crop the images. During this phase I will also identify, duplicate, and crop images appropriate for headshots. This is also the time when I break out multi-pose images into individual images.
Because the training process allows for aspect ratio bucketing, it is not necessary to crop images to a 1:1 square aspect ratio (although I often do that for headshots), but be aware that very tall or very wide images might be automatically excluded during training if they do not fit within the allocated buckets.
It is important to update the caption files as appropriate when cropping images, deleting any tags that reference characters, objects, or elements that have been removed. This can be done at a later step, but I find it’s easiest to make the changes right after editing the image while I still have full context.
Clarity
Going back to visual isolation, the objective of clarity is to remove distracting or unrelated elements from the image to improve the signal-to-noise ratio. Careful cropping can accomplish this to an extent, but it can’t catch everything. A different character’s elbow intruding on the side, an artist’s signature or watermark nested in the corner, frequent pictographics or emanata, text or dialog or speech bubbles…none of these are elements I want to train alongside the character. Using either my image editor or inpainting in SD, I will edit these elements out of the image if I can do so without compromising important parts of the character. In some cases I will even remove or replace the entire background. And, of course, I will update the caption file to reflect any changes.
Please note that I don’t always make exhaustive changes for clarity for every model I train; this full treatment is usually reserved for the characters I care about the most. For the other models I typically identify and fix only the most egregious offenders, such as an artist logo found in the same location across multiple images.
Consistency
Most of the images in the dataset are the result of some amount of artistic interpretation. While this is generally a good thing that introduces desired variety, it also leads to inconsistencies that I might want to avoid in the final model. I quite frequently see characters depicted with the wrong eye color, nose color, hair color or length, body/fur color, or missing or incorrect facial markings or fur patterns.
If most of the images align with the expected character features, I will use an image editor or SD inpainting to fix the problematic images where possible. In cases where the character is commonly depicted one way or another, I either pick a side and edit the images for consistency, or I make note of the multiple depictions in my info file so that I remember to add the relevant tags in the next step, letting users choose which depiction they want when rendering with the final model.
Step 5: Tagging
Tagging Ideology
The primary purpose of tagging is to associate the visual information of the character with a specific ‘main’ tag, and to associate any outfits, accessories, or other ‘toggleable’ elements with their own specific tags as well. The other purpose of tagging is to identify unrelated items or concepts in the training material so that they will not be implicitly associated with the aforementioned tags.
When to consolidate tags
To achieve this I’ll want to consolidate some tags. I define one tag to describe the “what” that I want to train, and remove any other tags that describe the core characteristics of that “what”. Take one of my recent models, Retsuko, for example. Retsuko is the “what” - a red panda with tan body fur, white ears, brown inner ears, a mouth, a small black nose, white facial markings, tan facial markings, eyelashes, white eyebrows, etc. These are core characteristics that are present in all images in the dataset, and are all characteristics that I want to be rendered whenever I use the main tag retsuko in my prompt. So I add the tag retsuko to all the images and remove all related descriptor tags like tan body, facial markings, black nose, eyebrows, etc.
A noteworthy exception to this example is Retsuko’s eye color. Canonically her eye color is entirely black with a black sclera, but the training material frequently depicts her with brown eyes with white sclerae. I had a similar conundrum a while ago when training Retsuko’s coworker, Fenneko. In Fenneko’s case I opted to manually edit each image to unify the dataset with black eyes, and was therefore able to omit tags like black eyes, black sclera. For Retsuko, however, I opted to place that choice in the hands of the user by appropriately tagging the images with either brown eyes or black eyes, black sclera.
A character’s outfit and major accessories can be approached similarly. For example, the top typically worn by Roxanne (FNAF) is a red crop top shirt with a pointy black leaf-like design. I removed descriptor tags like red shirt, black leaf, crop top, pattern shirt, etc and simply added a new tag, roxanneshirt.
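As a concrete illustration, the caption for one of the Retsuko images might start out something like this (the non-character tags are made up for this example):

    red panda, tan body, black nose, white eyebrows, facial markings, solo, standing, looking at viewer, simple background

and end up like this after consolidation:

    retsuko, solo, standing, looking at viewer, simple background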
When to add tags
I add tags for two reasons: to identify and describe things that are not part of the character; and to describe the state of things that are part of the character.
Objects, characters, concepts, and art styles that are not part of the character should receive at least some cursory tagging. This helps the model separate the character from its environment and other unrelated elements. If a character is frequently drawn standing next to a tree, and I do not tag the tree, then the model will associate the tree with the character’s main tag during training and will tend to render the character next to a tree regardless of whether I have tree in my prompt. I tag things like art style, background color/type, furniture, objects, text/dialog, etc. Using solo images tends to simplify this process quite a bit. Auto-tagging can help cover a lot of ground here as well, but requires some cleanup.
I also want to tag the state of the character, i.e. aspects of the character in that particular image that are not considered core characteristics. Pose, perspective, hand placement, actions, and even clothing that the character doesn’t normally wear. At some point I noticed that my earlier models tended to produce disfigured mouths and messed up eyes, lips that turn into pseudo-teeth or tongues, and transparent eyelids or double pupils. I found that thoroughly tagging the state of the eyes and mouth helps with this. You can find a non-comprehensive list of tagging suggestions at the end of this document to get a feel for what I tend to tag.
Auto-Tagging
Now that I have finished prepping the images, it’s time to move on to tagging. For the most part I can assume that the tags downloaded with the images are reasonably accurate (aside from any images I have cropped or edited), but they are not necessarily complete. To augment the existing captions, I use two different interrogators for auto-tagging: wd-v1.4 and the e621 image tagger. Installation and usage of these interrogators is beyond the scope of this document, so you’ll need to figure that out yourself if you intend to use either of them.
I use the two different interrogators to obtain both booru-style (anime-style) tags and e621 tags because I train my LoRAs on two different base models - one which responds better to anime tags and one which responds better to e621 tags. Ideally I would have two separate caption files for each image (one caption file specific to each base model), but that sounds like a pain so I just keep both tag types in a single caption file and hope for the best. If I were only training against one model or the other then obviously I would only use one interrogator.
It’s a good idea to adjust the interrogator thresholds to an appropriate level. Too low and it will produce a lot of inaccurate tags, too high and it will miss a lot of accurate tags. For the e621 tagger I typically use a threshold of 0.7. For the wd-v1.4 tagger I use anywhere from 0.4 to 0.7 depending on how close the character is to something that the anime model can already produce. Characters similar to wolf girls, fox girls, or cat girls aren’t a huge stretch for the model so I go with a lower threshold for those. Characters like imps, scalies, and especially feral are a bit tougher so I go with a higher threshold there.
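If you end up scripting the auto-tagging step yourself instead of using a webui, the merge logic is straightforward. Here is a minimal sketch that assumes each interrogator hands you a tag-to-confidence mapping (the wd14_scores and e621_scores arguments are placeholders for whatever your interrogators actually return); it filters each by its own threshold and appends anything new to the existing caption file:

    from pathlib import Path

    WD14_THRESHOLD = 0.5   # I use anywhere from 0.4 to 0.7 depending on the character
    E621_THRESHOLD = 0.7

    def merge_interrogator_tags(image_path: Path, wd14_scores: dict, e621_scores: dict) -> None:
        """Append auto-tagger results to the image's caption file, keeping the existing tags first."""
        caption_path = image_path.with_suffix(".txt")
        existing = [t.strip() for t in caption_path.read_text().split(",") if t.strip()]
        new_tags = [t for t, c in wd14_scores.items() if c >= WD14_THRESHOLD]
        new_tags += [t for t, c in e621_scores.items() if c >= E621_THRESHOLD]
        merged = existing + [t for t in new_tags if t not in existing]
        caption_path.write_text(", ".join(merged))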
Editing Tags
Regardless of the threshold, auto-tagging is never super accurate and tends to require some cleanup. After auto-tagging, I use the dataset tag editor extension for auto1111 to clean up and edit the tags. It’s not the most intuitive interface and there may be better options out there; it’s just the one I happen to use.
Main tag(s)
I start by adding the main tag for the model to the caption for all images. This is usually just the character name. Short names have fewer tokens and tend to form a weaker association with the character, so for those I also include the character’s last name if they have one. I’ve also heard that adding numbers or highly unique words to the main tag can strengthen the association, but I haven’t personally tried this.
With the training settings I use, it is important that the main tag be the first tag in the caption file. I assume most tag editors have the option to rearrange tags or prepend caption files with a specific tag. After adding the main tag I perform a bulk removal of tags that I will be consolidating into the main tag - species, body/hair/eye color, markings, etc. See Tagging Suggestions for examples of what I tend to remove. Next I create and consolidate tags for the character’s main outfit(s) and major accessories.
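If your tag editor can’t handle the prepend and bulk removal in one pass, a few lines of Python can. This is only a sketch using the Retsuko example from earlier; point it at your own dataset folder and swap in your own main tag and list of consolidated tags:

    from pathlib import Path

    MAIN_TAG = "retsuko"
    # Descriptor tags being consolidated into the main tag (examples from the Retsuko model)
    CONSOLIDATE = {"red panda", "tan body", "black nose", "eyebrows", "white eyebrows", "facial markings"}

    for caption_path in Path("dataset").glob("*.txt"):   # folder holding the images and .txt captions
        tags = [t.strip() for t in caption_path.read_text().split(",") if t.strip()]
        tags = [t for t in tags if t != MAIN_TAG and t not in CONSOLIDATE]
        caption_path.write_text(", ".join([MAIN_TAG] + tags))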
Cleanup
Next I look for and remove or change any inaccurate tags added by the auto-tagger. A lot of this can be done in bulk if the editor has the capability, and afterward I often go through each image individually to clean up any stragglers. It helps to become familiar with the tags and tag style used by the base model that will be used for training, as this will give you a better idea of what should be changed or removed. At this time I also double-check the captions of any images I have cropped and edited to make sure I have properly cleaned up those tags.
Tag all the things
Next I tag the things that are not core features of the character. Art style, backgrounds, objects, other characters, furniture, text, etc. With any luck the auto-tagger will have covered a lot of this and I’ll only need to make minimal additions or corrections.
Where things can get tedious is tagging the state of the character, as the auto-taggers tend to miss a lot here. Pose, perspective, actions, eye and mouth state. Again, refer to the Tagging Suggestions to see examples of what I normally tag. This can certainly take a while, but in my experience this process adds noticeable flexibility to the model.
Reinforcement Tags
Throughout all of the tagging process I frequently refer to and update my info file, writing down new tags I have added and removing notes for things I have already tagged. You may notice that the final info.txt file included with most of my models lists a set of reinforcement tags. These are typically just tags I have identified that users can include to help guide base models to more specific outcomes; I don’t usually go out of my way to add these tags to the dataset.
Step 6: Training
I won’t try to cover the whole technical training process here as there are plenty of tutorials out there and many different ways to approach it, so I’ll just cover some of the settings of my particular setup.
After tagging is complete, I rename the dataset folder to 10_<main tag>, e.g. 10_retsuko. I don’t think it’s actually necessary to include the main tag in the folder name because I’m using caption files, but it does make searching for the folder easier if I ever need to find it again in the future. The number at the start represents the number of repeats per epoch. It’s possible to train multiple concepts in one model or place more emphasis on a subset of images by using multiple directories with a different number of repeats. But for the sake of simplicity I usually keep everything in a single directory with 10 repeats.
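For reference, the resulting layout ends up looking roughly like this (the parent folder name and image file names are simply whatever you and the downloader happened to use):

    training_data/
        10_retsuko/
            1234567.png
            1234567.txt    (comma-separated tags, main tag first)
            1234568.jpg
            1234568.txt
            ...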
For training I use LoRA Easy Training Scripts which allows me to use a json template file that I can easily adapt for new models, and can queue up multiple trainings if I provide multiple config files. Here is an example of that config file.
Learning Rate
Looking at the sample config file, the first thing to note is that I use 10 epochs and that the learning rates are weird. How under- or over-baked a model is depends in large part on both the learning rate and how many times it ‘sees’ an image during training. The latter is determined by the number of epochs, the number of repeats, and the total number of images in the dataset. If I have 80 images with 8 repeats, the model will ‘see’ 640 images per epoch, and if I have 10 epochs then the model will ‘see’ 6,400 images total throughout the training process.
When I first started training models I would use the same learning rate (0.0001) each time and then play around with the number of repeats and epochs, often training multiple times until I got something that worked well. These days I’ve discovered that it’s much simpler to use the same number of repeats (10) and epochs (10) each time and calculate the learning rate based on the number of images in the training set. Here is the formula I use:
L = S / (R × E × I)
L is the learning rate, S is the ‘strength’ of the training, R is the number of repeats, E is the number of epochs, and I is the number of images in the dataset. For the ‘strength’ of the training, a value of 1.0 nominally represents a fully baked model, but the actual value I use depends on the character and which base model I’m training with. The more difficult it would be for the base model to generate the character in the absence of a LoRA, the higher the strength I use. When training most furry characters on an anime model I typically use a strength value of 1.1, but I may use 1.2 or even higher for characters that are a bit more unique, e.g. Discord (MLP).
Looking at the sample config file you can see how I derived the learning rate using a strength of 1.1, 10 repeats, 10 epochs, and 81 images in the dataset:
1.1 / (10 × 10 × 81) = 0.00013580246913580247
I then take that learning rate and simply multiply it by 0.5 or 0.6 to obtain the text encoder learning rate.
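As a quick sanity check, the same calculation in a few lines of Python (the 0.5 text encoder multiplier is just the low end of the 0.5-0.6 range mentioned above):

    def lora_learning_rates(strength: float, repeats: int, epochs: int, num_images: int):
        """Unet LR from L = S / (R * E * I); text encoder LR at 0.5x (use 0.6x if you prefer)."""
        unet_lr = strength / (repeats * epochs * num_images)
        return unet_lr, unet_lr * 0.5

    # The example above: strength 1.1, 10 repeats, 10 epochs, 81 images
    unet_lr, te_lr = lora_learning_rates(1.1, 10, 10, 81)
    print(unet_lr, te_lr)   # roughly 0.0001358 and 0.0000679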
Resolution
The minimum resolution I train at is 512x512, and training at higher resolutions does seem to improve results. If possible I prefer to train at 768, but this does take significantly longer to bake so it may not be worth the extra time without access to powerful hardware.
When using aspect ratio buckets with my particular setup I also need to adjust the min and max bucket resolutions. At a resolution of 512 the min and max bucket resolutions are 320 and 960 respectively. At 768 these become 480 and 1440. Some base models might perform better with LoRAs trained at specific resolutions; you’ll want to do your research there for the base models you plan to train on.
As for batch size, a higher value can improve training times, but the maximum batch size will depend on your hardware specs and training resolution. On an RTX 4090 with 24GB of vram, I can use a batch size of 12 when training at 512 but only 5 when training at 768. You will need to experiment with your particular hardware to figure out your batch size limits, and keep in mind that the maximum possible batch size doesn’t always yield the fastest training times.
Network dim, alpha, and LoCon
I use a network dimension of 128 for nearly all of my models. It is absolutely possible to produce smaller models that perform similarly with dim sizes of 64, 32, and probably 16, but I’ve had no issues using 128 thus far so I haven’t had much reason to change it.
For the alpha value, when in doubt I use 64 but this is something you’ll want to play around with. Sometimes 128 yields better results, sometimes ‘null’ produces better results.
If I am training with LoCon enabled, I use 32 for both the dim and alpha of the convolutional network. I haven’t really tried other values since 32 seems to work well enough.
Keep tokens
I enable the ‘shuffle captions’ option when training to put all the tags on an even playing field. However, I don’t want that for the main tag. It’s the most important and should always be the first tag. The ‘keep tokens’ value determines how many tags at the start of the caption file will be excluded from shuffling. Because I usually only have 1 main tag, I set this value to 1. (Note that despite using the term ‘token’, all the research I’ve done indicates this setting applies to whole comma-separated tags and not individual model tokens). This is why I added the main tag to the beginning of the caption file during the tagging phase.
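For example, with ‘keep tokens’ set to 1, a caption like

    retsuko, standing, looking at viewer, open mouth, simple background

might be fed to the trainer as retsuko, open mouth, simple background, standing, looking at viewer on one repeat and in a different order on the next, but retsuko always stays at the front.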
Once all of my settings are in place I start the training and let it bake. Once finished I’ll throw it into my lora directory and give it a test run to make sure that it is properly capturing the character likeness while remaining flexible and not forcing specific poses, objects, or styles. I also check to make sure that outfits and accessories show up when prompted. At this point if everything looks good I’ll call it done, otherwise I’ll go back and add more tagging or adjust training settings as necessary and try again.
Trash-Tier Training
I occasionally need to train a quick-n-dirty model with minimal time investment, sometimes as part of researching the viability of a proper model for certain characters, sometimes for one-off renders, and sometimes for playing around with training settings. My process here is more or less an abbreviated form of the full process, taking a fraction of the time and effort while producing a somewhat usable LoRA.
Character research - Generally just a quick search on e621.net to get a rough idea of how many images I’m working with.
Downloading the dataset - Usually 100 images or fewer, using the tags <character name>, solo.
Pruning the dataset - Still going for character likeness here, but I’m less picky and throw out only the worst offenders and lowest-quality images.
Image prep - Quick basic cropping only, and I don’t usually bother updating the caption files.
Tagging - Auto-tagging with little to no cleanup, adding the main tag to the start of the caption files, and maybe some quick tagging around major outfits and accessories.
Training - Same settings as normal except I adjust resolutions to train at 512 instead of 768.
Tagging suggestions
Consider removing/consolidating tags related to:
- eye color
- body/fur/skin colors
- markings, tattoos, patterns
- countershading
- gloves/socks (markings)
- multicolored / two tone / dipstick
- scars
- etc
Clothing and accessories:
- clothing
- hat, headwear
- glasses
- scarf, bandana, etc
- Topwear: shirt, jacket, vest, coat, bra, etc
- Bottomwear: pants, shorts, panties, etc
- Gloves, armwear
- Shoes, boots, socks, stockings, legwear, etc
- jewelry
- collar, choker, necklace, pendant, etc
- earrings, piercings
- bracelet, anklet
- finger rings
- accessories
- hair band
- etc
Perspective:
- rear view, from behind
- side view, from side
- low-angle view, from below, worm's eye view
- high-angle view, from above, bird's eye view
Pose:
- standing
- sitting, kneeling
- squatting/crouching
- all fours
- bent over
- crossed arms/legs
- lying
- on back
- on front
- on side
- actions
- running
- jumping
- holding <object>
- etc
Eyes:
- wide-eyed
- half-closed eyes, narrowed eyes, bedroom eyes
- one eye closed
- closed eyes, eyes closed
- looking back
- looking at viewer
- looking away
- looking to the side, looking aside
- looking up
- looking down
- looking forward
- looking at another, looking at partner
Mouth:
- closed mouth
- open mouth
- parted lips
- grin
- clenched teeth
- teeth, sharp teeth, fangs
- tongue, tongue out
- lipstick
- smile, frown, smirk, etc
Text and icons:
- text, english/japanese text
- dialogue, speech bubble, talking
- signature, watermark, artist name
- url, logo
- <3, heart, star, pictographics, emanata, etc
- exclamation point, question mark, etc
Style:
- cartoon, toony
- 3d
- realistic
- monochrome
- shaded
- flat colors
- sketch