What if I told you that all you need to train a useful Lora is one image?
It turns out that Single Image Dataset (SID) Loras perform much better at certain tasks than Loras trained on a small handful of images. The Lora can focus on learning just the content of that one image, which means it can replicate elements from that specific image very well. It also remains surprisingly flexible when you reduce its weight, provided you train it on concepts the base model already understands well.
Single image dataset Loras are not the highest quality; they overfit to the limited data extremely easily. They make up for this by being cheap to produce and hyper-specific: they can be trained in just a minute or two on readily available hardware.
The basic philosophy here is to fix quality issues with generation techniques, not through superior training.
Also, I'm not suggesting you share most of these Loras here on Civitai. They are usually intermediate products that help produce a specific effect. That said, if you find one that is generally useful and decide to share it then make sure to tag it appropriately and include all the details needed to use it.
Enough overview, what are these single image dataset Loras useful for?
Eliminating elements from the original using captions
Adding new elements to the original
Creating variations on the original
Assisting with the inpainting process
Transferring the original to a new style
Using the original as a reference image for new generations
Using a specific element from the original in a new generation
Composing two or more Loras together to use specific elements from each
Bootstrapping a single image into a full dataset for training
If any of these seem exciting to you, then read onward for training recommendations or just scroll down to look at the examples with pretty pictures.
Training A Lora
This isn't a step-by-step Lora training guide for beginners; there are plenty of those already, and there are several different systems (I use kohya_ss on Linux). But there are some specifics worth going over to make these results easy to replicate. Other guides often focus on dataset preparation, but that's very easy when you have just one image to prepare.
First, you need to select an original image. Let's use this one for most of the examples:
You can use whatever image you want.
Next is model selection. This is one of the most important details to get right, because the base model you choose to train your Lora on determines which prompts/captions the model understands and thus what it can effectively learn from your image. With just a single image, it can't learn a new concept with any sort of flexibility, so it needs to know any concept you want to use (or a related one) beforehand. I've also found that high quality models tend to bias towards high quality results, so start from a good one. It's also wise to pick a model that is somewhat similar to the desired style, though it doesn't have to match exactly.
I'm going to use FoolCat 3D Cartoon Mix because that was the model that generated this image in the first place, and will thus be highly compatible with it.
Next you need to decide on your captions. You only need to caption one image, so this isn't much work.
Some training systems (like kohya_ss) have parameters like instance or class, but it's all just captions, so you can ignore these. Kohya_ss ignores the instance and class if you include a caption file, and when it does use them it simply captions the image with "instance class".
Captioning is a huge topic. A good rule of thumb is that if you prompt the base model with the caption and it generates something reminiscent of what you want then using that caption during training should focus its attention on that concept in your training image and "pick it out" into that caption.
A very important point: each caption you add during training subtracts the content of that caption from the rest of your captions. For instance, if I use "holding red apple" as a caption for this image, then generations using the resulting Lora will NOT show the character holding a red apple unless that prompt is used. Using all of the captions together adds everything back and reproduces the original image. On the other hand, if I had not included the caption "holding red apple" during training, then the other captions would collectively contain that element (weighted by their relationship to it). The Lora itself also takes on some background bias, which is especially strong from uncaptioned elements and will appear in all generations using the Lora.
Anyway, I plan on writing a future article on this topic, but some quick notes on captioning before I move on:
Single image dataset Loras are incredible experimental laboratories and you should experiment with your captioning to find out what works
Captions for elements not present in the image have low weight and thus don't impact the final result much (though they can end up learning a new concept, which can be good or bad)
I'll be using Booru-style prompting for these examples, but other prompting styles have different advantages and no one style is strictly better
Multiple tokens/words in a single caption can be used together as a unit to replicate the original element closely or pulled apart and re-combined after training
Caption the unwanted aspects of an image to concentrate them into captions that express them; for instance, if the original image has text but you don't want text in your generations, include a caption for "text" (or "LANGUAGE text") to capture the text into that caption
For this example I will use the following captions:
woman, blue blouse with pocket, buttons, id card badge, short blonde hair, blue eyes, sitting, holding red apple, wooden desk, hand on papers, cup with red pen on desk, chair, from front, blush, blue wall, wooden panels, window, cityscape, tree
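In kohya_ss, a caption like the one above lives in a plain text file that shares its basename with the image, inside a folder whose name encodes the repeat count. As a minimal sketch (all paths and filenames here are hypothetical, so adjust them to your own setup), the dataset layout can be created like this:

```python
from pathlib import Path

# Hypothetical dataset root; kohya_ss reads repeats from the
# folder name, which follows the pattern "<repeats>_<name>".
dataset_root = Path("training/woman_with_apple/img")
image_dir = dataset_root / "20_bh woman"
image_dir.mkdir(parents=True, exist_ok=True)

captions = (
    "woman, blue blouse with pocket, buttons, id card badge, "
    "short blonde hair, blue eyes, sitting, holding red apple, "
    "wooden desk, hand on papers, cup with red pen on desk, chair, "
    "from front, blush, blue wall, wooden panels, window, cityscape, tree"
)

# The caption file shares its basename with the image
# (here the image would be original.png).
(image_dir / "original.txt").write_text(captions)
```

The image itself goes in the same folder; with a caption file present, the "bh woman" in the folder name is ignored in favor of the captions.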
You do not need regularization images. These work against you if used incorrectly and you can get good results without them. I also plan on making a future article on this because they are especially valuable in the context of single image dataset Loras, but tricky to use.
If you want to experiment with this, regularization simply trains a different regularization image between each training step. You can use this to partially "cancel out" training you don't want and reduce unwanted bias, retaining flexibility at the cost of training time and dataset preparation (both of which are extremely low cost already for single image dataset Loras). That said, you should start without regularization to build up your intuition.
For the most part, the Lora training defaults are just fine (at least in Kohya_ss), but there's a few worth mentioning:
Repeats - In Kohya_ss this is set during dataset prep; it's how many times the image is trained per epoch. I recommend 20 because it's a nice round number and works well with regularization (20 regularization images needed per image). You can use anything, but I recommend numbers with lots of factors.
Epochs - The number of epochs needed can vary dramatically, but as a rule of thumb your Lora should be fully trained by around step 120 (6 epochs with 20 repeats), and you shouldn't see any further benefit by step 400 (20 epochs with 20 repeats). Generally, the best results are somewhere between steps 100-200 (epochs 5-10 with 20 repeats), but higher steps/epochs can be very useful for keeping the results close to the original. I usually use epoch 6 or 20. (Also note that these guidelines only apply to single image datasets; multi-image datasets can easily take far more steps.)
Lora type - Standard Loras work fine for this. Other types may be more effective or allow a smaller rank/size, but my experiments didn't uncover a dramatic difference in absolute quality for single image datasets.
Max resolution - This should be set to the size of the training image or to the size you want it resized to. It's 512,512 in this example.
Network rank - Larger network rank means a bigger Lora (in terms of hard drive space) but also more space for the Lora to learn. I usually use 64, but this can probably be smaller. (I haven't explored this parameter much, but generally you want to use the smallest rank that works effectively.)
Network alpha - Alpha should be some fraction of the network rank; setting it equal to the rank (100%) effectively disables alpha. I usually use 50% of the rank, so 32. (I haven't explored this parameter much.)
Buckets - Totally useless for single image datasets, even though they're great for more complex datasets.
Max token length - If you use more than 75 tokens (as measured during prompting) in your captioning, make sure this is set high enough or it'll crop your captions. More than 75 is probably a sign you're overdoing it, but there's nothing wrong with increasing this if you need it.
Shuffle captions and keep N tokens - Shuffle captions can help with flexibility by randomly reordering your captions during training. Keep N tokens holds that many comma-separated chunks at the front (NOT tokens as measured during prompting, despite the name). The first chunk seems to accept much more "residue", so it's a great place to put a new concept or character name. Use "keep 0 tokens" if there's no special trigger phrase, or "keep 1 token" if there is. It's also recommended to pair a totally novel token with a known concept to help with learning; for example, if you wanted to name this character "bh", use "bh woman" as the first caption.
Memory efficient attention - Pretty much mandatory unless you have a huge amount of VRAM
Flip augmentation - This effectively turns your single image dataset into a 2-image dataset: the original and a horizontally flipped version. It greatly weakens the directionality of the image, which can be great or bad depending on what you're using it for. If you want to track more closely to the original, turn this off. If you want to use elements in both directions or otherwise have a more flexible model, then turn this on.
Samples - Not necessary, but it's nice to see the progress of your training especially when you're starting out.
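Two of the numbers above can be sanity-checked in a few lines. This sketch assumes a batch size of 1 and the alpha/rank scaling used by standard LoRA implementations:

```python
def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    # Every image is trained `repeats` times per epoch.
    return (num_images * repeats * epochs) // batch_size

def lora_scale(alpha: float, rank: int) -> float:
    # Standard LoRA scales its weight update by alpha / rank, so
    # alpha == rank (100%) gives a scale of 1.0, i.e. alpha does nothing.
    return alpha / rank

# The guideline numbers from this guide: 1 image, 20 repeats.
assert total_steps(1, 20, 6) == 120    # around the "fully trained" point
assert total_steps(1, 20, 20) == 400   # little benefit beyond this
assert lora_scale(32, 64) == 0.5       # alpha at 50% of rank 64
```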
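The shuffle captions / keep N tokens behavior can be sketched as follows. This is a simplified model of what the trainer does each step, not the real implementation, but it shows why the first chunk is a safe place for a trigger phrase:

```python
import random

def shuffle_caption(caption: str, keep_tokens: int = 1, rng=None) -> str:
    """Split on commas, pin the first `keep_tokens` chunks in place,
    and shuffle the rest (done anew for every training step)."""
    rng = rng or random.Random()
    chunks = [c.strip() for c in caption.split(",")]
    head, tail = chunks[:keep_tokens], chunks[keep_tokens:]
    rng.shuffle(tail)
    return ", ".join(head + tail)

caption = "bh woman, blue blouse with pocket, holding red apple, wooden desk"
shuffled = shuffle_caption(caption, keep_tokens=1, rng=random.Random(0))
# "bh woman" always stays first; the remaining chunks move around.
assert shuffled.startswith("bh woman")
```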
Once you've set this all up, train your model. On my hardware (NVIDIA GeForce RTX 3080 Ti) it takes around 2 minutes to train a single image dataset Lora to epoch 20/step 400. Then load the results into your favorite generation program (I use Automatic1111).
It's useful to think of the different epochs produced during training as another parameter you can adjust. Higher epochs will be better trained and reproduce the original better, but lower epochs will retain more flexibility. If you save all the Lora epochs, you can dial in whichever one you want.
That said, there are usually two "phase changes": from early learning to trained, and again from trained to inflexible.
If you plan on using a Lora long term or sharing it, then it's usually best to track down the optimal epoch(s) that give the best results and discard the others.
Keep in mind that overfitting/inflexibility isn't always bad in the context of a single image dataset Lora; you might want a close replication rather than flexibility.
Also note that the less data you have, the narrower the "optimally trained" window is, so single image datasets can easily have moved on to inflexibility by the time training completes.
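If you do keep every epoch, kohya_ss names the intermediate checkpoints with a zero-padded epoch suffix (as seen in the Lora tags later in this guide). A small sketch of that convention, with a hypothetical output directory and the caveat that naming can vary by version:

```python
from pathlib import Path

# Hypothetical output directory for the trained Loras.
output_dir = Path("training/woman_with_apple/model")

def epoch_file(name: str, epoch: int, final_epoch: int) -> Path:
    """Intermediate epochs get a "-NNNNNN" suffix; the final epoch
    is saved under the bare model name."""
    if epoch == final_epoch:
        return output_dir / f"{name}.safetensors"
    return output_dir / f"{name}-{epoch:06d}.safetensors"

# Epoch 8 of a 20-epoch run:
assert epoch_file("woman_with_apple_01", 8, 20).name == "woman_with_apple_01-000008.safetensors"
```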
Examples of Use Cases
Alright, we've got our Lora trained; it's time to use it. Keep in mind that, unless otherwise noted, all of the following generations were done purely with txt2img. Each image also has its generation prompts and parameters attached.
Replicating the Original
Before getting into the useful stuff, it's good to consider how to replicate the original image closely using the Lora. This is a useful baseline and demonstrates most of the core techniques.
Set up your environment as follows:
Select the model you trained on (FoolCat 3D Cartoon Mix)
Use the same resolution you trained on (512,512)
Setup any other parameters that typically work well for you
Enter in all your captions as prompts
Include your Lora at a high weight (1.0 is ideal for this) (this example uses epoch 20)
woman, blue blouse with pocket, buttons, id card badge, short blonde hair, blue eyes, sitting, holding red apple, wooden desk, hand on papers, cup with red pen on desk, chair, from front, blush, blue wall, wooden panels, window, cityscape, tree <lora:woman_with_apple_01:1>
Then generate. Here's an example:
It's close, but not exactly the same, and the quality has taken a bit of a hit. That's okay, because you can add some negative prompts (or positive prompts, but I find those less valuable) to fix some of the quality issues. I've found these negative embeddings the most valuable:
There's plenty of others which can help (for instance I also find BadDream valuable but it's enormous), so use whatever works best for you. Here's the result of using the two recommended negative embeddings:
Much better. You'll usually get even better results by using Hires fix, here's a doubling of the resolution:
Pretty good. You shouldn't expect to get the original back exactly, but you already have the original so a slight variation should be good enough.
Also, a note: this example actually gave me some issues with Hires fix introducing severe folding into the character's blouse (which you can still see if you look closely at the upscaled image). I was able to mostly fix it by reducing the denoising strength from 0.5 to 0.4. The denoising step seems to introduce these artifacts, so I wasn't able to wash it out completely.
Another thing worth trying if you have quality issues is reducing the strength of the Lora. If you have a good base model, then it will be good at its job; reducing the Lora strength gives the base model more control and thus more leeway to fix issues. You can get a pretty good replication all the way down to strength 0.6 if you use all of the captions as prompts.
Eliminating Elements
Now that you can replicate the original, eliminating an element is easy: simply remove the prompt you want eliminated. Here's the same generation without the apple:
In this example I used a Lora strength of 0.8. Lowering the weight makes it easier to make changes, so if you're having trouble eliminating an element then reduce the Lora weight. If simply omitting the prompt isn't working you can use it as a negative prompt instead, or even use a higher weight on the negative prompt. Here's an example of removing her pocket that required a negative prompt with emphasis and reducing the weight to 0.7:
Note that certain elements seem to be understood by the model as "inherent" to another prompt and are extremely hard to remove using negative prompts as a result. For instance, it's very hard to remove the tires from a car in most models. That's not to say you can't remove these elements, just that this approach won't work well.
Adding New Elements
To add an element, simply add the additional prompt. You'll generally need an even lower Lora weight (I use 0.6 for these examples). Here's one that adds a "(gold necklace)":
You can replace old elements with new ones as well. Here's one where "holding red apple" is replaced by "holding blue ball":
If you run into difficulties, use the techniques that follow to make more substantial alterations.
Creating Variations
You can use the Lora to keep the generations fairly stable and generate variations on the original. For instance, here's what the prompts generate without the Lora:
There's no apple and many of the other elements only show up sporadically. In contrast, here's a generation with the Lora's epoch 6 at strength 0.6:
(Meta note: I had to downsize this grid image to fit under Civitai's filesize limit, it uses the same prompt as the other grid but with the Lora at strength 0.6.)
The version with the Lora retains the essential character of the original while producing variations. You can also dial in how close to the original you want to stay.
Assisting with Inpainting
Since the Lora causes the overall image to converge on the original, it can limit how much overall change occurs, making inpainting much easier.
In the "holding blue ball" example earlier, I had trouble getting it to generate a green ball (there's a lot of blue in the image already so it was able to generate a blue ball without much fuss). So, I used several rounds of inpainting to accomplish this without needing to be particularly precise with masking.
This example uses img2img/inpainting. Also note that I used epoch 8 instead of epoch 20.
Starting with the original (sure why not), I did 2 rounds with "woman, (hand holding green ball) <lora:woman_with_apple_01-000008:0.6>" and the negative embeddings:
For the third and final round, to clean up, I switched to "hand holding green ball <lora:woman_with_apple_01-000008:0.4>":
I'm not sure if it's worth training a Lora just to assist with inpainting, though if fully automated the prep work and training could be done in about a minute. It's certainly nice if you have it already available.
Transferring to a New Style
This kind of Lora works really well with other models, so I regenerated the image in Absolute Reality to make a photorealistic version of this character.
Bam, she's a real person!
This was at epoch 20 with a strength of 0.8, so extremely strong and yet the photographic quality is only modestly impacted by the style of the original.
I think the reason this works so well is that (as long as you pick a reasonably close model to train on) the Lora learns the specifics of your training data in a similar stylistic context, so its weights on the style are low. When it's put into a new base model, it basically adopts the new style, since it has no strong opinions about style.
Note that if you train on a really different model, this works differently: the Lora learns the style much more heavily. Also note that you generally just need to be in the right ballpark of style for things to transfer well, and the stylistic elements it does learn might be valuable depending on what you're going for.
Using the Original as a Reference Image
Alright, it's finally time to leave behind small variations on the original image and make more dramatic changes by using the Lora to treat the original as a reference image.
The further you drop the Lora weight and the fewer captions you use, the less your generations will resemble the original. In particular, a Lora weight of 0.5 seems to be a key turning point, with the base model more in control below 0.5 and the Lora more in control above it.
Let's go on a jog:
woman, blue blouse with pocket, buttons, short blonde hair, blue eyes, running, from side, breathtaking autumn forest <lora:woman_with_apple_01-000006:0.4>
Or visit a cafe:
woman, blue blouse with pocket, buttons, id card badge, short blonde hair, blue eyes, sitting, holding red apple, chair, from front, blush, cafe, countertop, coffee <lora:woman_with_apple_01-000006:0.6>
In addition to weight, captions have a massive impact too. Even at a relatively high strength of 0.8 we can completely replace the background:
woman, blue blouse with pocket, buttons, id card badge, short blonde hair, blue eyes, sitting, holding red apple, wooden desk, hand on papers, cup with red pen on desk, chair, from front, blush, outdoors, picnic table, park bench, trees, sunny, river, cityscape <lora:woman_with_apple_01-000006:0.8>
Or we can replace the character with a different one:
woman, simple rainbow dress, long pink hair, brown eyes, sitting, holding red apple, wooden desk, hand on papers, cup with red pen on desk, chair, from front, blush, blue wall, wooden panels, window, cityscape, tree <lora:woman_with_apple_01-000006:0.8>
Another thing worth noting: if you change the resolution from the original, one of three things can happen:
The Lora stretches the original to fill the new space (especially at high strengths and epochs, or relatively modest changes to resolution)
The base model outpaints new material into the new space (possibly with some more modest stretching)
The original composition "breaks" and a completely new one can be generated
This last possibility can make changing resolution valuable if you want something substantially different.
Using Specific Elements
Taking this even further, you can use individual elements from your Lora in very different compositions by using only a few captions. It's often surprising what you can reuse in different contexts. As an example of something quite different, I reused the unusual ID card strap in a military uniform:
man, military outfit, standing near heavy equipment, (id card badge), outdoors, muddy <lora:woman_with_apple_01:0.8>
There it is, the ID badge. At a high weight of 0.8 you can see that other aspects of the original image bled heavily into the new composition, with the shirt being very similar. Here's one at a lower weight:
man, military outfit, standing near heavy equipment, (id card badge), outdoors, muddy <lora:woman_with_apple_01:0.6>
Most of the other generated images were similar with the ID badge having a leather strap now.
Composing Multiple Loras
By taking two or more Loras (single image or not) and composing them together, you can pick and choose which elements to merge. Single image dataset Loras work pretty well together for combining elements of two specific images into a new composition.
For this one I made two new images and used them to train single image dataset Loras unrelated to the running example.
The first one is a character similar to our original example sitting on the hood of a car. I chose this one because the position was somewhat unusual and fun:
woman, sitting on car hood, short blonde hair, blue eyes, red hoodie, jeans, black heels, red sedan, white blouse, hands on hood, crossed legs, road, trees, sky, mountains
I made the second one because I realized that composing our running example onto this image wouldn't be all that impressive, so I needed a very different character:
woman, long red hair, brown eyes, medical face mask, ear, earring, close up, white coat, black blouse, tie, bokeh
Alright, so let's combine these quite different images together:
woman, sitting on car hood, long red hair, brown eyes, medical face mask, ear, earring, white coat, black blouse, tie, jeans, black heels, red sedan, hands on hood, crossed legs, road, trees, sky, mountains <lora:sitting_on_car_hood_01:0.6> <lora:face_mask_01:0.6>
This is pretty good considering how different the two images were. Most of the outputs were similar, so it wasn't difficult to get this one.
Here's an example of using a normal Lora instead, with Mei from Overwatch:
mei, sitting on car hood, (red sedan), brown eyes, brown hair, hair bun, hair stick, fur coat, jeans, black heels, hands on hood, crossed legs, road, trees, sky, mountains <lora:sitting_on_car_hood_01-000012:0.8> <lora:meiOverwatchLORA_v1:0.6>
I like the added touch of the snowcapped mountains and the adjustments made to the hoodie.
Another sub-use case here is regulating the interactions between characters. Here's another new image (which required some inpainting work to correct):
ballroom dancing, from side, woman, pink dress, choker, dark brown hair bun, blush, back, man, white coat, short black hair, black tie, black pants, belt, theater stage background, bokeh
I was able to combine this with the initial example to get this result:
This is much harder to get right due to all the concept bleeding, but it can definitely be helpful.
It should also be noted that there were a lot of close variations, inadvertently making this into a sort of "happy couple generator" Lora:
Bootstrapping a Full Dataset
Last but not least, you can use these techniques to bootstrap a single image into a complete dataset, suitable for training a full Lora. For instance, taking the close-up shot of the red-haired woman with the face mask, I was able to generate these images for a dataset:
The last one required some inpainting to fix up.
Putting these into a new dataset with the original allowed for more aspects of this character to be defined and strengthened the characteristics that I focused on keeping consistent.
Here are some second generation images:
These were more consistently correct than the first generation. By continuing to iterate and carefully selecting new data, you would eventually have enough for a full dataset covering all the relevant aspects of this character.
Conclusion
Single Image Dataset Loras are a powerful tool that you should put into your toolbox. Much like ControlNet, they open up a whole new dimension of control over your image generations. The cheapness with which you can train new single image dataset Loras means you can create them for all sorts of very specific reasons, or just whenever you feel inspired by an image.
They are also wonderful laboratories for experimentation, since you can see exactly what effect any given change has.
This technique lets you operate when all you have is extremely limited data: just one image.
Note: Prior Work
While I did my development on this independently, I did some searching just before writing this article and found that there was some previous work on this here on Civitai. This is not surprising to me as this concept is a logical conclusion of the underlying technology and is bound to gain traction eventually. Here's what I found:
The guides on training an OC LoRA with a single base image in particular take a deep dive into the dataset bootstrapping process, so if you're interested in more detail on that process you should definitely check them out.
That series also references the Eisthol's Daughter series of Loras which are similarly trained from a single image.
Here are the referenced models, so those without easy access to Lora training can try them out. That said, the individual models are far less valuable than the technique and I really recommend getting access to some kind of training program if you don't have one already.
I'll link to the later parts on captioning and regularization that I'm planning on writing here, once they exist of course.