
Every TI Has Its Own Personality: My Experience in Embedding Making - Part 1


How time flies! It's been almost a year since I first tried to make a TI.

At first, the TIs I made were ugly and terrible, just like those of most people who are new to making TIs.

Besides constant experimentation, I sought advice and guidance on Reddit and Civitai and followed other people's discussions, and I received kind and patient responses.

Recently, after I started posting some of my TI creations on Civitai, many people have been asking me various questions about TI making, including my training parameters. Some people have even suggested that I write a tutorial.

No, I feel I am far from reaching the level where I can write a tutorial on this topic.

Most of the so-called 'tutorials' online are nothing but garbage copied and pasted from various places, containing all sorts of errors and full of holes. I'm not interested in doing something like that.

But since many people are asking, it wouldn't be right not to respond at all. So let me share some of my insights with everyone. I'll write as things come to me: today I'll cover a part of it, and perhaps there will be more to come in the future.

It's just some of my personal experiences and thoughts. I'm not a professional, so there may be many errors and cognitive biases in the following content. Please excuse me.

Below, when I mention TI, I'm talking about the ones trained specifically to generate images of particular individuals. Basically, those are the TIs categorized as "celebrity" on Civitai.

Furthermore, whether a result resembles the person or not, and whether it's good-looking or not, are highly subjective and personal judgments. Everyone's opinion will differ. If I think it looks like someone and you say it doesn't, or I think it's beautiful and you say it's ugly: okay, you're right. I won't argue with you.





Okay, let's begin.

1. Is making LoRA easier than making TI?

Yes.

I've made both types, and many people, including myself, find that with LoRA it is easier to reach the goal of 'looking like' the target subject.

But I still prefer TI.

2. A1111, Kohya, or some other tool?

I've only used A1111 and Kohya. Both tools can successfully make TIs based on SD v1.5.

I use A1111 more often and prefer it. I find it simpler and more convenient.

Over the past months I've used A1111 versions 1.5, 1.6, and 1.7, then more recently 1.8, and now 1.9. All of them can train normally.

3. SD1.5, SD2, SDXL, or SD3?

A1111 currently only supports TI training for SD1.5.

I haven't tried making TIs or LoRAs based on SD2.

Although Kohya supports TI training for SDXL, I've never succeeded. I've attempted to train TIs for SDXL on Kohya multiple times, but have never achieved satisfactory results.

I hope in the near future, there will be effective TI training tools for SD3.

4. What is the most critical factor determining the success or failure of TI training?

DATASET.

I spent most of my time preparing and adjusting the dataset.

With a good dataset, even if the training parameters are not ideal, you can still get decent results. With a bad dataset, on the other hand, no parameters will make the training succeed.

5. What is a good dataset?

I don't have a simple, standard answer to this question, but I have a lot of experience with it.

It is precisely the dataset that shapes the 'Personality' mentioned in the title of this article.

The answer to this question involves many different aspects, and every TI creator has their own insights. Perhaps I will write a dedicated article on this issue in the future.

The next few small questions are all related to it, and they can be considered answers to the simplest aspects of this question.

6. Do the training images have to be 512x512?

NO.

512x512 is just a simple and easy choice. You can also try other different sizes, such as 512x768, 768x768, and so on.

You can also include images of different sizes within one dataset. Both A1111 and Kohya can perform 'bucketed training' on images of different sizes.

For example, my Anna Kendrick SoloTI, Scarlett Johansson SoloTI, etc., are trained using images of different sizes.


Anna Kendrick SoloTI

Scarlett Johansson SoloTI
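For anyone wondering what 'bucketed training' actually does: the trainer groups images of similar aspect ratio together so that every batch contains images of one size. A minimal sketch of the idea in Python (purely illustrative, not the actual A1111 or Kohya code; the bucket list here is an assumption):

from collections import defaultdict
import glob
from PIL import Image

# Candidate bucket resolutions (made up for illustration; real trainers derive these automatically)
BUCKETS = [(512, 512), (512, 768), (768, 512), (768, 768)]

def nearest_bucket(width, height):
    # Pick the bucket whose aspect ratio is closest to the image's
    ratio = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))

buckets = defaultdict(list)
for path in glob.glob("dataset/*.png"):
    with Image.open(path) as img:
        buckets[nearest_bucket(*img.size)].append(path)

# Each training batch is then drawn from a single bucket,
# so all images in a batch share one resolution.
for size, paths in sorted(buckets.items()):
    print(size, len(paths))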

7. How many images should be included in a dataset?

I don't think there's a single fixed number that's the only best choice.

The number of images you need depends on the training goals you want to achieve, the characteristics of the subject, and other relevant factors.

I've achieved good results with datasets ranging from 10 to 200 images.

Although different numbers of images can work, for beginners I think the 15-image recipe proposed by JernauGurgeh is a very good starting point. It's how I got started and achieved my first successes.

For example, my Dakota Johnson SoloTI, Erin Moriarty SoloTI, etc., were trained using 15 images each.

Dakota Johnson SoloTI

Erin Moriarty SoloTI



Meanwhile, j1551's recipe of 30 images is also good.

For example, my Kristen Stewart SoloTI, Anna Shcherbakova SoloTI, etc., were trained using 30 images each.

Kristen Stewart SoloTI

Anna Shcherbakova SoloTI


And there are many other TI training datasets with different numbers of images, such as:

Amanda Seyfried SoloTI 27 pics


Alexandra Daddario SoloTI 66 pics




Anya Taylor-Joy SoloTI 90 pics

One point I want to remind everyone is: based on my experience, the more images in the dataset, the harder it is to get good training results.

Faced with numerous images, your ability to select images will be put to the test.

8. Besides resizing, what other preprocessing is necessary for the images?

Well, it depends on the condition of your images and the goals you want to achieve.

For me, I prefer to use Photoshop to crop and process images one by one as needed, including adjustments such as brightness, contrast, color, and so on.

Sometimes it's necessary to remove certain elements from the images, such as extra people.

Faced with unretouched close-up photos, I even dusted off the photo retouching skills I had practiced many years ago.

Remember: moderation is key. Avoid over-preprocessing unless you're aiming for a specific effect.
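If you don't have Photoshop, the same kind of light adjustments can be scripted. A minimal sketch with Pillow (my own illustration, not part of any trainer; the file names, crop box, and enhancement factors are made up and should be tuned per image):

from PIL import Image, ImageEnhance

img = Image.open("raw/portrait_01.jpg").convert("RGB")

# Crop to a square region around the subject, then resize to the training resolution
img = img.crop((120, 0, 1144, 1024)).resize((512, 512), Image.LANCZOS)

# Gentle brightness / contrast / color tweaks - keep them subtle
img = ImageEnhance.Brightness(img).enhance(1.05)
img = ImageEnhance.Contrast(img).enhance(1.05)
img = ImageEnhance.Color(img).enhance(1.02)

img.save("dataset/portrait_01.png")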

9. Is it necessary to remove the background from the images?

In my experience, there is no need to do that.

10. Are caption files absolutely necessary?

No.

When training TI models with Kohya, I usually don't use caption files. But when using A1111, if there are no caption files, I feel the training results will be worse. So even if I have a dataset of 200 images, I still diligently check and modify each caption file.
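For reference, in A1111 a caption is just a plain .txt file with the same name as its image, and its contents are substituted into the training prompt wherever the prompt template file contains [filewords]. A made-up example of what one of mine might look like (file names and wording are hypothetical):

dataset/woman_001.png
dataset/woman_001.txt   ->   a woman with brown hair, smiling, wearing a black dress, standing outdoors

with a prompt template line along the lines of:

a photo of [name], [filewords]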



There are many more aspects about the dataset that could be written about, but I won't go into detail here.

The number of training images, the selection of different types of images, the degree of pre-processing applied to the images, combined with appropriate training parameters and so on, together shape the personality of a TI model.

This personality is reflected in the fact that different TIs will give different responses to the same prompts, resulting in noticeably different generated outcomes. Some TIs will give specific results for certain specific prompt content. The more images you generate using a TI model, the more apparent this "personality" trait becomes.

I believe the reasons for this are, first, the selection of the dataset, and second, the generalization and related capabilities that emerge from the training process.

Let me give two interesting examples:

Example A: My Emma Roberts SoloTI

I found that when using exactly the same prompts to generate images, this TI produces more diverse and intricate clothing combinations compared to other TIs.


Firstly, it's due to the dataset. This was a dataset of 30 images. When I review that dataset now, I can immediately understand where this TI's good "dress sense" comes from. Of course, it stems from Emma Roberts' good dress sense.

Furthermore, this TI is huge at 114 KB, which makes it not only the largest among all the TIs I've made, but also one of the largest among all TIs in the Civitai celebrity TI category.

There must be an entire wardrobe hidden inside that vast body.

Example B: My Tifa Lockhart SoloTI

Mainly due to the dataset, using this TI under certain prompts often leads to the characters wearing tank tops or bodysuits.





11. How should the training parameters be set?

The setting of training parameters is the most frequently asked question. It was also the question I asked the most in the beginning.

Here I want to reiterate: the most important thing is the dataset, while the training parameters are relatively secondary. If you have a good dataset, setting the training parameters won't be difficult.

What I want to tell everyone is that the common training parameter settings circulating online are all feasible. I have achieved good results with them, and some of the TIs I've released use those settings. At the same time, I have also failed with those very same settings.

Using A1111 as an example, let's briefly discuss the settings for some common training parameters.

Learning rate

The smaller the learning rate, the slower the learning, but it usually learns more meticulously.

The optimal setting still depends on the dataset as well as the batch size and Gradient Accumulation Steps.

The learning rate can be set to a fixed value, or within the range of fluctuating values mentioned in various online tutorials.

I usually choose a learning rate of either 0.003 or 0.004.

For me, in general, a learning rate lower than 0.003 is meaningless.
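As a side note, A1111 also accepts a stepped learning rate schedule instead of a single number, written as rate:step pairs. This is the kind of 'fluctuating' setting mentioned above; check the A1111 wiki for the exact syntax, and treat these particular numbers as nothing more than an illustration:

0.005:200, 0.003:1000, 0.001:4000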

Batch size (BS)

I've tried settings from 1 to 8. Generally, depending on the dataset, I would choose 1, 3, or 5.

A setting of 1 is the safest option. If even a setting of 1 doesn't yield good results, then it's not a batch size issue.

Gradient accumulation steps (GA)

I've tried setting the value from 1 to 60.

Warning: The larger the value of this item, the slower the training speed!

Similarly, 1 is the safest option. The appropriate maximum value depends on the dataset as well.

I usually don't set this item's value to exceed 8, but there's one exception that I'll mention later.

Also, I prefer BS × GA ≤ DATASET.

For example, when the dataset contains 20 images:

1 × 1 < 20

2 × 5 < 20

4 × 5 = 20

Max steps

The setting of training steps is closely related to the dataset, learning rate, batch size, and gradient accumulation steps.

Generally, this setting follows these rules:

The larger and more complex the dataset, the more training steps are needed.

The smaller the learning rate, the more training steps are needed.

The smaller the batch size and gradient accumulation steps, the more training steps are needed.

Without considering the complexity of the dataset, my usual setting is:

Max steps ÷ (DATASET ÷ (BS × GA)) = 200

Let's still take the example of a dataset with 20 images:

4000 ÷ (20 ÷ (1 × 1)) = 200

400 ÷ (20 ÷ (2 × 5)) = 200

200 ÷ (20 ÷ (4 × 5)) = 200
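If you'd rather not do this arithmetic by hand, here is a tiny helper that solves the same formula for max steps (my own convenience snippet, nothing official; the default of 200 is just my usual target from above):

def max_steps(dataset_size, bs, ga, repeats=200):
    # Max steps ÷ (DATASET ÷ (BS × GA)) = repeats  =>  max steps = repeats × DATASET ÷ (BS × GA)
    return round(repeats * dataset_size / (bs * ga))

print(max_steps(20, 1, 1))   # 4000
print(max_steps(20, 2, 5))   # 400
print(max_steps(20, 4, 5))   # 200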

Number of vectors per token

The most direct outcome of this setting is the size of the TI file. So, you can infer this setting from the size of a TI file.

My usual setting is 8 to 48, with 8 or 16 being safe options.

The specific setting also depends on the number of images in the dataset, the complexity of the dataset, and the training objectives.
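As a rough back-of-the-envelope check (my own estimate, not an exact rule): for SD1.5 each vector is 768 numbers, so at 32-bit precision a TI file weighs roughly

vectors × 768 × 4 bytes

e.g. 16 × 768 × 4 ≈ 48 KB, plus a little metadata. That is why a large file such as the 114 KB Emma Roberts TI mentioned earlier implies a fairly high vector count.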

12. Is there a universal, always-viable training parameter recipe?

For me, no.

Even with the exact same dataset and training parameters, on the same device, each training session will yield different results.

My training process typically involves:

If the first training results are good, I train one or two more times with no changes or only slight adjustments to the parameters, then select the result I consider the best.

If the first training results are unsatisfactory, the dataset needs to be adjusted or partially replaced. I then adjust the training parameters accordingly and try again, and keep adjusting until I'm satisfied.

Because each dataset is different, there is no fixed training parameter recipe that will always work for me.

But for beginners, I have a recommended recipe, which is the 'exception' mentioned earlier.

This recipe is not my invention, but rather the recipe previously published by JernauGurgeh.

To put it simply:

15 images, batch size (BS) = 3, gradient accumulation (GA) = 15

I think this is the easiest way for beginners to achieve success, and since it only requires 15 images, each training session is very short, so you can get results quickly and then adjust and experiment further.

This is the method I used to train my first successful TI. On a 3080 GPU, each training run takes only around 5 minutes.

When using this method, I recommend setting the number of training steps to 200.

One thing to note: I've never achieved results on Kohya with this recipe as consistently good as those I get on A1111. I suspect that the setting GA = 15 might trigger some special handling in A1111. Of course, this is just a suspicion (I haven't carefully reviewed A1111's training code), and it doesn't affect anyone's use of this recipe on A1111.

If you have already achieved success with JernauGurgeh's method, then j1551's method is also worth trying:

30 images, batch size (BS) = 6, gradient accumulation (GA) = 5, Steps = 300
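For reference, plugging both recipes into the repeats formula from question 11 (my own arithmetic, just to show how they compare with my usual 200-repeat target):

200 ÷ (15 ÷ (3 × 15)) = 600

300 ÷ (30 ÷ (6 × 5)) = 300

So both recipes train each image noticeably more than 200 times, and the first one also goes well beyond my usual GA ≤ 8 and BS × GA ≤ DATASET preferences, which is exactly why I called it the 'exception'.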

If you are also successful with j1551's method, and you still have the interest to continue exploring, then you should have many of your own ideas and insights that you can freely try out.



Okay, I'll stop here. I hope this has been helpful for beginners.

I want to emphasize again that these are just my own experiences and thoughts. I'm not a professional, and the content above may contain numerous errors and cognitive biases. Please understand. I also ask that experts not mock or ridicule me.

Judging from both the discussion volume on Reddit and the download counts on Civitai, once highly popular TIs are gradually being forgotten as the number of SD v1.5 users decreases. Many new SD users only know about LoRA and not TI anymore.

I enjoy TI just like other friends who are still using TI. I believe that once there is an efficient TI training tool, TI will still be the preferred choice for a large number of users in the era of SD3.
