The next step of the Kohaku series models

SD3, HunYuan or Lumina

Some people may know that kohya and I are working on the trainer implementations for SD3 and HunYuan, and may have assumed that I would train a Kohaku SD3 or Kohaku HY. Unfortunately, these two base models have some shortcomings that I can't ignore.

SD3 has well-known anatomy issues, and HunYuan needs roughly 2~3 times as long to train as SDXL for the same number of samples seen.

Neither model is a perfect choice for me. I have said I will give HunYuan or SD3 a try. I will, but not in the near future.

(As for Lumina, Gemma 2B is a bad choice for me, so I'll pass. I would even prefer SD3 over Lumina.)

What I'm doing

I have seen some weird comments on forums like 4chan saying something like "Kohaku is wasting his time on shit things", so I think I should explain what I have been doing in the months since I stopped publishing models.

Dataset

Basically, I'm preparing a larger dataset (15~25M images in total, from which I will choose 7.5~15M for training) consisting of Danbooru, E621 and Pixiv (for some specific concepts) images. I'm also making some better utilities for dataset construction (like a preprocessing pipeline and a more general hakubooru).

Besides collecting images, I'm also doing natural language captioning for anime images now. I may caption 5~7M images, and I plan to open source these datasets as well.

Also, CyberMeow and I have been working on GBC10M recently, which could be helpful for image-text related tasks. Check it out if you are interested.

DanTagGen/KGen project

After training Kohaku XL Delta and Epsilon, I found a serious problem with the current situation: captions that are good for training are not good for actual use.

Basically, training the model on more complex, well-formed captions gives better results. But it is hard to require users to write prompts that complex. So DTG came out.

Since I'm building a larger dataset with more captions of better quality/diversity, I will also focus on training a better "DTG" for it.

(Fun fact: the "shitty things" mentioned by that random 4chan user is DTG, and DTG is way more popular than my models, LoL)

Next Step

Kohaku series model

After I finish the dataset and the "DTG" things, I will start training the next version of the Kohaku series model. It will not use any new architecture; it will still be SDXL. This one is for experiments with my new dataset and new "DTG".

Pretraining

I have pretrained some class-conditional or text-conditional diffusion models before for research/study purposes, but I have never tried to pretrain a general T2I model. I'm planning to pretrain a T2I model and share everything I learn. This may take a lot of time, but it is definitely worth it. At least it's not "shitty things".

About open source

Unlike some "well-known" models, I will still open source all the datasets and code I use. It may not be exactly the same (for the Pixiv dataset, for example, I may only release the image CDN URLs and metadata), and it may not be the full dataset I have on my disk (for example, data from private sources). But it will definitely be the dataset I used for training the Kohaku series models.

If you have any questions or suggestions, join my Discord:

https://discord.gg/tPBsKDyRR5