Published | Mar 3, 2025 |
Base Model | Stable Diffusion 3.5 Medium |
Training | Epochs: 2 |
Hash | AutoV2 629D9F8874 |
*Important: please read this before using the model. It is very experimental, as I am still trying to find the optimal settings, and it will slowly exit beta as training becomes more stable.
You can download the CLIP text encoder here: https://huggingface.co/suzushi/miso-diffusion-m-1.0
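If you prefer scripting the download, here is a minimal sketch using huggingface_hub (my suggestion, not the author's instructions); it fetches the whole repository into the local cache:

```python
# Minimal sketch: download the text encoder repo with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="suzushi/miso-diffusion-m-1.0")
print(local_dir)  # local folder now containing the repo files
```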
I will also write two articles soon on the details of the model.
Miso Diffusion M 1.0 is an attempt to fine-tune Stable Diffusion 3.5 Medium on an anime dataset. In ComfyUI it uses as little as 2.4 GB of VRAM without the T5 text encoder. This version is a step up from the previous version (beta): it was trained on the same 160k images for 3 more epochs, then fine-tuned on 600k images for another 2 epochs. (Two epochs were chosen because further training caused it to generate more artifacts and blurry images.)
Recommended settings: Euler, CFG 5, 28-40 steps, denoise 0.95 or 1.
Prompting: Danbooru-style tagging. I recommend simply generating with a batch size of 4 to 8 and picking the best one (see the sketch below). The model struggles with hands and complex poses; you can add "upper body" to the prompt so it doesn't generate a full-body shot.
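For illustration, a minimal inference sketch with the diffusers library rather than ComfyUI, applying the recommended settings above. The local checkpoint path and prompt are hypothetical, and it assumes the model has been converted to diffusers format; passing `text_encoder_3=None` is diffusers' documented way of dropping T5 for SD3 pipelines to save VRAM:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Hypothetical path to a diffusers-format conversion of the checkpoint.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "./miso-diffusion-m-1.0",
    text_encoder_3=None,   # drop T5-XXL, as the card suggests
    tokenizer_3=None,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # further reduces peak VRAM

images = pipe(
    "masterpiece, very aesthetic, 1girl, upper body",  # Danbooru-style tags
    negative_prompt="low quality, unpleasant",
    num_inference_steps=28,     # recommended 28-40 steps
    guidance_scale=5.0,         # recommended CFG 5
    num_images_per_prompt=4,    # batch of 4, pick the best one
    height=1024,
    width=1024,
).images
for i, img in enumerate(images):
    img.save(f"sample_{i}.png")
```

The default SD3 scheduler in diffusers is a flow-matching Euler scheduler, which matches the recommended Euler sampler.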
Quality tags
masterpiece, perfect quality, high quality, normal quality, low quality
Aesthetic tags
very aesthetic, aesthetic
Pleasant tags
very pleasant, pleasant, unpleasant
Additional tags: high resolution, elegant
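Purely as an illustration of how these tag groups might be combined into a single front-loaded prompt (the subject tags here are hypothetical):

```python
# Illustrative only: assembling a prompt from the tag groups above.
quality = "masterpiece"            # strongest quality tag
aesthetic = "very aesthetic"
pleasant = "very pleasant"
extra = "high resolution, elegant"
subject = "1girl, upper body"      # "upper body" helps avoid full-body shots

prompt = ", ".join([quality, aesthetic, pleasant, extra, subject])
negative = "low quality, unpleasant"
print(prompt)
```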
Training was done at 1024x1024, though since the model natively supports 1440, certain prompts work at 1440x1440 as well.
Training was done on a GH200 with 96 GB of VRAM.
Training settings: Adafactor with a batch size of 40, lr_scheduler: cosine
SD3.5-specific settings:
enable_scaled_pos_embed = true
pos_emb_random_crop_rate = 0.2
weighting_scheme = "flow"
learning_rate = 3e-6
learning_rate_te1 = 2e-6
learning_rate_te2 = 2e-6
Train CLIP: true, train T5-XXL: false
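For reference, a sketch of how these values might sit together in a kohya sd-scripts style TOML config. Keys not quoted above (optimizer_type, train_batch_size, lr_scheduler, and the text-encoder toggles) follow common sd-scripts conventions and are assumptions, not the author's actual file:

```toml
# Sketch combining the settings above; unquoted key names are assumptions.
optimizer_type = "adafactor"
train_batch_size = 40
lr_scheduler = "cosine"
learning_rate = 3e-6
learning_rate_te1 = 2e-6   # CLIP-L
learning_rate_te2 = 2e-6   # CLIP-G
enable_scaled_pos_embed = true
pos_emb_random_crop_rate = 0.2
weighting_scheme = "flow"
train_text_encoder = true  # train CLIP; flag name is an assumption
train_t5xxl = false        # keep T5-XXL frozen; flag name is an assumption
```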
Developing a base model is costly, so if you like my work, please consider a donation. Thanks a lot: https://ko-fi.com/suzushi2024