
Flux1d2 Training Guide: Making base models and merging LoRAs hurts.


Oct 23, 2024

(Updated: 9 months ago)


Proof of concept:

Preface

I did manage to merge, and I even formed a process to do it consistently, but it isn't a simple process, nor anywhere close to an easy one. It's roughly a 14-step process fraught with errors, pitfalls, and layers of required testing (the outcomes are inconsistent) just to get a base model merged with its LoRA components and onto a runpod for functional kohya_ss training. Not to mention the file sizes are absolutely atrocious, so I'm sure my ISP loves me.

You will probably need more than the 64 GB of RAM I had. Everything got faster and smoother when I upped to 128 GB of DDR4. FYI devs, that's a thing.

Trainers

There are currently three popular training programs that I've seen:

  1. kohya_ss (built on sd-scripts)

  2. OneTrainer

  3. SimpleTuner

Okay, so each has its own quirks and its own problems, and each problem set has its own sub-problems. The majority of them actually inference together without any issues, though, and I was diving down the rabbit hole of why the other day when I realized the math involved is beyond my understanding. It seems to rely on a series of chaos-theory concepts concatenated into a form of mathematical path- and pattern-recognition analysis to actually force the thing to inference with them. I'm not going to port that, and I'm not going to write it either.

Inference

There are three primary systems of inference that I've seen:

  1. The base FLUX inference system, which uses their in-house scripts and isn't very flexible. Some of those scripts were adopted by other systems, but the majority have been rewritten one way or another to get inference working.

  2. ComfyUI:

    • has an in-house implementation of Flux inference that, quite frankly, runs terribly. Even on a 4090 I get tons of lag and browser problems if I'm running anything alongside it. ComfyUI is a RAM hog, and the configuration changes that mitigate this all reduce performance heavily.

    • the SD3.5 implementation seems to have compounded the lag and made it worse. Not to mention SD3.5 seems lesser than Flux on first use, but that's just a first impression.

  3. Forge/A1111:

    • the inference system is very fast, and VRAM allocation can be adjusted manually

    • VRAM control at runtime heavily improves overall performance, which lets me inference with multiple other LoRAs and multiple models in a short time because it's so optimized and quick.

Merging

I've found a few methods of doing this, and I'll go through them here.

  1. Merging LoRA UNets.

    • Now you may think this is the easy part, and you're kind of right, but it's not. Merge systems commonly come in one of two flavors: the concatenation merge and the full merge (a sketch of both appears after this list).

    • Kohya_SS:

      • Merging to checkpoint will not save the UNet and TE encodings alongside each other; the TE-encoding blocks are simply ignored for LoRAs trained with them.

        • This means the TE effects aren't going to concatenate with a baseline system simultaneously.

        • Trying to merge something into a ComfyUI-concatenated model produces an error, and it always has.

      • The TE is simply omitted, which means you have to cowabunga it some other way, hoping you matched the correct weight values and hoping that whichever other system you're using actually allows this.

      • When you DO manage to concatenate:

        • Concatenation often causes literally nothing to happen when merged with a model. Either the weights were too low and were somehow juiced up by the original inference setup to produce the images, or the output simply doesn't conform to the inference you've been testing and using.

      • Full merge blends all the blocks together, and the outcome produces very bleedy, overlapping effects. There are a lot of problems with this, and you can tell just by looking at it that the SDXL bois n grills have solved many similar problems in the past with the large array of merge options and choices available there.

    • ComfyUI:

      • Much more effective at merging clip_l, since you can control the TE strength for clip_l, but you cannot save the UNet separately.

      • A full merge to checkpoint requires extracting the UNet if you want to train the model further.

      • USE THE METHOD BELOW to create a usable, compacted inference version of your base model; this is for testing purposes.

      • DO NOT save the UNet itself through Comfy using the save-model node. It saves in an incorrect, unlabeled format, likely an incorrect or outdated diffusers layout, but I currently have no way to convert it and haven't researched how, so it's of no use to me.

      • There is no detection to determine quantization ahead of time (the snippet after this list shows one way to check after the fact).

      • Cowabunga dude. Good luck. Hope you remembered the 45 steps involved in making it work.

  2. Merging TRAINED LORA CLIP_L and T5 blocks:

    • Base models like Flux1D2 and FluxDeDistilled don't have a built-in clip_l or t5xxl, so you have to rely on something to concatenate them, or concatenate them yourself with a Python script.

    • A ComfyUI merge is a viable option. You can load the LoRAs and UNet through the inference system, and then merge the clips with a concatenated model's clip, which often amounts to a gradient difference for me.
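To make the two merge flavors above concrete, here's a minimal PyTorch sketch of what they amount to, as I understand it. This is my own illustration, not kohya_ss or ComfyUI code; the tensor names (W, down, up) and the scale values are placeholders.

```python
import torch

# Illustrative only: W is one base weight matrix, down/up are one LoRA's
# low-rank factors. Real checkpoints use much longer key names.

def full_merge(W: torch.Tensor, down: torch.Tensor, up: torch.Tensor,
               scale: float) -> torch.Tensor:
    """Bake the LoRA delta straight into the base weight.
    Shapes: W (out, in), down (rank, in), up (out, rank)."""
    return W + scale * (up @ down)

def concat_merge(downs: list[torch.Tensor], ups: list[torch.Tensor],
                 scales: list[float]) -> tuple[torch.Tensor, torch.Tensor]:
    """Stack several LoRAs into one wider LoRA along the rank axis.
    The base weight is untouched; the combined LoRA rides alongside it."""
    down = torch.cat([s * d for s, d in zip(scales, downs)], dim=0)
    up = torch.cat(ups, dim=1)
    return down, up
```

The full merge bakes the deltas in permanently, which is part of why stacked LoRAs start bleeding into each other; the concatenation merge keeps each LoRA's subspace separable, which is also why tools expecting a fixed rank can choke on the wider result.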
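And since neither tool labels what it saved, the fastest way to answer the format and quantization questions above is to open the file and look. A minimal sketch using the safetensors library; model.safetensors is a placeholder path:

```python
from safetensors import safe_open

# Point this at whatever you just saved. Key names hint at the layout
# (UNet-style vs. diffusers-style) and dtypes reveal the quantization.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    print("metadata:", f.metadata())   # often None or nearly empty
    for key in sorted(f.keys())[:20]:  # the first 20 keys are usually enough
        t = f.get_tensor(key)
        print(f"{key}  shape={tuple(t.shape)}  dtype={t.dtype}")
```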

Alright I'm cutting the description short since I have work to get to soon.

Full Kohya_SS runpod process post-setup

  1. Download Flux1D2 from here or huggingface <- this is your training model.

  2. Download FluxDeDistilled (fp8) from here or huggingface <- if you are using this for inference.

  3. Determine a ratio you want your loras merged at.

  4. Calculate the full merge ratio based on the strength of the overall impact you want the LoRAs to have, then divide by the number of LoRAs (a short sketch after this list does the arithmetic).

    • strength / LoRA count

    • FULL MERGE for training AND inference. If it doesn't turn out well, try again. Don't bother inferencing Flux1D2 directly, it looks like shit.

    • So say you have 3 LoRAs at strengths 1.2, 0.3, and 0.73:

      • 1.2 / 3 = 0.4

      • 0.3 / 3 = 0.1

      • 0.73 / 3 ≈ 0.2433, round to 0.24

  5. Set those three ratios in Kohya and choose your base UNet to merge into.

    1. Run this for both your training model and your inference model. Make sure you name them according to their version and their purpose, or you'll forget later. I have about 50 and basically want to delete the folder at this point; I've defaulted to checking timestamps and comparing them to config files.

    2. If you don't do it this way, the outcome will be inconsistent with the LoRAs run through ComfyUI and Forge trained on the new model later. You are essentially discarding everything for this, so be sure to test whether it's what you want, and keep the training version on the side if it looks good.

    3. NAME THEM.

  6. Take your merged UNet, go to ComfyUI, set the base tested ratios that you inference with for the clips, and then load the LoRAs in the correct sequence through the LoRA loaders.

  7. Save the full concatenated checkpoint, and save the clip on the side.

    1. Run this for your training model; don't bother saving your inference model's clip_l.

  8. You now have three parts: the UNet generated from kohya_ss, and the full checkpoint and clip_l generated from ComfyUI.

  9. Zip them up and send them to runpod using runpodctl; it's easier if you set the config and the names in the config ahead of time (see the packaging sketch after this list).

    1. ae.safetensors << flux1d specific

    2. unet-v12345.safetensors << your merged unet

    3. clip_l-v12345.safetensors << your merged clip_l

    4. t5xxl_fp16.safetensors << your merged t5 if necessary

    5. config.json << your configuration

  10. Unzip after a long wait (even on fiber it can be upward of 10 minutes).

  11. Dry run the models with 1 image so you know it'll work. Debug.

  12. Run your scripts like cheesechaser or whatever and get the necessary images, or load your dataset through runpodctl.

  13. Run kohya_ss, generate samples, debug, edit, etc.

  14. Get your completed model, rinse, repeat.
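The step-4 arithmetic is trivial, but it's easy to fat-finger when you're juggling 50 merges, so here's a throwaway sketch using the example strengths from step 4 (the LoRA names are placeholders):

```python
# divide each working strength by the LoRA count, then round to 2 places
strengths = {"lora_a": 1.2, "lora_b": 0.3, "lora_c": 0.73}
count = len(strengths)
ratios = {name: round(s / count, 2) for name, s in strengths.items()}
print(ratios)  # {'lora_a': 0.4, 'lora_b': 0.1, 'lora_c': 0.24}
```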
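And for the step-9 packaging, a minimal sketch assuming the filenames from that step (swap in your own version tags from step 5). The transfer itself is still plain runpodctl send / runpodctl receive on the resulting zip:

```python
import zipfile

# the transfer set from step 9; adjust names to your own versioning scheme
files = [
    "ae.safetensors",             # flux1d-specific VAE
    "unet-v12345.safetensors",    # merged UNet from kohya_ss
    "clip_l-v12345.safetensors",  # merged clip_l from ComfyUI
    "t5xxl_fp16.safetensors",     # merged t5, if you needed one
    "config.json",                # training configuration
]

# ZIP_STORED: safetensors barely compress, so skip the cost of deflate
with zipfile.ZipFile("transfer-v12345.zip", "w", zipfile.ZIP_STORED) as z:
    for path in files:
        z.write(path)
```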
