Sign In

[SD3] Diagram of UNET and exotic merging methods (v6.95)

[SD3] Diagram of UNET and exotic merging methods (v6.95)

This article is oversimplified. If you want to know more, please visit my Github page. Links in the bottom session. My Github is updated daily.


v6.95: Edited the L2 chart in Chapter 8 for concept illustration.

v6.9420: Thanks mcmonkey now I can make the diagram of SD3. Spoiler: No more UNET!

v6.10086: UwU SD3 merger is out! (no models to merge yet) And there are 24 MBW layers for the mmDiT! (no more UNET) Waiting Diffuser support for visualization. Or I use this instead?

v6.1: Replace the L2 distance diagram with full blown 118 models.

v6: Added TGED (chapter 8 and chapter 0).

v5: Added TSD (chapter 7).

v4: Added TIES-SOUP (chapter 5), Git Rebasin (chapter 6). Meanwhile put it to new category "ML research".

v3: Added SDXL content (chapter 4), marked as v3 because engineering advancement still counts.

v2: Added SD2 content (chapter 0), and a few wiki links.

v1: Initial content (SD1)

Diagrams of UNET (SD1/SD2/SDXL)

  • Generated by torchview along with diffusers. See the python notebook for actual generation code.

  • MBW / LBW layers are mapped manually, with layers name retrieved by safetensors. MBW layer name are just named by common practice with developers' intiution. You can ask author for verification.

  • "Module layers" can be more then a single lower tier layers, down to an actual neuron layers. See this JSON dump with a full layer list.

  • I don't know which layer draw hands and which layer draw anime styles either. Seek AI professionals for advice.




Known missing labeled layer: label_emb (which made MBW merger breaks)


No I won't update AutoMBW. Ask supermerger lol. Typical transformer.

Exotic merging methods

Full article in my Github article "AstolfoMix", and "AstolfoMix-SD2". I am not an AI professional, please always seek for processional advice. I will go ranting.

0. "All models are wrong, but some are useful"

This is "point 0" because it is not related to merge, but it is critical for merging models.

From the history of SD2, it is a disaster. Since WD1.4 and 1.5, until ReplicantV3, I suspect malfunction of both trainer and runtime, along with wrong configuration on the trainer (and slightly controversial tagging approach), making most models of the era is unuseable.

However, with a core discovery of "Replicant-V3 UNET + WD1.5B3 CLIP", as the result of tedious process of model selection, I quickly "test" all discovered models with seperated UNETs and CLIPs under controlled naive "uniform mix" of other components. After a few pass of storage consuming model sets with pattern recognition, finally I shortlisted 12 UNETs with 4 CLIPs, under total of 15 models, in a set of 24 discovered models.

The high diversity of model weights, including both nice and broken weights, merging them may yield glitched imges and even break the merger.

For SD1, it is a lot better. Just keep reset the CLIP to SD's original CLIP, and choose UNETs freely. If you need "trigger words", use LoRA instead. If you doubt why use SD's CLIP, because NAI use SD's VAE and CLIP also. Use this toolkit to replace.

0. "Just buy the haystack"

Applied since AstolfoMix-XL (TIES-SOUP).

This is also a "point 0", but different from my old filtering streadgy.

Thanks for the advancement of merging alforithm, we can finally accepts all model without worring the model weights contradicting together. The only requirement is making sure the valueable weights are the majority of the model set.

TIES did this by voting on sign movement, and Geometric Median, is a famous 51% attack solver.

You will see clearly Pony variants are spread in a corner, Kohaku D / E series are spread in other corners, and others are stay in a very dense cluster (my mixes are not included).

1. Uniform SOUP / Isotropic Merge

Applied to AstolfoMix (Extended).

It is just Ensemble averaging from 1990s, proposed by Polyak. As simple as its definition. No hyperparameters needed.

Terms "Uniform soup" comes from Model Soup, meanwhile "Isotropic Merge" comes from Fisher Merge.

You can either merge with math series (1/x for x > 2, change ui-config.json for better precision):

"modelmerger/Multiplier (M) - set to 0 to get model A/step": 0.0001,

Or use parallel mege with weight = 0.5 throughout the process.

To make noiticeable difference in performance with stable visual style, I found that it takes around 6 models to stable, but somehow I found 20 models which can generate 1024x1024 images.

You can merge ALL the LoRAs in the Internet to hope a tiny SD1 can beat SDXL with this approach, but I don't have time to do so.

2. Bayesian merge / AutoMBW / RL

Applied to AstolfoMix (Reinforced).

All 3 concepts are related. Full article available.

Bayesian Optimization over MBW (a framework) as a variant of Reinformcement Learning (or more precise, multi-armed bandit). Bought to SD by s1dx. Somehow archieved by me with AutoMBW. S1dx use Chad score for reward, but I use ImageReward because it is more convincing. AutoMBW also support other reward model and optimization algorithms (and some fancy feature which are untested), but they are yet to be proven. However S1dx supports "Add diff", but my extension dosen't.

To perform RL, we make payloads (prompts / parameters / settings you think worth to learn), choose a reward model, and leave your GPU burning. It will output the "best merging receipe".

(Explaination is omitted) RL doesn't need dataset (and tedious preparation / preprocessing), and it is easier to archive along with multiple black box as environment. However the design of payloads is important, and it does overfit to payload and damage diversity. As discussed in recent LLMs, alignment between Reward models and human evalulations are also important. We shuold study carefully how such reward model (aesthetic score predictor in this example) is produced, and determine if such model is useful and correlate to the objective we want to archive.

Usually it don't need 256 H100 to do so, but it may takes a lot of time, with up to 120 iterations to archive early stopping, and the process is hard to be parallel. It took 3 weeks to train for my model, using 2x RTX 3090.

Parallel mege is expected, and all 20 models must be merged for desired effect. Feature selection occurs in an unpredictable mannar, and without obvious preservation. There is no "performance preview" until the very last merge. However with long runs, it tends to be "uniform merge" with optimized "direction", because the weight initialization is pure random by refault.

The reason it works (without either traditional finetuning or inventred "MBW theories") is simple: Feature extraction with 27 parameters in somewhat useful areas are already effective.

3. Special case of "Add difference"

Applied to AstolfoMix (21b).

This is the hardest part to understand. For "add difference", we can take the special case "b-c=a+1" to make "one more thing". Notation "a" is easy to understand: They come from same set of 20 models. and for the notation "1", it is the "adjusted direction".

This diagram may make it easier to understand:

It does preserve most of the "good features" and further "adjusted to the best direction" a.k.a balance of all 20 models. Such balance may show its strength in extreme conditions, such as DDIM 500 STEP with CFG=1 with absolutely no prompts. However this is a bit far from academic.

Here are the Github articles in my repo:

FULL receipes are included (a lot, talking about 20 models):

4. "E2E" merge procedure with FP64 precision

Applied to AstolfoMix-XL (Extended-FP64)

When the model pool is being really large (say 70), merging them by hand will be impossible, meanwhile casting model weights to FP16 multiple times will propogate the precision error, and managing recipe will be hard, therefore a dedicated merger is used to automate the process. All merges are done in memory with FP64 precision therefore such error can be prevented.

Such merge still used hours to obtain, and it will be days if it is done by hand. Also, the elimination of "precision error" is visiable in resulting image, and it can be propogated to a distinct image.

5. TIES merge and "TIES-SOUP" merge

Applied to AstolfoMix-XL (TIES-SOUP)

Don't be afraid of a paper reaching academic conference ("above arxiv"), I'll keep it in short: "Model parameters votes for their math sign to reduce noise".

My "TIES-SOUP" version is made because 50% by accident (it works), and 50% that the correct TIES implementation actually doesn't work, and I need to get it theorized again.

It successfully "dissolved" the poor Pony into the model soup, but it will forget half of random details (and most of the quality tags yay), and made "Untitled" art finally possible (theory is present but the prediction was too noisy). Also, I have found that it is no more associative, where covariance exists, so no model filtering has the best performance (which I did in previous merge).

If you found that you RAM is not large enough to handle many SDXL models (310GB for 73 SDXL models), build a proper workstation or just choose what you belief worth to merge (should be 0.8x of all model size).

6. Git Re-Basin: Is a merge

(No model published)

Referring this paper. The algorithm is too complicated to post here, and I found that there was only one correct implementation in the wild (better than nothing haha), and spent 4 hours to merge a pair of SDXL model. It is infeasible to find an algorithm requires evalulation of the model (a.k.a. T2I) or solving optimization problems in place. It was designed for VGG which is simple in structure, it may not work for SD / LLM which is complicated (parameter count with many components).

7. DARE merge and "TSD" merge

Applied to AstolfoMix-XL (TSD)

Don't be afraid of another paper reaching academic conference ("above arxiv"), I'll keep it in short: "Dropout And REscale". Dropout is a common technique while trainine neural networks. Picking the nerurons to disable is by pure random without looking for any meanings.

My "TSD" version is made because 50% by accident (it works again), and 50% that the correct TIES implementation actually doesn't work, and I need to get it theorized again.

The effect is similar to TIES-SOUP, because it is just an additive above algorithm. It takes 450GB for 102 SDXL models with limited threads (merge 16/2515 layers at a time). BTW, 10% droupout rate is preferred, paper's 90% are just showing the properties in GPT instead of SD.

Also, since randomness is introduced, changing seed also changes the final result (hint: livery).

8. Model Stock and "TGMD" merge

Applied to AstolfoMix-XL (TGMD).

Yes, I read the paper. No, I'm confused. I know it is trying to find a center, but cosine similarity (for 2 vectors) just doesn't comply with 100 SDXL models. Even the paper doesn't clearly state what kind of "center" it is. Averaging is already finding the centroid of the models i.e. a kind of center. So what I can think of is a kind of median... Maybe geometric median.

As soon as trying the GM against mean / centroid (a.k.a. average), I found that it is effective. Then plugging it into current DARE / TIES stack, yes it works. Here is the current modification.

(This may be inaccurate) This is the concept illustration on how and why it works:

You can read Chapter 0 for why GM works. Since SD3 is releasing on June 12, I want to rush before I have absolutely no chance to be noticed, even I don't think there will be another 100 SD3 models for me to merge.