Sign In

Diagram of UNET and exotic merging methods (v2)

Diagram of UNET and exotic merging methods (v2)

This article is oversimplified. If you want to know more, please visit my Github page. Links in the bottom session. My Github is updated daily.


v2: Added SD2 content (chapter 0), and a few wiki links.

v1: Initial content (SD1)

Diagrams of UNET (SD1/SD2/SDXL)

  • Generated by torchview along with diffusers. See the python notebook for actual generation code.

  • MBW / LBW layers are mapped manually, with layers name retrieved by safetensors. MBW layer name are just named by common practice with developers' intiution. You can ask author for verification.

  • "Module layers" can be more then a single lower tier layers, down to an actual neuron layers. See this JSON dump with a full layer list.

  • I don't know which layer draw hands and which layer draw anime styles either. Seek AI professionals for advice.




Known missing labeled layer: label_emb (which made MBW merger breaks)

Exotic merging methods

Full article in my Github article "AstolfoMix", and "AstolfoMix-SD2". I am not an AI professional, please always seek for processional advice. I will go ranting.

0. "All models are wrong, but some are useful"

Featured in AstolfoMix-SD2.

This is "point 0" because it is not related to merge, but it is critical for merging models.

From the history of SD2, it is a disaster. Since WD1.4 and 1.5, until ReplicantV3, I suspect malfunction of both trainer and runtime, along with wrong configuration on the trainer (and slightly controversial tagging approach), making most models of the era is unuseable.

However, with a core discovery of "Replicant-V3 UNET + WD1.5B3 CLIP", as the result of tedious process of model selection, I quickly "test" all discovered models with seperated UNETs and CLIPs under controlled naive "uniform mix" of other components. After a few pass of storage consuming model sets with pattern recognition, finally I shortlisted 12 UNETs with 4 CLIPs, under total of 15 models, in a set of 24 discovered models.

The high diversity of model weights, including both nice and broken weights, merging them may yield glitched imges and even break the merger.

For SD1, it is a lot better. Just keep reset the CLIP to SD's original CLIP, and choose UNETs freely. If you need "trigger words", use LoRA instead. If you doubt why use SD's CLIP, because NAI use SD's VAE and CLIP also. Use this toolkit to replace.

1. Uniform merge

Applied to AstolfoMix (Extended).

It is just Ensemble averaging from 1990s, proposed by Polyak. As simple as its definition. No hyperparameters needed.

You can either merge with math series (1/x for x > 2, change ui-config.json for better precision):

"modelmerger/Multiplier (M) - set to 0 to get model A/step": 0.0001,

Or use parallel mege with weight = 0.5 throughout the process.

To make noiticeable difference in performance with stable visual style, I found that it takes around 6 models to stable, but somehow I found 20 models which can generate 1024x1024 images.

You can merge ALL the LoRAs in the Internet to hope a tiny SD1 can beat SDXL with this approach, but I don't have time to do so.

2. Bayesian merge / AutoMBW / RL

Applied to AstolfoMix (Reinforced).

All 3 concepts are related. Full article available.

Bayesian Optimization over MBW (a framework) as a variant of Reinformcement Learning (or more precise, multi-armed bandit). Bought to SD by s1dx. Somehow archieved by me with AutoMBW. S1dx use Chad score for reward, but I use ImageReward because it is more convincing. AutoMBW also support other reward model and optimization algorithms (and some fancy feature which are untested), but they are yet to be proven. However S1dx supports "Add diff", but my extension dosen't.

To perform RL, we make payloads (prompts / parameters / settings you think worth to learn), choose a reward model, and leave your GPU burning. It will output the "best merging receipe".

(Explaination is omitted) RL doesn't need dataset (and tedious preparation / preprocessing), and it is easier to archive along with multiple black box as environment. However the design of payloads is important, and it does overfit to payload and damage diversity. As discussed in recent LLMs, alignment between Reward models and human evalulations are also important. We shuold study carefully how such reward model (aesthetic score predictor in this example) is produced, and determine if such model is useful and correlate to the objective we want to archive.

Usually it don't need 256 H100 to do so, but it may takes a lot of time, with up to 120 iterations to archive early stopping, and the process is hard to be parallel. It took 3 weeks to train for my model, using 2x RTX 3090.

Parallel mege is expected, and all 20 models must be merged for desired effect. Feature selection occurs in an unpredictable mannar, and without obvious preservation. There is no "performance preview" until the very last merge. However with long runs, it tends to be "uniform merge" with optimized "direction", because the weight initialization is pure random by refault.

The reason it works (without either traditional finetuning or inventred "MBW theories") is simple: Feature extraction with 27 parameters in somewhat useful areas are already effective.

3. Special case of "Add difference"

Applied to AstolfoMix (21b).

This is the hardest part to understand. For "add difference", we can take the special case "b-c=a+1" to make "one more thing". Notation "a" is easy to understand: They come from same set of 20 models. and for the notation "1", it is the "adjusted direction".

This diagram may make it easier to understand:

It does preserve most of the "good features" and further "adjusted to the best direction" a.k.a balance of all 20 models. Such balance may show its strength in extreme conditions, such as DDIM 500 STEP with CFG=1 with absolutely no prompts. However this is a bit far from academic.

Here are the Github articles in my repo:

FULL receipes are included (a lot, talking about 20 models):