Turning EPS Model into VPred Model, and back to EPS "full finetune" (v2)


Ref: my GitHub article, this, this, and then this. The "Karmix" refers to this; the author decided to release it here.

This is a progress report more than a tech report. Meanwhile, I admit that this time I'm relying on my art sense instead of my AI / ML knowledge, so you won't see equations here.

Warning: You won't understand the entire article unless you have read my previous articles, tried my models, or have experience with my merging / training techniques. You may be able to verify that I'm not relying on an LLM (vibe research): these are all original contents and theories. They lean on general AI / ML principles rather than specific arXiv papers. You may use an LLM to "decompress" my contents and hopefully gather some concepts in between.

Changelog:

v2: It works. BTW, no Astolfo on the cover, to keep it SFW.

v1.1: Updated the "coming soon" section because it works better than expected.

v1: Initial content.

1. Problem: Turning an EPS model into a VPred model is hard.

Applied to AstolfoCarmix-VPredXL (AK-NIL1.5 Based).


Conditions found to turn an EPS model into a VPred model:

Warning: Not rigorously proven!

  • Use a subset of the EPS dataset, or at least one aligned to a large extent. This includes guessing which images were dropped, and the caption text behind each image. One of the easiest routes is 1girl AI slop generated from that model. My friend converted it with just 1k slop images, while I failed with 6k images, and even 12.4M images were far worse. Sure, model collapse may happen, but the conversion really needs a good bias for the new math variable. Otherwise, you are actually pretraining the entire model at the math level.

  • (Not verified by me) Train only part of the model, such as OUT08 and OUT02. This relies more on believing in MBW magic. Heard from the NoobAI Discord server.

  • Run a legit training process with evaluation. Images can break, but must not collapse back into abstract art. The loss curve has no correlation with the learning progress. It is fine if the distortion is limited to color fragments. Stop training once the model loses shapes, which can be tested at an early stage.

  • Merge with the base model. If you cannot obtain the base model's dataset (e.g. the original Pony / NoobAI dataset), just don't be a purist about model training. You need to search manually for the optimal hyperparameters.

  • Make sure the VPred trainer and WebUI are working as intended. Notice that ComfyUI behaves differently from A1111 and ReForge: ComfyUI applies EPS at some point. I prefer testing with A1111 since it relies on the original code from StabilityAI (hence the crafted yaml file).


  • If you are going big and making another SDXL VPred finetune, I have bad news for you: enjoy the glitched images. After a few tries, the "task failed successfully" with a solid NaN, so I quit the VPred conversion until I have my next 1EP done.

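For reference, the objective swap behind all this can be written down. A minimal sketch in standard DDPM notation (my own summary for context, not the author's trainer code): the v-prediction target is `v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0`, so it is an exact reparameterisation of the EPS target.

```python
import torch

def eps_to_v(eps: torch.Tensor, x_t: torch.Tensor,
             alphas_cumprod: torch.Tensor, t: int) -> torch.Tensor:
    """Convert an EPS-parameterised target into the v-prediction target:
    v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0."""
    a = alphas_cumprod[t] ** 0.5          # sqrt(abar_t)
    s = (1.0 - alphas_cumprod[t]) ** 0.5  # sqrt(1 - abar_t)
    x0 = (x_t - s * eps) / a              # recover x0 from x_t and eps
    return a * eps - s * x0
```

The inverse identity `eps = sqrt(abar_t) * v + sqrt(1 - abar_t) * x_t` shows the two targets are interchangeable on paper; the hard part described above is that a network trained with one objective carries a learned bias that does not transfer to the other.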

2. Testing the hypothesis: Back to EPS training.

Applied to AstolfoKarmix (AK-Evo 2EP).

  • Full gallery. It is actually boring.

  • (Another) 779k steps in total. 

  • Furry / realistic images regained; generally outperforms the current AC.

  • It scales up happily with highres fix (approx. 1024x1024 x2.0), outperforming AstolfoMix SD1.5. It can max out the VRAM on a 3090.


Then applied to AstolfoCarmix (AC-Evo 2.5EP)

  • Full gallery.

  • The training process was highly uncertain.

  • Since dataset alignment had been achieved (even though it was just 1EP), with careful monitoring, the model survived again.

  • However, an accident happened, and I have another model to release. See the next section.

  • Since I don't have the "1girl dataset", I went 57EP on the Astolfo 6k dataset instead. It is still fine, since he is still a (cute) character.

  • However, non-human content won't appear reliably (it is not completely absent: where there is a correlation, it is slightly learnt).


3. Accident happened: suffering the "non-0 null uncond" issue.

Applied to AstolfoVL (2.5EP).

  • The accident was caused by the "non 0 null uncond".

  • The CFG paper expects an "empty set as uncond"; A1111 and reForge just take a shortcut by default (encoding an empty prompt). Meanwhile, ComfyUI takes no shortcut and expects ConditioningZeroOut.

  • Most models (11k out of 11k, really) don't care, since any legit finetuning will eventually drive the "null uncond" toward "zero", as long as you dump in enough GPU hours. However, I just swapped the objective (EPS to VPred) with only 0.5EP (although that is already 1 month on 4x RTX 3090). Astolfo is different in this case.

  • Therefore, to remove the shortcut while keeping the "non-null uncond" close to the "null uncond" (as a "non-zero uncond"), I place a single "," as the negative prompt.

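The difference between the two uncond conventions reduces to plain CFG arithmetic. A toy sketch (not any UI's actual code; `denoiser` is a stand-in for the model, and the encoder call in the comments is hypothetical):

```python
import torch

def cfg_denoise(denoiser, x, t, cond, uncond, scale: float):
    """One classifier-free guidance step. `uncond` may be an encoded
    empty prompt (A1111 / reForge default) or an all-zero tensor
    (ComfyUI's ConditioningZeroOut) -- the two embeddings differ,
    so the guided output differs too."""
    out_c = denoiser(x, t, cond)
    out_u = denoiser(x, t, uncond)
    return out_u + scale * (out_c - out_u)

# "shortcut":    uncond = text_encoder("")        # non-zero embedding
# "no shortcut": uncond = torch.zeros_like(cond)  # zero embedding
```

A well-trained model maps both uncond variants to nearly the same output, so the gap is invisible; after only 0.5EP of objective swapping, the gap shows.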

Extra: Missing "vpred" and "ztsnr" layers

  • As stated in the announcement, no, I won't add the placeholder layers to make it look like some anime models. Astolfo is different.

  • I made the workflow just for the "non 0 null uncond", not for auto detection.
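For context, those "placeholder layers" are extra state-dict keys that some UIs probe to auto-detect the prediction type. A minimal sketch of what adding them would look like (the key names `v_pred` and `ztsnr` are my assumption from community convention, not taken from the author's workflow, and per the bullet above they are deliberately not added to this model):

```python
import torch

def add_detection_markers(state_dict: dict) -> dict:
    """Return a copy of a checkpoint state dict with the empty marker
    tensors some UIs look for. Key names are assumed from community
    convention, not from the author's workflow."""
    sd = dict(state_dict)
    sd["v_pred"] = torch.tensor([])  # marks v-prediction
    sd["ztsnr"] = torch.tensor([])   # marks zero-terminal-SNR
    return sd
```

UIs that rely on these markers will treat this model as EPS, which is why the separate "non-0 null uncond" workflow exists instead.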
