SDXL-Aleph - QWEN TE SDXL Flow Matching

As discussed, the prelims were successful.

https://huggingface.co/AbstractPhil/geolip-sdxl-aleph

With a catch: the CFG (classifier-free guidance) I trained into SD15-Flow-Lune doesn't work particularly well with SDXL yet.

Independent model weight code

https://github.com/AbstractEyes/geolip-sd-trainer

The code itself is here. It is a legitimate standalone PyTorch trainer that doesn't depend on diffusers, peft, or any other major library. It operates cleanly from here.

https://github.com/AbstractEyes/geolip-sd-trainer/blob/main/geolip_sd_trainer/model.py

Training loop included but not perfect.

https://github.com/AbstractEyes/geolip-sd-trainer/blob/main/geolip_sd_trainer/trainer.py

86,000-image QWEN dataset with prompts

Sourced synthetic-only data from QWEN, full dataset open source and free.

https://huggingface.co/datasets/AbstractPhil/sdxl-qwen-phase0

Everything cached here, full dataset open source and free.

https://huggingface.co/datasets/AbstractPhil/sdxl-qwen-phase1-cache

I anticipate less than 1% of full saturation with such a small dataset. However, the results will be obvious.

Epoch 5 and epoch 10 will be available for playing with. Compared to the earlier sourcing, which employed highly optimized bucketing and used many variants of data from many datasets, this one is a bit limited. However, it is cooking.

We feed QWEN outputs directly in place of the CLIP_L sequence and the CLIP_G pooled vector, scaled down to fit the CLIP_L behavioral slot.
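
As a rough illustration of where those slots sit, here is a minimal shape-only sketch in plain Python. The widths are the standard SDXL ones (768 for CLIP_L, 1280 for CLIP_G and its pooled vector, 2048 for the joined context). The function name `build_sdxl_conditioning` and the zero-filled placeholder features are hypothetical, and the sketch assumes the QWEN features have already been brought to the slot widths; how that scaling is actually done lives in the trainer, not here.

```python
from typing import List, Tuple

CLIP_L_WIDTH = 768   # SDXL text encoder 1 hidden size
CLIP_G_WIDTH = 1280  # SDXL text encoder 2 hidden size (also the pooled width)

Sequence = List[List[float]]


def build_sdxl_conditioning(
    l_slot: Sequence, clip_g_seq: Sequence, pooled: List[float]
) -> Tuple[Sequence, List[float]]:
    """Join the CLIP_L-slot sequence and the CLIP_G sequence per token.

    Hypothetical helper: l_slot and pooled are assumed to hold QWEN features
    already resized to the CLIP_L and pooled widths.
    """
    if len(l_slot) != len(clip_g_seq):
        raise ValueError("both sequences must have the same number of tokens")
    if any(len(tok) != CLIP_L_WIDTH for tok in l_slot):
        raise ValueError("CLIP_L slot tokens must be 768 wide")
    if any(len(tok) != CLIP_G_WIDTH for tok in clip_g_seq):
        raise ValueError("CLIP_G tokens must be 1280 wide")
    if len(pooled) != CLIP_G_WIDTH:
        raise ValueError("pooled vector must be 1280 wide")
    # Per-token concatenation gives the 2048-wide cross-attention context.
    context = [a + b for a, b in zip(l_slot, clip_g_seq)]
    return context, pooled


tokens = 77  # standard CLIP sequence length
qwen_in_l_slot = [[0.0] * CLIP_L_WIDTH for _ in range(tokens)]  # placeholder
clip_g_seq = [[0.0] * CLIP_G_WIDTH for _ in range(tokens)]      # placeholder
qwen_pooled = [0.0] * CLIP_G_WIDTH                              # placeholder

context, pooled = build_sdxl_conditioning(qwen_in_l_slot, clip_g_seq, qwen_pooled)
```

The CLIP_G sequence stays in place in this sketch; only the CLIP_L slot and the pooled vector are swapped.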

This is NOT ENOUGH data by a large margin. It's a good pretrain startup, but not enough to saturate the model. It will take many more images.

Low CFG only for now, or none.

The display image there is CFG 1, meaning the model is capable at CFG 1 for now. The same process I used to train SD15 is being employed here, but it's going to take about a week to fully mature the model.
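
For reference, the standard classifier-free guidance combination is sketched below in plain Python. The function name `apply_cfg` and the toy vectors are illustrative and not taken from the trainer. The point is only that a scale of 1 returns the conditional prediction unchanged, so "CFG 1" means no guidance is applied.

```python
from typing import List


def apply_cfg(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions, made up for illustration only.
cond = [0.5, -1.0, 2.0]
uncond = [0.0, 0.25, 1.0]

no_guidance = apply_cfg(cond, uncond, 1.0)   # identical to cond
low_guidance = apply_cfg(cond, uncond, 2.0)  # pushed away from uncond
```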

SD15-Flow-Lune proven flow matching formula

The formula was proven through Lune, and the process was refined over multiple iterations to arrive at the current formula. The structure is sound, the numerics are battle-tested, and the alephs are in place to preserve the numeric skeleton of SDXL without complete collapse. The model is cooking.
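
As a point of reference, the generic rectified-flow form of flow matching can be sketched as follows. This is the textbook formulation, not necessarily the exact Lune recipe: the time convention (t = 0 is data, t = 1 is noise), the function names, and the toy vectors are all assumptions for illustration.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, noise: Vector, t: float) -> Vector:
    """Point on the straight path between data (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def velocity_target(x0: Vector, noise: Vector) -> Vector:
    """Regression target: the constant path velocity, noise - x0."""
    return [n - a for a, n in zip(x0, noise)]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector], noise: Vector, steps: int
) -> Vector:
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: a perfect velocity predictor walks the noise back to the data.
x0 = [1.0, -2.0]
noise = [0.5, 0.5]
x_half = interpolate(x0, noise, 0.5)
target = velocity_target(x0, noise)
recovered = euler_sample(lambda x, t: target, noise, steps=4)
```

In training, the model's prediction at `interpolate(x0, noise, t)` would be scored against `velocity_target(x0, noise)` with a loss such as `mse`.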

ETA 10 epochs - Saturday Night

It'll be ready to play with tonight. I'll get it formatted for ComfyUI use, but I doubt it'll be simple, so it probably won't be directly usable until Sunday, or potentially sometime during the week or next Friday. Implementing something as complex as Qwen 3.5 0.8b in ComfyUI is not as easy as simply importing the libraries elsewhere and hooking them together.

ETA ComfyUI - ASAP

Early priority solution. It will likely be ready by tonight, before the model even comes out, but it could take longer.

ETA 60 epochs - 1 week

Roughly. For the full 60 epochs.

NOT surge trained procrustes

The Alephs weren't fully ready for surge training, so I can't flood the model with Flux data yet. That will happen as soon as I solve the last edge-case formula exceptions that cause corruption at scale.

Aleph Tests show positive yield

I chose clip_l replacement with clip_g pooled replacement for stage 1.

The tests show that swapping the full clip_g simply can't be supported with QWEN 3.5 0.8b yet. However, I have a plan for implementing that in the near future, which will include a JSON interpretation of images as well as multi-prompt interpretation.

Three prompts, basically.

Prompt 1-3

Prompt1

  • Plain English, long prompt.

Prompt2

  • JSON translated plain English, short prompt

Prompt3

  • JSON translated plain English, extended detailed subject association prompt

The reasoning behind this split is straightforward: a simple prompt for the sequence, simple JSON for subject association, and a long pooled prompt for high-fidelity topology association.

This will pave the way for high fidelity long prompt sequence 2.

The final goal being: [512, 1024], [512, 1024], [1024]

Which will be a 512 sequence SDXL with direct rotary attention control.

JSON LoRAs for Qwen

https://huggingface.co/AbstractPhil/qwen3.5-0.8b-task_1-lora/blob/main/trainer.py

This is the trainer they were trained with, I think. I should really have made a readme.

https://huggingface.co/AbstractPhil/qwen3.5-0.8b-task_1-lora-v2-stage1

Here is the JSON LoRA that allows Qwen 3.5 0.8b to produce direct subject-association JSON from an image, or to summarize that JSON from raw text caption inputs.

https://huggingface.co/AbstractPhil/qwen3.5-0.8b-task_2-lora

Here is the LoRA that allows Qwen 3.5 0.8b to express output in plain, generic SYMBOLIC representation. For example, "fancy dinner party table" becomes [TABLE], and so on. Everything is represented generically through short symbolic prompting.

These make this possible

If I extract the last hidden states for final-token pooling and grab the full sequence representations as I go, I can get the same representations with 2-shot inference, as the current system does to refine the prompt, except with excessive amounts of JSON preparation. In this case we 2-shot base Qwen 3.5 0.8b, then sample again with our LoRAs applied for JSON actuation and preparation. For image learning we then feed the image through the ViT to get a proper prompt, and combine the three together, or potentially superimpose stronger behavior from one or the other.

In direct association with Florence, LLaVA, BLIP, animetimm ViTs, and other ViTs of this nature, this system can caption hundreds of thousands of images per day on a single RTX 6000 Pro, allowing for massive amounts of captioned images in a short period of time.
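
The final-token pooling step can be sketched in plain Python as below. This is a minimal illustration, assuming the last hidden states arrive as one vector per token alongside a 0/1 attention mask; the helper names and the toy values are hypothetical and do not come from the trainer.

```python
from typing import List, Tuple

Sequence = List[List[float]]


def last_token_pool(hidden: Sequence, attention_mask: List[int]) -> List[float]:
    """Return the hidden state of the last non-padded token."""
    if len(hidden) != len(attention_mask):
        raise ValueError("hidden and attention_mask must have the same length")
    real = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real:
        raise ValueError("attention_mask contains no real tokens")
    return list(hidden[real[-1]])


def sequence_and_pooled(
    hidden: Sequence, attention_mask: List[int]
) -> Tuple[Sequence, List[float]]:
    """Keep the unpadded per-token sequence and the pooled last-token vector."""
    sequence = [list(h) for h, m in zip(hidden, attention_mask) if m == 1]
    return sequence, last_token_pool(hidden, attention_mask)


# Toy last-layer states: three real tokens followed by one padding token.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]
sequence, pooled = sequence_and_pooled(hidden, mask)
```

Because the pool picks the last position where the mask is 1, the same helper works for left-padded and right-padded batches.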

Caching Optimizations

Qwen 3.5 0.8b supports high-accuracy caching for inference. The caches become invalid for the JSON LoRAs, but the baseline variation can handle caching. I have yet to fully understand this process, so I'll be reviewing the necessary components, discovering the formulas, and finding the best course of action to cache next-token prediction for smooth operation in ComfyUI.
