As discussed, the prelims were successful
https://huggingface.co/AbstractPhil/geolip-sdxl-aleph
With a catch: the CFG (classifier-free guidance) I trained into SD15-Flow-Lune doesn't work particularly well with SDXL yet.
Independent model weight code
https://github.com/AbstractEyes/geolip-sd-trainer
The code itself is here. It's a legitimate standalone PyTorch trainer that doesn't depend on diffusers, peft, or any other major library. It operates cleanly from here.
https://github.com/AbstractEyes/geolip-sd-trainer/blob/main/geolip_sd_trainer/model.py
Training loop included but not perfect.
https://github.com/AbstractEyes/geolip-sd-trainer/blob/main/geolip_sd_trainer/trainer.py
86,000-image QWEN dataset with prompts
Synthetic-only data sourced from QWEN. The full dataset is open source and free.
https://huggingface.co/datasets/AbstractPhil/sdxl-qwen-phase0
Everything is cached here, also open source and free.
https://huggingface.co/datasets/AbstractPhil/sdxl-qwen-phase1-cache
I anticipate less than 1% of full saturation with such a small dataset. However, the results will be obvious.
Epoch 5 and epoch 10 will be available to play with. Compared to the earlier sourcing, which employed highly optimized bucketing and used many variants of data from many datasets, this one is a bit limited. However, it is cooking.
We feed QWEN directly in place of the CLIP_L sequence and the CLIP_G pooled output, scaled down to fit the CLIP_L behavioral slot.
This is NOT ENOUGH data by a large margin. It's a good pretrain startup, but not enough to saturate the model. It will take many more images.
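To make the QWEN-for-CLIP swap a bit more concrete, here is a shape-only sketch of the stage 1 conditioning slots. The CLIP widths and the 77-token length are stock SDXL values. The Qwen width of 1024, the idea that the mapping is a simple projection, and every name in the code are assumptions for illustration, not something read out of the trainer.

```python
from dataclasses import dataclass

TOKENS = 77           # stock SDXL CLIP context length
CLIP_L_WIDTH = 768    # stock SDXL CLIP_L hidden width
CLIP_G_WIDTH = 1280   # stock SDXL CLIP_G hidden and pooled width
QWEN_WIDTH = 1024     # ASSUMPTION: Qwen hidden width, not confirmed by the post


@dataclass(frozen=True)
class Slot:
    name: str
    shape: tuple   # (tokens, width) for a sequence, (width,) for a pooled vector
    source: str    # encoder assumed to fill the slot in stage 1


def stage1_slots():
    """Conditioning slots as described for stage 1 (hypothetical layout)."""
    return [
        Slot("clip_l sequence", (TOKENS, CLIP_L_WIDTH), f"qwen, {QWEN_WIDTH} -> {CLIP_L_WIDTH}"),
        Slot("clip_g sequence", (TOKENS, CLIP_G_WIDTH), "clip_g, unchanged"),
        Slot("clip_g pooled", (CLIP_G_WIDTH,), f"qwen, {QWEN_WIDTH} -> {CLIP_G_WIDTH}"),
    ]


def context_shape(slots):
    """Concatenate the sequence slots along the feature axis."""
    sequences = [s.shape for s in slots if len(s.shape) == 2]
    token_counts = {tokens for tokens, _ in sequences}
    if len(token_counts) != 1:
        raise ValueError("sequence slots must share a token count")
    return (token_counts.pop(), sum(width for _, width in sequences))


slots = stage1_slots()
context = context_shape(slots)
for slot in slots:
    print(f"{slot.name:16s} {str(slot.shape):12s} <- {slot.source}")
print("cross-attention context:", context)
```

The point of the sketch is only that the UNet keeps seeing the same slot shapes it was trained on, while two of the three slots get a different source.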
Low CFG only for now, or none.
The display image there is CFG 1, meaning the model is capable at CFG 1 for now. The same process I used to train SD15 is being employed here, but it's going to take about a week to fully mature the model.
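For reference, the standard classifier-free guidance mix is sketched below in plain Python. At a scale of 1 it returns the conditional prediction unchanged, which is why CFG 1 amounts to no extra guidance. The function name and the toy values are illustrative and not taken from the trainer.

```python
def cfg_combine(uncond, cond, scale):
    """Standard CFG mix, element-wise: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy per-element predictions (hypothetical values, not model output).
uncond_pred = [0.0, 1.0, -2.0]
cond_pred = [1.0, 3.0, -1.0]

at_one = cfg_combine(uncond_pred, cond_pred, 1.0)    # identical to cond_pred
at_three = cfg_combine(uncond_pred, cond_pred, 3.0)  # pushed further away from uncond

print(at_one)
print(at_three)
```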
SD15-Flow-Lune proven flow matching formula
The formula was proven through Lune, and the process was refined over multiple iterations into its current form. The structure is sound, the numerics are battle-tested, and the alephs are in place to preserve the numeric skeleton of SDXL without complete collapse. The model is cooking.
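The exact Lune formula isn't spelled out here. Purely as a point of reference, a common flow matching setup (the linear, rectified-flow style one) looks like the sketch below. The direction of t, the noise distribution, and all names are assumptions, and nothing here covers the alephs.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its velocity.

    x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
latent = [0.5, -0.25, 1.0, 0.0]     # stand-in for a VAE latent
t = rng.random()                    # one uniformly sampled timestep

x_t, target = flow_matching_pair(noise, latent, t)
loss_if_perfect = mse(target, target)   # a model that predicts the target exactly
```

In a real loop the model would see x_t, t, and the text conditioning, and its output would take the place of the first argument to mse.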
ETA 10 epochs - Saturday Night
It'll be ready to play with tonight. I'll get it formatted for ComfyUI use, but I doubt that will be simple, so it probably won't be directly usable until Sunday, or potentially sometime during the week or next Friday. Implementing something as complex as Qwen 3.5 0.8b in ComfyUI isn't as easy as simply importing the libraries elsewhere and hooking it together.
ETA ComfyUI - ASAP
Early priority. It will likely be ready by tonight, before the model even comes out, but it could take longer.
ETA 60 epochs - 1 week
Roughly. For the full 60 epochs.
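As a rough sanity check on those ETAs, the arithmetic below works out what throughput a one-week, 60-epoch run implies. It assumes every epoch passes over all 86,000 images once and that throughput stays constant, neither of which is guaranteed.

```python
DATASET_IMAGES = 86_000
SECONDS_PER_WEEK = 7 * 24 * 3600


def samples_seen(epochs, images=DATASET_IMAGES):
    """Total image passes after the given number of epochs."""
    return epochs * images


full_run = samples_seen(60)                              # image passes in 60 epochs
required_rate = full_run / SECONDS_PER_WEEK              # images per second
hours_for_10 = samples_seen(10) / required_rate / 3600   # same rate, 10 epochs

print(f"{full_run:,} image passes, {required_rate:.2f} img/s, "
      f"{hours_for_10:.0f} h for 10 epochs")
```

So one week for 60 epochs implies about 8.5 images per second, and 10 epochs at that pace is a bit over a day.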
NOT surge trained procrustes
The Alephs weren't fully ready for surge training, so I can't flood the model with Flux data yet. I will as soon as I solve the last edge-case formula exceptions that cause corruption at scale.
Aleph Tests show positive yield
I chose clip_l replacement with clip_g pooled replacement for stage 1.
It shows that swapping the full clip_g simply can't be supported with QWEN 3.5 0.8b yet. However, I have a plan for implementing that in the near future, which will include a JSON interpretation of images as well as multi-prompt interpretation.
Three-prompt basically.
Prompt 1-3
Prompt1
Plain English, long prompt.
Prompt2
JSON translated plain English, short prompt
Prompt3
JSON translated plain English, extended detailed subject association prompt
The reasoning for this split is straightforward: a simple prompt for sequence, simple JSON for subject association, and a long pooled prompt for high-fidelity topology association.
This will pave the way for high fidelity long prompt sequence 2.
The final goal is [512, 1024], [512, 1024], [1024], which will be a 512-sequence SDXL with direct rotary attention control.
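Here is a small shape sketch of that goal layout next to stock SDXL. One observation, offered as a reading rather than a confirmed design: if the two [512, 1024] sequences were joined along the feature axis the way CLIP_L and CLIP_G are in stock SDXL, the context would come out 2048 wide, which is the width SDXL's cross-attention already expects. Which prompt feeds which shape is not stated, so the container below only holds the three prompts, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromptTriple:
    prompt1: str   # plain English, long prompt
    prompt2: str   # JSON translated plain English, short prompt
    prompt3: str   # JSON translated plain English, extended subject association


# Shapes quoted in the final goal: two sequences and one pooled vector.
GOAL_SHAPES = [(512, 1024), (512, 1024), (1024,)]


def concat_features(*shapes):
    """Shape after joining equal-length sequences along the feature axis."""
    if len({s[0] for s in shapes}) != 1:
        raise ValueError("sequences must share a token count")
    return (shapes[0][0], sum(s[1] for s in shapes))


goal_sequences = [s for s in GOAL_SHAPES if len(s) == 2]
goal_context = concat_features(*goal_sequences)          # ASSUMED feature-axis join
stock_context = concat_features((77, 768), (77, 1280))   # stock SDXL CLIP_L + CLIP_G
same_width = goal_context[1] == stock_context[1]

triple = PromptTriple(
    prompt1="placeholder long caption",
    prompt2="placeholder short prompt",
    prompt3="placeholder extended prompt",
)
print(goal_context, stock_context, same_width)
```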
JSON LoRAs for Qwen
https://huggingface.co/AbstractPhil/qwen3.5-0.8b-task_1-lora/blob/main/trainer.py
This is the trainer they were trained with, I think. I should really have made a readme.
https://huggingface.co/AbstractPhil/qwen3.5-0.8b-task_1-lora-v2-stage1
Here is the JSON LoRA that allows Qwen 3.5 0.8b to produce direct subject-association JSON from an image, or to summarize the JSON from pure raw text caption inputs.
https://huggingface.co/AbstractPhil/qwen3.5-0.8b-task_2-lora
Here is the LoRA that allows Qwen 3.5 0.8b to express output in plain, generic SYMBOLIC representation: "fancy dinner party table" becomes [TABLE], and so on. Everything is represented generically through short symbolic prompting.
These make this possible
If I extract the last hidden states for final-token pooling and grab the full sequence representations as I go, I can get the same representations with 2-shot inference, as the current system does to refine the prompt, except with excessive amounts of JSON preparation. In this case we 2-shot base 3.5 0.8b, then sample again with our LoRAs applied for JSON actuation and preparation. For image learning we then feed the image through the ViT to get a proper prompt and combine the three together, or potentially superimpose stronger behavior from one or the other.
In direct association with Florence, LLaVA, BLIP, animetimm ViTs, and other ViTs of this nature, this system can caption hundreds of thousands of images per day on a single RTX 6000 Pro, allowing for massive numbers of captioned images in a short period of time.
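The final-token pooling step can be sketched without any model at all. The toy below takes per-token hidden vectors and an attention mask and returns both the full sequence and the vector at the last real token. With the transformers library the per-token vectors would typically come from a forward pass with `output_hidden_states=True`. The function names and toy numbers are made up for illustration.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return the hidden vector at the last attended position.

    hidden_states: list of per-token vectors, shape [seq, dim].
    attention_mask: list of 0/1 flags, 1 for real tokens.
    Searches for the last 1, so it handles left or right padding.
    """
    attended = [i for i, m in enumerate(attention_mask) if m == 1]
    if not attended:
        raise ValueError("attention mask has no real tokens")
    return hidden_states[attended[-1]]


def sequence_and_pooled(hidden_states, attention_mask):
    """Keep the unpadded sequence and the final-token pooled vector together."""
    sequence = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    pooled = last_token_pool(hidden_states, attention_mask)
    return sequence, pooled


# Toy last-layer states for a 3-token prompt padded to length 4.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
mask = [1, 1, 1, 0]

sequence, pooled = sequence_and_pooled(hidden, mask)
print(len(sequence), pooled)
```

Running the same helper on each pass (base, JSON LoRA, image caption) would give one sequence and one pooled vector per prompt, ready to combine or weight.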
Caching Optimizations
Qwen 3.5 0.8b supports high-accuracy caching for inference. The caches become invalid for the JSON LoRAs; however, the baseline variation can handle caching. I have yet to fully understand this process, so I'll be reviewing the necessary components, working out the formulas, and finding the best way to cache next-token prediction for smooth operation in ComfyUI.
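One simple way to keep base-model caches from leaking into LoRA passes is to key the cache by adapter as well as by token prefix. A LoRA changes the weights that produce the cached keys and values, so entries computed by the base model don't match what the adapted model would compute. The sketch below is a toy with a fake prefill, not Qwen's or ComfyUI's caching API, and every name in it is hypothetical.

```python
class PrefixCache:
    """Toy prefix cache keyed by (adapter name, token prefix)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, adapter, tokens, compute):
        key = (adapter, tuple(tokens))
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(adapter, tokens)
        return self._store[key]


def fake_prefill(adapter, tokens):
    """Stand-in for a real forward pass that would return key/value tensors."""
    return f"kv[{adapter}]:{len(tokens)} tokens"


cache = PrefixCache()
shared_prefix = [101, 7, 42, 9]   # hypothetical token ids

cache.get_or_compute("base", shared_prefix, fake_prefill)        # miss, computed
cache.get_or_compute("base", shared_prefix, fake_prefill)        # hit, reused
cache.get_or_compute("json_lora", shared_prefix, fake_prefill)   # miss, separate entry

print(cache.hits, cache.misses)
```

The base entry stays valid and reusable, and the LoRA pass never sees it.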


