Preface
I will need to provide additional resources, information, training data, and more to make this determination a repeatable process. For now, this is the start of a larger article and paper.
I have included the resources I used and links to datasets at the bottom, in the credits. I will be slowly adding papers and links to model weights over the coming days as I edit the paper.
Hypothesis
Timestep cross-dimensional contamination divergence has been trained into my CLIP_L, and the same divergence is beginning to show in my new CLIP_G guidance model.
This behavior can be taught, in many different forms, to many AI models; it provides a uniform utility, built on cos/sin-based mathematics, that enables rapid training on minimal hardware through intentional cross contamination.
I also hypothesize that training large models such as T5xxl under very specific guidelines, with frozen CLIP_L/CLIP_G as teachers and non-bucketed images for gradient classification and caption vector association, will yield results similar to what happened with CLIP_L and CLIP_G, using minimal finetuning on low-end hardware compared to what was used to train T5xxl originally.
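As a rough illustration of the teacher/student idea (not a finished recipe; the module names, the projections, and the plain cosine objective here are assumptions of mine), the setup amounts to something like this:

```python
import torch
import torch.nn.functional as F

# Sketch only: frozen CLIP_L / CLIP_G act as teachers while a student text encoder
# (e.g. a T5 variant) learns to align with them. proj_l / proj_g are small trainable
# linear layers mapping the student's width into each teacher's embedding width.
def alignment_loss(student_emb, teacher_l_emb, teacher_g_emb, proj_l, proj_g):
    s_l = F.normalize(proj_l(student_emb), dim=-1)      # student projected into CLIP_L space
    s_g = F.normalize(proj_g(student_emb), dim=-1)      # student projected into CLIP_G space
    t_l = F.normalize(teacher_l_emb.detach(), dim=-1)   # teachers stay frozen
    t_g = F.normalize(teacher_g_emb.detach(), dim=-1)
    # 1 - cosine similarity against each teacher, averaged over the batch
    return ((1 - (s_l * t_l).sum(-1)) + (1 - (s_g * t_g).sum(-1))).mean() / 2
```

The student's pooled embedding gets projected into each frozen teacher's space; whether this kind of alignment alone reproduces the divergence is exactly what the hypothesis still needs to test.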
Initial thoughts
I sat there watching image after image generate, over and over. I noticed a few patterns emerging but nothing too major with my Illustrious, SDXL, and early Pony trains. Just the basic linear generation patterns... the constant.
Flux showed up, and I was immediately interested when I managed to turn a banana into a house full of monkeys.
Once I got into training Flux, I tried to tackle the full beast and I dumped 10,000 images into it. At the time, this was a lot of images to me. I had no idea how few it really would end up being in the long run though.
After many, many failed trainings (nearly $5,000 worth; I should have bought an L40), I discovered a series of settings that worked and settings that did not. Higher learning rates tended to result in deformities. Lower learning rates tended to result in floating body parts, objects, excess limbs, tentacles, and strange backgrounds or distortions... not to mention some incredibly disturbing NSFW imagery that I will never share (see my article about NSFW training for more information on this topic), but I digress.
The day that I started the Flux Shift training, my first version of F1D Sim trained using the Flux1D2 pro model, I noticed... some very strange things happening. Images would jump or jolt around, moving limbs to new places regularly in a seemingly deterministic way to hide certain details. The system would often force things into places I didn't expect them to be. There were many smaller things as well; hundreds or even thousands of small markers started to form patterns that led me to the idea of offset depiction and the initial discovery of THE RULE OF 3.
However... the way training systems like Kohya train these models and iterate on what they know involves a series of gradient weight updates based on other gradient checks, which at the time were completely outside my scope and knowledge.
Given some legitimate time and training, I came to a realization while training the CLIP_L and CLIP_G Omega models: there is actually a layer of underlying weights tied directly to diffusion, but they weren't trained for this. They were trained to identify images.
The outcome has been a very slow burn, something I only noticed yesterday after all the testing, and something very different from my original goal: I specifically took note of things appearing in HunYuan that SHOULD NOT exist, but which manifested because the CLIP was so stubborn about them.
I don't know when it happened, and I don't know exactly why, but I think I know HOW it emerged, and the process to recreate this timestep-based concept in other models.
I consistently trained this CLIP_L with the same optimizer settings, the same very low learn rates, the same training patterns, no dropout, and as a rival to the T5.
I have fed the thing nearly 33 million samples so far, give or take, and there will be a few versions showing up rather than just one in the future, once I finetune them specifically with the training parameters I'm developing now.
This timestep divergence may have existed before, or it may only be manifesting now; I don't know. There is very little documentation on this topic, official or unofficial, and while the papers describe many of the traits, the models, the weights, and the classification purposes, very little training has been done on this seemingly emergent behavior in these versions of CLIP.
It's not always reliable, and it's not always consistent; but it manifests more and more strongly the more I train it using Flux, while loss fidelity has only improved when mass-training SDXL with this CLIP_L.
The recent tests with the HunYuan base model show it can produce many things it should not be able to, including character positioning on the screen based on timestep positioning, screen control for objects, manifested genitalia where none should be (this is trained unchained), gradient quality increases and decreases for aesthetic tags, and even entire scene offset tags based on the situation and context; something the core model fails to accomplish WITHOUT this CLIP_L.
You can DIRECTLY line up the causal outcomes from the CLIP_L's responses with the SDXL training before it, the Flux Schnell training before that, and the Flux 1Dev training before that.
Process of determination
Experiments:
Flux 1D SimV4 -> the core divergent CLIP_L
6 million samples -> 5 million of which were frozen-UNet, standalone CLIP_L training samples
Trained at a token count of 225 in KOHYA_SS for the majority of its diffusion training.
Flux began manifesting strange quirks in the CLIP_L very early. At times it impacted training in very negative ways, and more often it was completely overridden by the Dev core guidance, which was likely defined during pre-training or in code as a loose or small gradient.
Smaller gradients tend to be ignored, and the majority of the Flux1D distillation is heavily baked into a form that is SUPPOSED to resemble the parent model, but this obviously doesn't always manifest in the outcome.
Enough training managed to make it cohesive and cooperative, but it took a large amount of training and time at a very low cook rate; otherwise the entire model would shatter like a glass floor under a stampeding rhino.
Before the release of V4 I had trained it under wildcard settings against the frozen and highly trained UNET. I used this as a test for my cheesechaser wildcard system, which is currently broken, but it trained insanely fast on those H100s. The outcomes manifested in wild ways and I really don't know what I taught it. There was a tag blacklist, an image size checker, and an aesthetic measurer; it's nothing compared to what I use today, but it worked.
https://civitai.com/models/950382/simulacrum-v4-lessordeltaorgreater-f1dddf1d2unetclipl
Flux 1S
I took the 1D-refined SimV4 CLIP and used it as the foundation for this model. I noticed the base model responds differently to it, and the more complex the prompts, the more it would diverge, so I had high hopes for its utility.
A 5,000,000-sample training run performed in a few days using H100s on RunPod.
Not a particularly big dataset, maybe 75,000 images or so; I don't remember exactly, but I have them stored. They were mostly identified using about 6 or 7 AIs and given short captions at the start of each tag file.
I took the trained CLIP_L that I had created through making FluxD Simulacrum, the one with about... I don't remember, 6 million samples or something, and I used that as the foundation CLIP_L for my Flux 1S training.
The outcomes showed rapid divergence, at a much, much faster rate than training Flux 1D; so much so that I'm trying to tone it down in the next version.
Schnell is highly responsive to training when using my CLIP; it doesn't need any specific tuning or attachments, just the T5 attention mask enabled, a low cook rate, and a bunch of images. Make sure the Euler shift is set to 3 and the guidance is set to 0.
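The T5 attention mask and the low cook rate are training-side options in your trainer; the inference side of those settings (Euler shift 3, guidance 0) looks roughly like this when run through diffusers. The model ID is a placeholder and swapping in my CLIP_L depends on your tooling, so treat this as a sketch rather than a recipe:

```python
import torch
from diffusers import FluxPipeline, FlowMatchEulerDiscreteScheduler

# Placeholder model ID; substitute the Simulacrum Schnell checkpoint and CLIP_L you downloaded.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
).to("cuda")

# Euler shift set to 3, as described above.
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, shift=3.0
)

image = pipe(
    prompt="1girl, looking at viewer, masterpiece",
    guidance_scale=0.0,        # embedded guidance set to 0 for Schnell
    num_inference_steps=8,
).images[0]
image.save("schnell_sim_test.png")
```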
https://civitai.com/models/1136727/v129-e8-simulacrum-schnell-model-zoo
This CLIP seems to fully accept booru tags and quality tags, and at this point it showed glimmers of what could be, screen control, which led to another experiment.
It MUST be used with CFG, which makes it quite slow due to the sheer size of Flux, which in turn makes it considerably less popular... But it's most definitely highly potent, very powerful, and produces basically anything you'd want to see... if you have powerful hardware. Otherwise it's going to take a long time, and you may not get much out of v129.
SDXL
There was enough evidence to form this hypothesis: the timestep divergence could be manually trained using another diffusion model, and the outcome could be controlled in a much more careful way.
Every one of the 300,000 images used is fully tagged with grid, zone, size, offsets, and everything else I could identify with DeepGHS's software. About 100,000 were given full captions from a combination of T5blip and JoyCaption Alpha One, with the majority of images having shorter captions from the plain-English Quora paraphraser, using a caption accuracy of 80% or higher to solidify the linkers. The paraphraser is very fast but not very smart; it combines what exists in the tags as phrases by default and then rephrases it into something cohesive.
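To make the grid/zone/size/offset tagging concrete, this is the kind of mapping from a detection box to positional tags; the tag names below are illustrative placeholders, not the literal vocabulary in the dataset:

```python
def position_tags(bbox, img_w, img_h, grid=3):
    """Turn a detection bbox (x0, y0, x1, y1) into illustrative grid/zone/size/offset tags."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2            # box center
    col = min(int(cx / img_w * grid), grid - 1)       # 0..grid-1 column of the grid
    row = min(int(cy / img_h * grid), grid - 1)       # 0..grid-1 row of the grid
    area = (x1 - x0) * (y1 - y0) / (img_w * img_h)    # relative size of the box
    size = "large" if area > 0.5 else "medium" if area > 0.15 else "small"
    return [
        f"zone_r{row}c{col}",                         # which cell of the 3x3 grid
        f"size_{size}",
        f"offset_x_{cx / img_w:.2f}",                 # horizontal center as a fraction
        f"offset_y_{cy / img_h:.2f}",                 # vertical center as a fraction
    ]

# Example: a face box on a 1024x1024 image
print(position_tags((100, 80, 420, 400), 1024, 1024))
# -> ['zone_r0c0', 'size_small', 'offset_x_0.25', 'offset_y_0.23']
```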
Each stage of the training was given specific timesteps based on the cos/sin dot normalization difference from the tested samples on Flux 1S using the exact same prompts. I'm still uncertain if this actually had an effect, so I'll need to replicate this experiment and provide a process for this one.
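For clarity, the "cos/sin dot normalization difference" I keep referring to boils down to a normalized dot product (cosine similarity) between embeddings sampled from the two models at each timestep, with the most divergent timesteps picked out. A minimal sketch, with placeholder tensor names:

```python
import torch
import torch.nn.functional as F

def divergent_timesteps(emb_a, emb_b, top_k=8):
    """emb_a, emb_b: [num_timesteps, dim] embeddings sampled from two models
    with the exact same prompts. Returns the timesteps where they disagree most."""
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)   # normalized dot product per timestep
    divergence = 1.0 - cos                            # 0 = identical, 2 = opposite
    return torch.topk(divergence, k=top_k).indices    # most divergent timestep indices

# Example with random placeholder data: 1000 timesteps, 768-dim embeddings
a, b = torch.randn(1000, 768), torch.randn(1000, 768)
print(divergent_timesteps(a, b, top_k=5))
```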
As of the latest SDXL-Simulacrum V2 full, both CLIP_L and CLIP_G have shown a large gradient of divergence; and yet the CLIP_L still stands strong through the storm.
I chose the timesteps for this one very specifically, and mapped each beginning point to its endpoint; there is a list of the used and available timesteps in the related article.
https://civitai.com/articles/9954/sdxl-sim-v2-full-vs-schnell-sim-v129-the-50mil-sdxl-vs-50mil-flux
Flux 1S - 2 electric boogaloo
Using the 128-dim, 128-alpha LoRA trained specifically for the original 8-bit Flux 1S, I upgraded the quality to bf16 to guarantee increased fidelity with the diffusion model and provide more accuracy for the gradient shifts in the CLIP_L.
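For reference, folding a dim-128 / alpha-128 LoRA into bf16 base weights is just the usual low-rank update; matching the LoRA keys to the base checkpoint's keys is trainer-specific, so this sketch only shows the per-layer step:

```python
import torch

def merge_lora_layer(base_weight, lora_down, lora_up, alpha):
    """Fold one LoRA layer into its base weight at full strength.
    base_weight: [out, in], lora_down: [dim, in], lora_up: [out, dim]."""
    dim = lora_down.shape[0]                      # 128 here
    scale = alpha / dim                           # 128 / 128 = 1.0
    delta = lora_up.float() @ lora_down.float()   # low-rank update, shape [out, in]
    return (base_weight.float() + scale * delta).to(torch.bfloat16)

# Toy example: a 3072x3072 linear layer with a dim-128 LoRA attached
w = torch.randn(3072, 3072, dtype=torch.float16)
down, up = torch.randn(128, 3072), torch.randn(3072, 128)
print(merge_lora_layer(w, down, up, alpha=128.0).dtype)  # torch.bfloat16
```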
Currently it's cooking, and its CLIP_L is now the end result of the SDXL experiment. I took the exact same fully tagged, 80%+-accuracy-captioned 300k-image dataset and am currently feeding it into the model.
As it stands, in the current setup, with my current hardware; it'll be done in 80 hours.
So far the divergence is showing even more, and Schnell is responding far more cohesively than before.
Divergences from v129:
No longer needs negative prompt for high fidelity images.
CFG fixed at 3.5 for fair accuracy.
Step counts are shrinking and becoming more accurate earlier.
Much more accuracy than SDXL with poses, offsets, size, and more.
The trained CLIP will be available alongside the LoRA release, as with the original. I will likely make a full merge this time as well, and maybe run a small finetune to improve fidelity before release, but I'd rather leave it raw for continued training.
The available weights for the clips and models will be listed here when the epoch is done.
https://civitai.com/models/1136727/v129-e8-simulacrum-schnell-model-zoo
HunYuan
I have begun HunYuan testing, and the outcome from the divergent CLIP_L is already showing some serious promise and potential with timestep controllers.
I'm currently planning an experiment in my head and identifying potential timestep utilities for integrating AnimateDiff controllers and methodologies into HunYuan to provide additional controls and more; specifically, treating the CLIP_L's screen control and timestep divergence as a catalyst to make this possible.
Early HunYuan experiments show:
Omega4 CLIP_L performs better under lower guidance rather than higher guidance.
High model shift yields jitter with the CLIP_L in really interesting ways.
This CLIP_L knows many details that the original CLIP_L lacks by default, while it also lacks many that the original CLIP_L knew.
It can definitely handle longer frame counts than I expected; I was making 30-second clips, but it took a while.
It can handle very small videos with fair accuracy: 128x128, 256x256, and so on.
Potentially USEFUL Utilities yielded from the experiments so far
COS/SIN dimensional cross contamination guidance
Currently I'm working out a formula that can pass cross contamination routes as defined routes internal to a model's weights, accessing those routes only when certain vectorized rules are met.
This lays the groundwork for a single principle: the formation and control of internal structures within pre-defined AI model structures, intentionally diverged through bypass shunts and formed through self-learned behaviors as a model is trained.
It's akin to applying grafts to reinforce behavior control within a full structure of pre-defined neurons with a larger behavior: essentially, forcefully merging neurons and snipping the possibility of using others, through teacher/student learning.
There are a few concepts in AI similar to this that make me think it'll work.
There are a few potential model weights that could accomplish this goal. I'll be looking into and researching them in the coming days to figure out practical applications for this.
The end result should essentially be the capability to fully wrap a core AI model with something intensely lightweight, with only a small number of neurons, fully capable of introducing high-fidelity, high-accuracy information in locations of inaccuracy. Like a LoRA hat, but smaller; it only activates like a text encoder, but behaves similarly to a LoCon in a lot of ways.
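A bare-bones PyTorch sketch of what I mean by a gated shunt, everything here being a speculative illustration rather than the finished formula: a tiny adapter that only fires when the incoming embedding matches a learned key closely enough (the "vectorized rule"), and otherwise lets the base signal pass through untouched.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedShunt(nn.Module):
    """Tiny bypass adapter: activates only where the input matches a learned key
    (a cosine-based rule); otherwise the base signal passes through unchanged."""
    def __init__(self, dim, hidden=64, threshold=0.3):
        super().__init__()
        self.key = nn.Parameter(torch.randn(dim))        # learned routing key
        self.down = nn.Linear(dim, hidden, bias=False)   # lightweight shunt body
        self.up = nn.Linear(hidden, dim, bias=False)
        self.threshold = threshold

    def forward(self, x):                                # x: [batch, tokens, dim]
        gate = F.cosine_similarity(x, self.key.expand_as(x), dim=-1)
        mask = (gate > self.threshold).unsqueeze(-1).float()   # 1 where the rule is met
        return x + mask * self.up(torch.tanh(self.down(x)))    # graft fires only when routed

# Wrapping a frozen base embedding: only matching tokens receive the grafted correction.
shunt = GatedShunt(dim=768)
tokens = torch.randn(1, 77, 768)
out = shunt(tokens)
```

The point of the gate is that the graft contributes nothing until its rule is met, so the wrapped model behaves exactly like the original everywhere else.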
Shunts Under Rapid Generation Endpoints
S.U.R.G.E.
This one is a mothballed concept from before that seems to have some new applications.
Originally, introducing an increased learning rate based on the tests/outcomes proved stubbornly impractical. Modifying the classification/generation/outcome code was not simple, and the outcome was unreliable. Increasing and decreasing the learning rate could already be handled entirely by the scheduler.
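For context, "handled by the scheduler already" just means the standard metric-driven schedulers in plain PyTorch, along the lines of the snippet below; this is the baseline S.U.R.G.E. was competing with, not the mechanism itself.

```python
import torch

model = torch.nn.Linear(768, 768)                      # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Lower the LR when the monitored metric (e.g. validation loss) stops improving.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode="min", factor=0.5, patience=3)

for epoch in range(10):
    val_loss = torch.rand(1).item()                    # placeholder for a real eval pass
    sched.step(val_loss)                               # scheduler decides whether to cut the LR
```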
Using cross contamination guidance, the potential for rapid learning improves. Less data produces more of the outcome with zero contamination until activated.
Potential Uses
Rapid introduction of new context information into pre-existing structures.
Rapid removal of unwanted behaviors and details with minimal training time.
Considerably less hardware required to train.
Potential Downsides
More than likely it will not fit within the LoRA guidelines, so it will need a new way to be inferenced in many models, different from a LoRA, which will make it impractical and inconvenient to test immediately.
Will require more proofs and formulas to test the methodology.
Credits and Resources:
Tensorflow Link
A special thanks to everyone at DeepGHS for all their hard work and effort in organizing and preparing tools and AI, and keeping datasets orderly and organized.
Flux1D / Flux1S Link
SDXL 1.0 Link
OpenClip trainer Link
Kohya SS GUI /// SD-Scripts
Images sourced from or by
Captioning software (partially prepared for release) using:
ImgUtils Link
Bounding Boxes
BooruS11
BooruPP
People
Faces
Eyes
Heads
HalfBody
Hands
Nude
Text
TextOCR
Hagrid
Censored
DepthMidas
SegmentAnything YoloV8
Classification
Aesthetic
AI-Detection
NSFW Detector
Monochrome Checker
Greyscale Checker
Real or Anime
Anime Style or Age -> year based
Truncated
Hagrid Link
MiDaS Link
Wd14 Link
Wd14 Large Link
MLBooru Link
Captioning


