
SDXL Sim-V2Full UltraRes αβγδ [SFW/NSFW]

Updated: Feb 20, 2025
Verified: SafeTensor
Type: Checkpoint Trained
Stats: 62 Reviews
Published: Feb 1, 2025
Base Model: SDXL 1.0
Training: Steps: 9,375,000; Epochs: 75
Hash (AutoV2): 709D769982

Watch as the model shifts... into your desired depiction.

This is an ongoing experiment to provide more control to image generation.

Using the grid helper LoRA amplifies screen and depiction control at lower epochs, and enables a range of grid and spritesheet capabilities. At higher epochs it enables stronger screen control at the cost of quality and context.

SDXL-Simulacrum Full V2 αβγδ release 1/31/2025 - 5:00 PM

I dub this model LOW IQ SDXL FLUX.

  • α version: roughly 50,000 images burned across samples 0-2 million.

  • β version: roughly 75,000 images burned across samples 2-5 million.

  • γ version: roughly 150,000 images burned across samples 5-7.5 million.

  • δ version: roughly 300,000 images burned across samples 7.5-10 million.

I have a more accurate list of the trainings used below.

The outcomes seem to favor much higher resolutions over lower ones, so don't spare the rod.

The Full V2 release is highly complex, and it's difficult to describe how it works in simple terms; however, I'll sum up the entire model in one very simple instruction:

Use PLAIN ENGLISH in a structure that makes sense.

This model builds what you want, in sentence sequence and a semi-logical booru-style flowchart.

Plain-English captions are based on SentencePiece. Most LLMs, including T5, were trained unsupervised on SentencePiece tokenization. The foundation and methodology behind the captioning process is inspired entirely by LLMs and their structures. These structures are conjoined with vision-based classifiers, bbox identifiers, and an interpolation between various forms of identifiers using depth analysis. If a caption wasn't generated BY a SentencePiece model, it was generated with that outcome in mind.
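
To make that concrete: here is how a SentencePiece-based tokenizer (T5's, via the transformers library; the caption itself is just an illustration) breaks a plain-English caption into pieces. This is a minimal sketch, not part of the captioning pipeline.

    # Illustrative sketch: SentencePiece tokenization of a plain-English caption,
    # using T5's tokenizer (requires the transformers and sentencepiece packages).
    from transformers import T5Tokenizer

    tok = T5Tokenizer.from_pretrained("t5-small")
    caption = "a woman in a red dress standing in a sunlit garden"
    print(tok.tokenize(caption))
    # e.g. ['▁a', '▁woman', '▁in', '▁a', '▁red', '▁dress', ...]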

For VERSION 3 I will expand the dataset to well over 2 million images, all captioned with both plain-English captions and depiction-offset-based tagging.

They WILL NOT be trained together; they will instead be trained as two separate cloned datasets, one dubbed the tags file and one dubbed the caption file.

One with booru-based tags and short captions (< 30 caption tokens), one with plain English (< 10 booru tags); mirrored sister datasets, trained on alternating timesteps.

The Booru tagging will be shuffled, and the English captions will be orderly.
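
A minimal sketch of that mirroring scheme, assuming a simple per-image record (the field layout is a placeholder, not the actual pipeline):

    # Sketch: build the two mirrored label files for one image record.
    # The tags copy is shuffled (booru tags are order-independent);
    # the caption copy keeps its English word order.
    import random

    def make_pair(tags, caption):
        shuffled = list(tags)                # clone so the source list is untouched
        random.shuffle(shuffled)             # shuffled booru tags -> "tags file"
        return ", ".join(shuffled), caption  # orderly English -> "caption file"

    tags_file, caption_file = make_pair(
        ["1girl", "red dress", "garden", "sunlight"],
        "a woman in a red dress standing in a sunlit garden",
    )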

GENERATING IMAGES

  • ComfyUI is the only generator with intricate enough timestep control for IMG2IMG and TXT2IMG.

    • The ironic part is that the timesteps ARE imperfect, but they're pretty close.

    • I have released two starter timestep context mover ComfyUI WORKFLOWS, with starter timestep usage and double prompts meant for CLIP_L and CLIP_G.

    • This is NOT your run-of-the-mill SDXL: you will not get the same results, and it will produce off-putting and sometimes disturbing outputs if you stray too far from the timestep guidelines, especially when you ask for twisted shit.

  • If you want the FULL EXPERIENCE of this model, you MUST use ComfyUI and play with timesteps.

    • I have listed a semi-accurate list of trainings below, based on the trained timesteps. The math I used to determine these timesteps is SIMILAR to, but not fully accurate to, the Flux Shift that the CLIP_L was originally finetuned to cooperate with when training Flux; but it will do in a pinch. A rough code approximation of the timestep-splitting idea follows this list.

  • Forge works, but not as well.

  • I MADE SURE it looks good on Forge, so you CAN use Forge; but the context suffers, as the CLIP_L and CLIP_G have intentionally different behavior.
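
For readers who want the gist outside ComfyUI, here is a rough approximation of the timestep-split, dual-prompt idea using diffusers. This is a sketch, NOT the released workflows; the checkpoint filename, the prompts, and the 0.3 split point are placeholder assumptions.

    # Sketch: split the denoising schedule between two prompts, with separate
    # CLIP_L (prompt) and CLIP_G (prompt_2) text. Placeholder paths and values.
    import torch
    from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

    base = StableDiffusionXLPipeline.from_single_file(
        "simulacrum_v2_full.safetensors", torch_dtype=torch.float16  # placeholder path
    ).to("cuda")

    # First slice of the schedule: composition.
    latents = base(
        prompt="a knight standing in a ruined cathedral",         # CLIP_L
        prompt_2="oil painting, dramatic lighting, masterpiece",  # CLIP_G
        num_inference_steps=50,
        guidance_scale=6.5,
        denoising_end=0.3,        # hand off 30% of the way through
        output_type="latent",
    ).images

    # Remaining timesteps: a second prompt picks up where the first stopped.
    refine = StableDiffusionXLImg2ImgPipeline(**base.components)
    image = refine(
        prompt="ornate armor, stained glass light, highres",
        image=latents,
        num_inference_steps=50,
        guidance_scale=6.5,
        denoising_start=0.3,
    ).images[0]
    image.save("out.png")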

TLDR Generation Settings:

DPM++ 2M SDE -> Beta / Karras

CFG 6.5-7.5 -> 6.5 is my favorite

Steps 12-100 -> I mostly use 50; low step counts work.

Sizes -> Too many to list.

The RULE OF 3 is the foundational principle of this model. All of the captions are built around this concept, so the rule of 3 functions similarly to Flux. Stick to the rule of 3 and you will be okay; stray too far and you're gonna have a bad time. You can reinforce it by supplementing grid, zone, depiction, and size tags, plus the identifiers specifically associated with those.

Describe what you want to see in plain English: give it styles, give it artists, give it characters, give it clothes, and send it to the machine. Out comes your image with combined styles, artistic styles overlaid, and the characters imposed in those settings. You can give it grids, offsets, angles, and so on. It will probably understand what you want. An illustrative example follows.

Negative anything you don't want to see, in sequence from most important to least important.
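
For example (an illustrative prompt pair made up for this explanation, not an official sample):

    Positive: a knight in ornate silver armor standing in a ruined cathedral,
    stained glass light, oil painting style, masterpiece, highres
    Negative: deformed anatomy, extra limbs, blurry, watermark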

Consult the tag document for the specifically trained and important attention-shifting tags.

Be WARY and VERY careful what you type into this thing.

It's essentially a dumb Flux. It gives you what you want, and sometimes you get the monkey's paw reward along with it.

It builds IN SEQUENCE.

  • Everything you put earlier in the prompt takes precedence over everything you put after it. Some tags come with baggage, some tags do not.

  • Using PLAIN ENGLISH has a very powerful effect, designed specifically to allow EASE OF ACCESS.

  • This does NOT always work yet, as it was one of those version 3 guideposts that wasn't met; however, it definitely has a very powerful effect.

  • It has VERY MINOR shuffle training between timesteps 4 and 8; everything else is entirely based on sequential access. For the next version I will be including MORE timestep trainings specifically designed to shift attention, with more images using shuffle training.

  • I have marked these timestep ranges:

    • 12-16

    • 22-24

    • 30-36

    • 41-50

    These are specifically allocated FOR THE NEXT VERSION for attention shifting, context finetuning, and high-fidelity inclusion of supplementary details in sequence; aka shuffle training and quality-boosting training steps. ANYTHING OVERLAPPING will not matter, as the data will supplement each other.

  • This has a VERY HIGH POTENCY EFFECT in ComfyUI when using timestep control, especially when using separate CLIP_L and CLIP_G prompts.

  • This cake's recipe was not a simple one. In fact, I'd say it's the most intricate and carefully plotted model I've ever made. It reflects the great achievements, including the successful experiments and new proofs for the community; but it also reflects some of the greatest failures, the most painful incorrect assumptions, and the most painful images I've ever seen.

  • For THIS VERSION (a sketch of what a timestep-window pass means in code follows this list):

    • 0-1000 full finetuned baseline -> full finetune, LoCoN full, LOHA full, Dreambooth, and LORA used.

      • CLIP_L trained, CLIP_G frozen.

      • 5,000,000 samples

      • 57k images; 1/3rd anime, 1/3rd realistic, 1/3rd 3d

        • grid -> did not take

        • hagrid -> did not take

        • pose -> took very well

        • human form -> took very well

        • ai generated -> took very well

    • 1-999 first iteration img2img training -> attention training half, dreambooth half

      • CLIP_G training enabled.

      • 200,000 samples

      • 51k images; pruned first pack, many fetishes and bad images removed

        • removed many hagrid images for blurring hands

          • many classifications removed entirely and will need recaptioning

        • removed all images labeled very displeasing in ai generated

    • 10-20 first pass shuffle -> attention training only -> LOKR training only, 5 versions with different settings.

      • Increased LR for CLIP_L and CLIP_G

      • 1,000,000 samples, no English captions

      • 75k images ->

        • mixed safe/questionable/explicit 3d dataset added

          • full pose angle set, full array of artists, full fetish set

        • ai generated removed entirely

    • 10-990 second pass shuffle -> full finetune, LOHA, LoCoN used.

      • Reduced LR for CLIP_L and CLIP_G

      • 150,000 samples, no English captions

      • 115k images

        • mixed safe/questionable/explicit/nsfw anime dataset added

        • hagrid removed entirely for re-planning for version 3.

    • 2-8 second pass English cohesion -> attention training only, heavy shift toward the goal.

      • High LR for CLIP_L and CLIP_G

      • 800,000 samples

      • 8k images specifically tailored for English descriptions and grid/offset/depth

        • Bucketing and cropping disabled; 1024x1024, 768x768, 1216x1216, 832x1216, 1216x832, 512x512

        • Grid training meant to function as a binding agent.

    • 8-992 third pass English cohesion low LR -> full finetune

      • Normal LR for CLIP_L and CLIP_G -> they have normalized

      • 800,000 samples

      • 140k images specifically tailored for English descriptions and context

      • Bucketing re-enabled.

    • 1-999 final pass burn -> full finetune, very low learn rate 1/10th original

      • CLIP_L and CLIP_G now cooperate rather than fight.

      • 2 million samples, very low learn rate, all captions, all tags

      • all images were included (including the previously omitted ones), except hagrid

      • trained the entire dataset in epochs, rather than in curriculum

      • Roughly 300k images used, give or take, I think.
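
The range prefixes above (0-1000, 10-20, 2-8, and so on) are the timestep windows each pass was restricted to. As a generic sketch of that idea, not the actual training configs, restricting a diffusion training step to a window looks roughly like this (the batch fields and scheduler object are placeholders):

    # Sketch: one diffusion training step restricted to a timestep window.
    # Sampling t only from [min_t, max_t] focuses a pass on one slice of the
    # denoising schedule (e.g. the "10-20" shuffle pass above).
    import torch
    import torch.nn.functional as F

    def training_step(unet, noise_scheduler, batch, min_t=10, max_t=20):
        latents = batch["latents"]          # placeholder: pre-encoded VAE latents
        cond = batch["text_embeddings"]     # placeholder: CLIP_L/CLIP_G conditioning
        noise = torch.randn_like(latents)
        # Restrict the sampled timesteps to this pass's window.
        t = torch.randint(min_t, max_t + 1, (latents.shape[0],), device=latents.device)
        noisy = noise_scheduler.add_noise(latents, noise, t)
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        return F.mse_loss(pred, noise)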

You MAY see some nsfw elements while prompting using the safe tag

  • Even with questionable/explicit/nsfw in the negative prompt; but it is currently fairly rare. If you see them, don't worry about them impacting the next version negatively: I have a full 1 million safe images lined up for the next version to make sure this DOES NOT happen unless the prompter WANTS to see such things.

  • Many female forms were trained specifically in the nude, with clothes imposed afterward thanks to the sequential learning pattern and timestepping. This may mean your preview sampler shows nudity, distortions, deformity, and more before the image cleans up.

  • Be warned that it may NOT clean up; but it generates pretty fast if you're using the single-pass ComfyUI workflow, so just hit the next seed if something doesn't work. There's a chance it will work; you just haven't hit the right seed yet.

Watching the images generate often looks like a slideshow

  • This is fully intentional. Some of these images can be disturbing, and I apologize if you see anything disturbing in this slideshow. The final pass did some damage to them, but not enough to fully blend them, so be very wary when prompting nsfw elements.

  • The next version will have a full finetune for the safe tag to ensure many of these elements are suppressed unless prompted, but for now please bear with the negative prompt.

Careful NSFW curation of prompts

  • Genitalia, distortions, objects, extra limbs, and more often appear. If you start seeing things like this solidify, you can use censored in the positive prompt, which is literally a depiction-offset tag designed for this exact purpose.

  • It WILL censor genitals and nipples. If they continue to show up, you can tell it EXACTLY where you want censoring:

    • grid_a3 censored nipple in the positive prompt. It'll get the idea, and the concept will bleed through the image if you don't use the size tag along with it.

    • nipple, nudity, nude in the negative.

    • It WILL go away.

  • SDXL has many horror movies built into its training. You can tell it was given the IMDB dataset, and this often hurts many images, or even introduces horrifying elements. The most annoying part I've found is trying to burn the ages out. I don't even know what sort of tagging they used, but it's not something I've managed yet.

  • If you see anything horror or age-based, negative "futanari, femboy, loli, shota, horror, monster, gross, blood, gore, saw, ghost, paranormal", and most of the artifacting from the IMDB horror and any training built into SDXL will go away.

    • Nothing I can do about this in this version, I've already tried a couple ways to burn it, and it just ends up hurting everything, so I'm going to need another solution.

    • I tried including false images in these tags and it only associated everything I trained with the other tags, into the horror sections; causing a massively terrifying version that I will never ever release.

      • Though I now know how to make cool Halloween LoRAs better, so that's cool.

    • I do so apologize for this, as I'm usually very careful at curating this sort of response, but this time I cannot control every element in SDXL yet. I require more research and more testing.

  • Negative genitals if they appear. The penis is a primary one that tends to show up; simply negative it and it goes away. It knows what it is. It also knows what the majority of condom-based things, sex toys, and so on are, so you can negative everything away if negative questionable, explicit, nsfw aren't doing the trick.

    • "penis, vagina, penetration, sex toy, dildo" and so on can be put in the negative prompt to nearly guarantee they won't show up; but they will if you prompt them in positive with negative, and there are some artists and styles with many images related to those. So be careful.

NSFW elements CAN BE TERRIFYING.

  • This version's nsfw prompting does NOT do well with complex plain English scenes yet, but it does work.

  • Keep your plain-English prompts short and stick to the booru and Sim tags. You will produce OKAY NSFW context results if that's the goal, but nothing to write home about yet in terms of cohesive fidelity.

  • You can have better luck adding a style or two, adding some artists, and so on; kinda nudging it towards what you want to see. If the artist is in there, it'll probably work. If not, you can try one of the stronger ones in the list.

  • If you're hoping for an easy porn maker, you will have some luck with simple prompts; but the more complex you get, or the more plain English you include, the more abomination-like your outcome will become.

Barebones:

  • ComfyUI Guidelines and Workflows

  • Full tag list and counts

  • Overcooked Portions

  • Undercooked Portions

  • Cache corruption time consumption

Accidentally published early. For those who managed to get it, please don't share it; but if you must, then go for it.

It'll be out at 5pm officially. -> ETA 11 hours

Since this version never quite made it to the markers for V3, I've decided to dub this the FULL version 2 release. It's hit as many markers as it will with this dataset, so I'll need to expand the dataset to nearly 3x or 4x the images to fill in the necessary missing information; we're looking at 1.5 to 3 million images, which is roughly a third of a big booru.

Getting that many images, with sections that can be identified and segmented, will involve sampling every single database I can find, including datasets like Fashion, IMDB, and anything I can get really. If I am to make this model SMART, then it needs to know what everything is and where that everything is, because it still needs a lot of data.

I'm going to begin hosting these fully tagged and prepared datasets on Hugging Face in tar/parquet format, so my custom cheesechaser can grab them for you if you want.
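
A sketch of what grabbing one of those shards could look like with plain huggingface_hub, once they're up (the repo id and filename below are hypothetical placeholders, not a published dataset):

    # Hypothetical example: repo_id and filename are placeholders.
    from huggingface_hub import hf_hub_download
    import pandas as pd

    path = hf_hub_download(
        repo_id="someuser/simulacrum-captions",  # placeholder repo
        filename="shard-0000.parquet",           # placeholder shard
        repo_type="dataset",
    )
    df = pd.read_parquet(path)
    print(df.columns.tolist())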

I will do the ole smudge-face thing for real people like I usually do, which is why some of them turn into anime characters, by the way. SDXL has a bunch built into it already, so it's clearly been taught the IMDB dataset, which means I know what I can negative-teach.

SDXL-SimulacrumV25β

Currently on epoch 65 -> 7.5 million samples, give or take.

The teasers show off the style and series bleeding, which is exactly as intended.

How many models were a pain in the ass to finetune BECAUSE something overwhelms something else, and that something else gets in the way? Well, not this one. EVERYTHING is directly easy to finetune, by design.

It's now hit 85/100 markers. I anticipate it being done by tomorrow or the day after.

Generation Recommendations:
DPM++ 2M SDE
-> Beta / Karras
-> Steps 14-50 -> 50
-> CFG 4.5-8.5 -> 6.5

DPM++ 2S Ancestral
-> Beta / Karras
-> Steps 32
-> CFG 5-8 -> 6

DPM++ 2M
-> Beta / Karras
-> Steps 20-40 -> 40
-> CFG 7 -> 7

Euler doesn't work very well.
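
If you're running through diffusers instead of a UI, the first recommendation maps roughly onto the following scheduler config (a sketch reusing the base pipeline object from the earlier diffusers example; sampler naming differs across UIs):

    # Approximate "DPM++ 2M SDE + Karras" in diffusers scheduler terms.
    from diffusers import DPMSolverMultistepScheduler

    base.scheduler = DPMSolverMultistepScheduler.from_config(
        base.scheduler.config,
        algorithm_type="sde-dpmsolver++",  # the SDE variant
        use_karras_sigmas=True,            # the Karras sigma schedule
    )
    image = base(prompt="...", num_inference_steps=50, guidance_scale=6.5).images[0]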

PROMPT BASICS HERE

<CAPTIONS HERE>

good aesthetic, very aesthetic, most aesthetic, masterpiece,
anime, 
<CHARACTERS HERE>

<ACTION CAPTIONS HERE>

<OFFSETS AND GRID GO HERE>

<CHARACTER TRAITS HERE>

highres, absurdres, newest, 2010s

Try not to breach 75 tokens for this version. The CLIP_L has been trained with 225-token prompts, but the encoders definitely aren't smart enough for that yet.
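
To check whether a prompt stays under the 75-token budget, here is a quick sketch with the stock CLIP_L tokenizer (openai/clip-vit-large-patch14 is the standard CLIP_L; any SDXL-compatible CLIP tokenizer gives the same count):

    # Count CLIP tokens in a prompt to stay under the 75-token budget.
    from transformers import CLIPTokenizer

    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    prompt = "good aesthetic, very aesthetic, most aesthetic, masterpiece, anime"
    ids = tok(prompt).input_ids
    print(len(ids) - 2, "tokens")  # minus the BOS/EOS tokens the tokenizer adds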

This line helps make most images better:

good aesthetic, very aesthetic, most aesthetic, masterpiece,

TLDR: Use this NEGATIVE PROMPT to get started.

lowres,
nsfw, explicit, questionable, 
displeasing, very displeasing, disgusting, 

text, size_f text, size_h text, size_q text,
censored, censor bar,
monochrome, greyscale, 
bad anatomy, ai-generated, ai generated, jewelry,

watermark, 
hand, 
blurry hand,
bad hands, missing digit, extra digit, 
extra arm, missing arm, 
convenient arm, convenient leg, 
arm over shoulder, 
synthetic_woman,

Barebones negative: use at your own peril.

lowres, 
displeasing, very displeasing, disgusting, 

text, 
monochrome, greyscale, comic, 
synthetic_woman,

Credits and Links:

  • A special thanks to everyone at DeepGHS for all their hard work and effort organizing and preparing tools and AI, and keeping datasets orderly.

  • Flux1D / Flux1S Link

  • SDXL 1.0 Link

  • OpenClip trainer Link

  • Kohya SS GUI /// SD-Scripts

  • Images sourced from or by

    • Cheesechaser Link

      • Safebooru

      • Gelbooru

      • R34xxx/R34us

      • 3dBooru

      • Realbooru -> smudged face

    • ImageGrabber Link

  • Out-of-scope Datasets Used

  • Captioning software (partially prepared for release) using:

    • ImgUtils Link

      • Used an entire array of available AIs in this pack plus more.

      • Bounding Boxes

        • BooruS11

        • BooruPP

        • People

        • Faces

        • Eyes

        • Heads

        • HalfBody

        • Hands

        • Nude

        • Text

        • TextOCR

        • Hagrid

        • Censored

        • DepthMidas

        • SegmentAnything YoloV8

      • Classification

        • Aesthetic

        • AI-Detection

        • NSFW Detector

        • Monochrome Checker

        • Greyscale Checker

        • Real or Anime

        • Anime Style or Age -> year based

        • Truncated

    • Hagrid Link

    • MiDaS Link

    • Wd14 Link

    • Wd14 Large Link

    • MLBooru Link

    • Captioning

    • JoyCaption AlphaOne Link

    • T5XXL Blip2 Link

    • SentencePiece Link

    • Quora T5 Small Paraphraser Link

    • SegmentAnything YoloV8 Link