Hasphoria - An Illustrious model intended for use with LLM prompting

An explanation about my workflow

As you might have noticed, I've added quite a bit of extra stuff to my workflow. The biggest changes are the addition of the Rouwei Gemma LLM adapter and all the other non-standard nodes for sampling. The Gemma part is the latest addition to my workflow and ideas, and once set up it's pretty much just prompt and go, so it comes last. Get the workflow from the attachments.

The Other Stuff first

So, there are quite a few different samplers, text encoders, schedulers, and noise generators in there. I could use the default ones, and they'd probably work too. However, I wanted my workflow to produce as few "duds" as possible, meaning the image would be fine to use without post-processing.

Text Encoder

Let's start at the beginning of a prompt: the text encoder. Since I'm going to be using two different encoders (CLIP and Gemma), I'll cover the CLIP one here. Traditionally, this is where you put the danbooru tag prompt, artists included. However, artist tags consume tokens, and quality tags consume tokens too. Given that an SDXL / CLIP prompt can technically only be 75 tokens long, that could mean spending up to 20 tokens per group of 75.

This is because the moment a prompt becomes longer than 75 tokens, it gets split into groups of 75 tokens: 2x75, 3x75, and so on. If only one of those groups contains your artist and quality tokens, they won't apply to the rest of the groups. With long prompts, this would really wash out the artist style. So if you're like me, you'd keep prompts short to avoid overloading them. The other issue is that CLIP is dumb: how often have you run into concept bleed?
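
To see the chunking for yourself, here's a minimal sketch using the Hugging Face CLIP tokenizer (the tokenizer name is just the standard one; your UI does the equivalent internally):

```python
# Count how many 75-token chunks a prompt splits into.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = "artist_tag, masterpiece, best quality, 1girl, ..."  # your full tag prompt
tokens = tokenizer(prompt, add_special_tokens=False)["input_ids"]

# CLIP sees 77 tokens per pass: 75 usable ones plus BOS/EOS.
chunks = [tokens[i:i + 75] for i in range(0, len(tokens), 75)]
print(f"{len(tokens)} tokens -> {len(chunks)} chunk(s) of <=75")
# Artist/quality tags only land in the chunk that contains them,
# which is why their effect fades once a prompt spills past 75 tokens.
```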

So what did I do? First things first: let the CLIP do what it's supposed to do in my workflow: artist tag and quality tag, plus perhaps a supporting tag if Rouwei doesn't know it (yet?). Other than that, you're not going to use it a lot.

But as you might notice, I'm doing some extra stuff up top in the model with those encoders. SDXL originally sends in just "normal" 1024x1024 size conditioning. This usually works just fine, but once you start generating at different resolutions it can make things break apart: double heads, extra limbs, and really weird anatomy at higher resolutions. In my experience, this is because the image was conditioned to have details for that 1024x1024 resolution. By adjusting these values to more appropriate ones, the amount of bad anatomy and weirdness decreased significantly.
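
For reference, diffusers exposes the same size conditioning directly, which makes the trick easy to see outside of a node graph (a hedged sketch, using the stock SDXL base checkpoint as a stand-in):

```python
# SDXL embeds original_size / target_size into the prompt conditioning.
# Passing the real render resolution instead of the 1024x1024 default
# is the "resolution trick" described above.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="1girl, masterpiece",
    width=1024, height=1536,
    original_size=(1024, 1536),   # match the actual resolution...
    target_size=(1024, 1536),     # ...instead of the 1024x1024 default
).images[0]
```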

SUNoiseLatent, Soul Sampler, VPScheduler

These three are the core of the sampling process I'm using. SUNoiseLatent is an alternative noise generator which, if I read right, generates higher-frequency noise. Useless for normal samplers like Euler, but for SDE-type samplers it can be used to get a bit more detail and stability in images. Then all we need is a nice scheduler to determine how the sigmas (the noise levels stepped through during denoising) will go, and we should be good, right? The VPScheduler is just something like karras, simple, or beta: a collection of numbers. I've tried various combos, but in the end all that matters is the starting number (14.61 for SDXL-type models) and the final number (0.03-0.7, depending on model/LoRAs). Then all we need is to make that curve go as smoothly as possible from beginning to end, trying to hit as much of the model in the right parts at the right time.
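
To make the "just a collection of numbers" point concrete, here's an illustrative schedule pinned to the endpoints mentioned above. This uses the Karras rho formula as a stand-in (VPScheduler computes a different curve), but the idea is the same: only the endpoints and the smoothness in between really matter:

```python
# Build a sigma schedule from sigma_max down to sigma_min (Karras-style).
import torch

def karras_sigmas(n_steps: int, sigma_max=14.61, sigma_min=0.03, rho=7.0):
    ramp = torch.linspace(0, 1, n_steps)
    min_inv = sigma_min ** (1 / rho)
    max_inv = sigma_max ** (1 / rho)
    sigmas = (max_inv + ramp * (min_inv - max_inv)) ** rho
    return torch.cat([sigmas, torch.zeros(1)])  # trailing 0 ends the denoise

print(karras_sigmas(8))  # 8 steps from 14.61 down to 0.03
```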

So, we set the SUNoise generator to a high scale; normal samplers won't be able to work with this. To compensate, we also raise the ETA on the Soul sampler to 2. This seemingly works wonders together. The important part now is that it will try to reach the most "average" outcome of the prompt specified: characters will most often end up in the middle if you prompt 1girl, for example. That's just inherent to the trick I'm using, and there's no way around it. The only way out is to decrease the ETA on the sampler below one, making it deterministic again, but then you won't be able to leverage the SUNoise generator properly and should swap in a regular noise generator.
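
If ETA is new to you, here's roughly what it controls, using the standard k-diffusion ancestral step as a stand-in (the Soul sampler's internals differ, but the role of eta is the same: how much fresh noise gets re-injected per step):

```python
# Sketch of one ancestral/SDE-style denoise step. eta=0 is deterministic;
# eta>1 (as in the workflow) leans hard on the noise source, which is why
# it pairs with the high-frequency SUNoise latents.
import torch

def ancestral_step(x, denoised, sigma, sigma_next, eta=2.0, noise_fn=torch.randn_like):
    # Split sigma_next into a deterministic part and a noise part;
    # min() clamps so the injected noise never exceeds sigma_next.
    sigma_up = min(sigma_next,
                   eta * (sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2) ** 0.5)
    sigma_down = (sigma_next**2 - sigma_up**2) ** 0.5
    d = (x - denoised) / sigma            # Euler derivative
    x = x + d * (sigma_down - sigma)      # deterministic move
    return x + noise_fn(x) * sigma_up     # re-inject noise, scaled by eta
```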

Some Model stuff

FreeU, epsilon scaling, and tangential dampening: these are model patches that generally enhance image stability. Feel free to change the values, or just remove them. Test at your own leisure; I think I've set them to pretty good values.
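
If you want to experiment with the same kind of patch outside ComfyUI, diffusers exposes FreeU directly. The values below are the FreeU repo's suggested SDXL defaults, not my tuned ones:

```python
# FreeU rescales the UNet's backbone (b1, b2) and skip (s1, s2) features.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.enable_freeu(s1=0.6, s2=0.4, b1=1.1, b2=1.2)
# ...generate...
pipe.disable_freeu()  # remove the patch to A/B test its effect
```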

The Upscaling

Pretty straightforward: it's a ControlNet tile upscale using the NoobAI eps tile ControlNet model. Combined with the text encoder resolution trick, it helps keep details consistent (up to a certain level). I've added an image blurring node in there, which makes the tile upscale much better. If images get destroyed during the upscale, either lower the denoise, raise the strength of the tile ControlNet, or lower the blur factor of the image. These factors compound, so there's no single answer except testing what works. Again, I think I've set it at a decent medium, so it should mostly work just fine.
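
The blur-before-tile trick is simple enough to show in a few lines of PIL; the radius here is a guess, and the workflow exposes it as its own node:

```python
# Soften the control image before the tile ControlNet sees it, so the
# sampler re-draws detail instead of copying upscale artifacts.
from PIL import Image, ImageFilter

img = Image.open("base_render.png")
upscaled = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)
control_image = upscaled.filter(ImageFilter.GaussianBlur(radius=2))
control_image.save("tile_control.png")
```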

Rouwei-Gemma adapter

Right. So after all that, what's left? The part where we have a different text encoder. This is about using the alternative text encoder provided by Rouwei-Gemma, which is trained on danbooru tags. From my limited testing, not all tags work; put those tags in the CLIP in that case. Use natural language in the LLM text encoder.
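
To make the split concrete, here's roughly how a prompt divides between the two encoders under this convention (the tags and sentence are just illustrative):

```python
# CLIP gets artist/quality tags plus any tag Rouwei doesn't know yet;
# the Gemma/LLM encoder gets plain natural language.
clip_prompt = "some_artist_tag, masterpiece, best quality"
gemma_prompt = ("A girl in a blue sundress sits on a pier at sunset, "
                "a sleeping cat curled up beside her.")
```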

For installing this one, I really have to refer you to the page on CivitAI. There are some extra nodes you need to install, which are hosted on GitHub. There are also installation instructions on GitHub, but they will not lead you to a working solution. So again, use the instructions on the 0.2 page; they will point you to a few GitHub pages of stuff you have to install.

Once all is said and done, you should be able to generate some fun and more complex anime images with Illustrious models.


The model Hasphoria to bring it all together

Hyphoria is a pretty good model; the author did a really good job on mixing it. It just needed a better text encoder, imho, so I gave it the one from hassakuhentai. The UNet on this one is a bit rougher, but that's what model mixing is all about: taking the stuff from the models you like and making your own. In my case: more artists and crazier prompt understanding.
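
For the curious, this kind of component swap looks something like the sketch below at the state-dict level. Real merges are usually done in a merge UI, the filenames are placeholders, and the key prefixes assume SDXL single-file checkpoints (text encoder weights live under conditioner.*):

```python
# Keep one checkpoint's UNet/VAE, graft in another's text encoder.
import safetensors.torch as st

donor = st.load_file("hassaku.safetensors")   # take the text encoder from here
base = st.load_file("hyphoria.safetensors")   # keep everything else from here

for key, tensor in donor.items():
    if key.startswith("conditioner."):        # SDXL text-encoder keys
        base[key] = tensor                    # overwrite only those weights

st.save_file(base, "merged_sketch.safetensors")
```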

I'm sure there's stuff I'm forgetting. If you read this article and have questions, I can often be found in the CivitAI chat on Discord. Happy generating.
