Large-scale finetune of Illustrious with state-of-the-art performance.
(tl/dr: Works exactly as it should without flaws you might encounter in other checkpoints.)
V0.7 comes with a dataset of 7M images (~2M with natural-text captions):
New knowledge for characters, concepts, styles
Most of the things that were not good enough in 0.6 are now at a decent level
Even better prompting with good segmentation and natural-phrase understanding (now it's even possible to easily make one character cosplay another without the typical issues 1, 2, 3)
Better stability and broader coverage, fewer biases
Excellent performance for both very simple and complex prompts
Lots of general changes and fixes
Vpred version for v0.7 is released!
Key advantages:
Easy and convenient prompting
Great aesthetics, anatomy, and stability along with versatility
Vibrant colors and smooth gradients without a trace of burning
Full brightness range even with epsilon
35k+ artist styles, many general styles, almost any character
Pure training without any merged weights, loras, tweakers, etc.; you can add your own if needed
In addition to the above, compared with vanilla Illustrious and NoobAI:
No more annoying watermarks
No character bleed and related side effects (unwanted outfit, style, and composition changes)
No spawning of strange creatures, background SFX, or extra pairs of breasts
Better coherence, prompt following, anatomy (significant boost over Illustrious, slight over NoobAI)
Artist styles look exactly as they should (and lots of new ones added)
Better prompt following, without ignored tags or the need for (higher weights:1.4)
Forget about long schizo negative prompts
Stable style without random fluctuations across different seeds
New characters
Dataset cut-off - 20th December 2024.
Features and prompting:
Basic:
The checkpoint works with both short, simple and long, complex prompts. However, if there are contradictory or weird things, then unlike with other checkpoints they won't be ignored and will affect the output. No guide-rails, no safeguards, no lobotomy.
Just prompt what you want to see and don't prompt what shouldn't be in the picture. If you want a view from above, don't put ceiling into the positive prompt; if you want a cropped view with the head out of frame, don't write a detailed description of the character's facial features, and so on. Pretty simple, but sometimes people miss it.
Version 0.7 comes with several improvements in prompt understanding and segmentation.
However, because SDXL is based on CLIP text encoders with a limit of 75 (77) tokens per input, it is still quite limited due to prompt chunking. If you want to specify features for a character/object and separate them from other prompt parts, make sure they are in the same chunk, and optionally separate them with BREAK. This will not solve the trait-mixing problem completely, but it can reduce it and improve overall understanding, since the text encoders in RouWei process the whole sequence, not individual concepts, better than other checkpoints do.
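If you want to sanity-check where your chunks end, a minimal sketch like the one below can help. It assumes the openai/clip-vit-large-patch14 tokenizer from the transformers library; UI frontends may count tokens slightly differently, so treat the numbers as approximate.

```python
# Minimal sketch: count CLIP tokens per prompt segment so that related tags
# stay inside one 75-token chunk. Assumes the openai/clip-vit-large-patch14
# tokenizer; actual UIs may count slightly differently.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = (
    "1girl, karin \\(blue archive\\), black gloves, ponytail, smile BREAK "
    "outdoors, night, cityscape, masterpiece, best quality"
)

# Most UIs pad each BREAK-separated segment into its own 75-token chunk.
for i, segment in enumerate(prompt.split("BREAK")):
    n_tokens = len(tokenizer(segment.strip()).input_ids) - 2  # minus BOS/EOS
    print(f"chunk {i}: {n_tokens} tokens (limit 75)")
```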
The dataset contains only booru-style tags and (simplified) natural-text expressions. Despite having a share of furry content, all captions have been converted to classic booru style to avoid a number of problems that can arise when mixing different tagging systems, so e621 tags won't be understood properly.
Sampling parameters:
~1 megapixel for txt2img, any aspect ratio with resolution a multiple of 32 (1024x1024, 1056x, 1152x, 1216x832, ...). Euler_a, 20..28 steps.
CFG: 4..9 for the epsilon version (7 is best), 3..5 for the vpred version.
Sigmas multiply may improve results a bit; CFG++ samplers work fine. LCM/PCM/DMD/... and exotic samplers are untested.
Karras scheduler doesn't work well; the same goes for some samplers.
Highres fix: x1.5 latent upscale + denoise 0.6, or any GAN upscaler + denoise 0.3..0.55.
For the vpred version a lower CFG of 3..5 is needed!
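For reference, a minimal diffusers sketch with these settings might look like the following; the checkpoint file name and prompts are placeholders, not something distributed with this page.

```python
# Sketch of the recommended txt2img settings using diffusers.
# "rouwei_v07_epsilon.safetensors" is a placeholder path to the epsilon checkpoint.
import torch
from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionXLPipeline.from_single_file(
    "rouwei_v07_epsilon.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

# Euler a, as recommended above
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="1girl, solo, outdoors, masterpiece, best quality",
    negative_prompt="worst quality, low quality, watermark",
    width=1216, height=832,     # ~1 megapixel, sides a multiple of 32
    num_inference_steps=24,     # 20..28 steps
    guidance_scale=7.0,         # CFG 4..9 for epsilon (7 is best), 3..5 for vpred
).images[0]
image.save("out.png")
```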
Quality classification:
Only 4 quality tags:
masterpiece, best quality
low quality, worst quality
Nothing else. Meta tags like lowres have been removed; better not to use them. Low-resolution images have been either removed or upscaled and cleaned with DAT, depending on their importance.
Negative prompt:
worst quality, low quality, watermark
That's all; no need for "rusty trombone", "farting on prey" and the like. Do not put tags like greyscale or monochrome into the negative prompt unless you understand what you are doing: it will lead to burning and over-saturation, and colors are fine out of the box.
However, if you want to boost or adjust them, please use the extra tags from the brightness/colors/contrast section below.
Artist styles:
Grids with examples and a list (can also be found in "training data").
Used with "by " it's mandatory. It will not work properly without it.
"by " is used as meta-token for styles to avoid mixing/misinterpret with tags/characters of similar or close name. This allows to have a better results for styles and at the same time avoid random style fluctuation that you may observe in some other checkpoints.
Using multiple artists gives very interesting results and can be controlled with prompt weights.
General styles:
2.5d, anime screencap, bold line, sketch, cgi, digital painting, flat colors, smooth shading, minimalistic, ink style, oil style, pastel style
Booru tags styles:
1950s (style), 1960s (style), 1970s (style), 1980s (style), 1990s (style), 2000s (style), animification, art nouveau, pinup (style), toon (style), western comics (style), nihonga, shikishi, minimalism, fine art parody
and everything from this group.
They can be used in combination (with artists too), with weights, in both positive and negative prompts, as in the sketch below.
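For example (the artist handles below are placeholders, not recommendations), two styles can be blended and biased with prompt weights:

```python
# Hypothetical style mix: two "by " artist tokens, the second one down-weighted.
prompt = "by artist_a, (by artist_b:0.6), 1girl, night, city lights, masterpiece, best quality"
```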
Characters:
Use the full-name booru tag with proper formatting, like karin_(blue_archive) -> karin \(blue archive\), and use skin tags for better reproduction, like karin \(bunny\) \(blue archive\). The autocomplete extension might be very useful.
Most characters are recognized just by their booru tag, but results will be more accurate if you describe their basic traits. Here you can easily redress your waifu/husbando just by prompting, without suffering from the typical leaks of basic features.
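As a small illustration of the formatting rule (a hypothetical helper, assuming the usual convention of replacing underscores with spaces and escaping parentheses):

```python
# Hypothetical helper: convert a raw booru tag like "karin_(blue_archive)"
# into the escaped prompt form "karin \(blue archive\)".
def booru_tag_to_prompt(tag: str) -> str:
    return tag.replace("_", " ").replace("(", r"\(").replace(")", r"\)")

print(booru_tag_to_prompt("karin_(blue_archive)"))          # karin \(blue archive\)
print(booru_tag_to_prompt("karin_(bunny)_(blue_archive)"))  # karin \(bunny\) \(blue archive\)
```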
Natural text:
Use it in combination with booru tags - it works great. Or use only natural text after typing style and quality tags. Or use just booru tags and forget about it; it's all up to you. To get the best performance, keep track of the CLIP 75-token chunks.
About 2M images in the dataset have hybrid natural-text captions made by Opus-Vision, GPT-4o and ToriiGate, which improves segmentation, provides some understanding, and lets the model respond to chains of phrases, not just single words. But overall performance is not even close to Flux or SD3.5.
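One possible way to mix the two is sketched below (the artist handle and tags are placeholders); keep the whole thing within a single 75-token chunk:

```python
# Hypothetical hybrid prompt: style and quality tags first, then booru tags,
# then a short natural-text phrase, all inside one 75-token chunk.
prompt = (
    "by artist_a, masterpiece, best quality, "
    "1girl, fox ears, fox tail, white hair, outdoors, "
    "she is sitting on a park bench and hugging her own tail"
)
```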
Lots of Tail/Ears-related concepts:
tail censor, holding own tail, hugging own tail, holding another's tail, tail grab, tail raised, tail down, ears down, hand on own ear, tail around own leg, tail around penis, tail through clothes, tail under clothes, lifted by tail, tail biting, tail insertion, tail masturbation, holding with tail, ...
(booru meaning, not e621) and many others via natural text. The majority work perfectly; some require rerolling.
Brightness/colors/contrast:
You can use extra meta tags to control it:
low brightness, high brightness, low saturation, high saturation, low gamma, high gamma, sharp colors, soft colors, hdr, sdr
They work in both the epsilon and vpred versions, and they work really well.
Unfortunately there is an issue: the epsilon model relies on them too much. Without low brightness or low gamma, or limited range in the negative, it might be difficult to achieve true (0,0,0) black; the same is often true for white.
Both the epsilon and vpred versions have effectively true zero-terminal-SNR behavior and a full range of colors and brightness without the common flaws. But they behave differently, so just try them.
Vpred version
The main thing you need to know: lower your CFG from 7 down to 5 (or less). Otherwise, usage is similar, with some advantages.
As of v0.7, vpred seems to work flawlessly. It shouldn't suffer from ignoring tags close to the 75-token chunk borders like NAI. It is harder to get burned images - even at CFG 7 the output is usually just over-saturated but with smooth gradients, which can be useful for some styles. Yes, it can make anything from (0,0,0) to (255,255,255). You will find the brightness meta tags described above quite useful for easier/lazy prompting. To get the darkest image, put high brightness into the negative and/or use the low brightness, low gamma tags. If you don't like very bright skin on a dark background and want to reduce contrast (or, on the contrary, enhance the effect), use hdr/sdr in the negative/positive.
It was reported that in rare cases some prompts show a drop in contrast. Other vpred models seem to show the same behavior with such prompts; adding a "separator" closer to the border of the 75-token chunk fixes this. However, with 0.7 I haven't encountered it myself.
To run the vpred version you will need a dev build of A1111, ComfyUI (with a special loader node), or reForge. Just use the same parameters as epsilon (Euler a, CFG 3..5, 20..28 steps). No need to use CFG rescale, but you can try it.
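For diffusers users, the vpred checkpoint presumably needs the scheduler switched to v-prediction with zero-terminal-SNR rescaling; the sketch below is under that assumption (the file name is a placeholder, and it assumes a diffusers version where EulerAncestralDiscreteScheduler supports rescale_betas_zero_snr):

```python
# Sketch: loading the vpred checkpoint with diffusers.
# "rouwei_v07_vpred.safetensors" is a placeholder path.
import torch
from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

pipe = StableDiffusionXLPipeline.from_single_file(
    "rouwei_v07_vpred.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipe.scheduler.config,
    prediction_type="v_prediction",   # v-prediction objective
    rescale_betas_zero_snr=True,      # zero-terminal-SNR noise schedule
)

image = pipe(
    prompt="1girl, dark background, low brightness, masterpiece, best quality",
    negative_prompt="worst quality, low quality, watermark, high brightness",
    width=1024, height=1024,
    num_inference_steps=24,
    guidance_scale=4.0,               # keep CFG in the 3..5 range for vpred
).images[0]
image.save("out_vpred.png")
```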
Base model:
The epsilon and vpred versions here received a brief aesthetic polishing after the main training to improve small details and coherence. If you want to use RouWei in merges, extract something without bringing in that last step, or finetune it, you can find the base version of RouWei on Hugging Face.
Known issues:
Of course there are:
The epsilon version relies too much on brightness meta tags; sometimes you will need to use them to get the desired brightness shift
Some newly added styles/characters might not be as good and distinct as they deserve to be
Inferior furry-related knowledge compared to NoobAI
To be discovered
Requests for artists/characters in future models are open. If you find an artist/character/concept that performs weakly, is inaccurate, or has a strong watermark, please report it and it will be added explicitly. Follow for new versions.
JOIN THE DISCORD SERVER
License:
Same as Illustrious. Feel free to use it in your merges, finetunes, etc.; just please leave a link.
How it's made
I'll consider making a report or something like it later. For sure.
In short, 98% of the work is dataset preparation. Instead of blindly relying on the tag-frequency-based loss weighting from the NAI paper, a custom guided loss-weighting implementation along with an asynchronous collator for balancing was used. ZTSNR (or close to it) with epsilon prediction was achieved using noise scheduler augmentation.
Spent compute: about 25 days on 4xH100 (apart from research and failed attempts).
Thanks:
First of all I'd like to acknowledge everyone who supports open source and develops and improves code. Thanks to the authors of Illustrious for releasing the model, and thanks to the NoobAI team for being pioneers in open finetuning at such a scale, sharing experience, and raising and solving issues that previously went unnoticed.
Personal:
Artists who wish to remain anonymous - for sharing private works; a few anonymous persons - donations, code, captions, etc.; Soviet Cat - GPU sponsoring; Sv1. - LLM access, captioning, code; K. - training code; Bakariso - datasets, testing, advice, insights; NeuroSenko - donations, testing, code; LOL2024 - a lot of unique datasets; T.,[] - datasets, testing, advice; rred, dga, Fi., ello - donations; other fellow brothers that helped. Love you so much ❤️.
And of course everyone who gave feedback and made requests; it's really valuable.
If I forgot to mention anyone, please notify.
Donations
If you want to support - share my models, leave feedback, make a cute picture with kemonomimi-girl. And of course, support original artists.
AI is my hobby; I'm spending money on it and not begging for donations. However, it has turned into a large-scale and expensive undertaking. Consider supporting it to accelerate new training and research.
(Just keep in mind that I can waste it on alcohol or cosplay girls)
BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
If you can offer GPU time (A100+), PM me.