Last edited: 01/06/2024
This article presents the Eclipse XL model that Wasabyia and I trained together; it will be expanded in the near future. The attached CSV contains every tag used during the training of the model and can be loaded into the booru autocompleter extension for Auto1111.
Tagging Process
Tools used
Auto-Taggers
We combined two auto-taggers. Because they were trained on different datasets, together they cover a wider range of tags:
SmilingWolf: WD SwinV2 Tagger v3
DeepGHS: Caformer models, as well as image classification and aesthetic score
These taggers were trained on booru and rule34 databases. Some tags in these databases are ambiguous and poorly suited to training a model, so we replaced them with better ones, as explained later.
Image Cleaning
We cleaned the images both by hand and with automated tools in order to:
Remove signatures: this reduces the need for signature in the negatives
Remove text, hearts and symbols: these elements add noise to the image and are hard to control, so we removed them from many of our images
Crop images: when the subject is not centered, or the image is mostly empty, letter-boxed, or has some other fancy framing, we crop it to reduce the amount of unwanted content in training and to focus more on the subject
Merge the image with a white underlayer to remove transparency: this lets us control the colour of the background, so that the background tags (white background, simple background) are accurate
Not all images were cleaned this way; we did it when we had free time.
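As a rough illustration of the white underlayer step (not the actual tool used for the dataset), the blend for each pixel is the standard "over" operation against white. The sketch below works on plain RGBA tuples so that it runs without any imaging library; the function names and the tiny sample image are made up for the example.

```python
def flatten_on_white(pixel):
    """Composite one RGBA pixel (0-255 per channel) over an opaque white layer."""
    r, g, b, a = pixel
    alpha = a / 255
    # Standard "over" blend against white (255, 255, 255).
    return tuple(round(c * alpha + 255 * (1 - alpha)) for c in (r, g, b))


def flatten_image(rows):
    """Apply the blend to a whole image given as rows of RGBA tuples."""
    return [[flatten_on_white(p) for p in row] for row in rows]


# Hypothetical 2x2 image: opaque red, fully transparent,
# half-transparent blue, opaque dark pixel.
image = [
    [(200, 30, 30, 255), (0, 0, 0, 0)],
    [(0, 0, 255, 128), (10, 20, 30, 255)],
]
flat = flatten_image(image)
print(flat)
```

Fully transparent pixels become pure white, which is what makes a white background tag accurate after cleaning.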
Tags filtering and Our custom Tagger
We made a tagging program that combines all of our tools and scripts in a fast local interface: https://github.com/HaW-Tagger/HWtagger
Tokenizer and Training Difficulties on PonyV6XL
Score issues on PONY
Scores are an issue because they take up far too many tokens for each image. To drive the point home, here is an example of a caption on an image, first with the score tags and then with a single quality tag:
132 tokens: score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up, source_anime, blue bandeau, mechanical halo, short dress, necktie between breasts, mask, tiara, detached collar, blue necktie, black gloves, armored boots, from above, heavy breathing, thighs, skindentation, simple background, bare shoulders, cleavage, solo, blush, underboob, 1girl, sweat, alsace (azur lane), huge breasts, armpits, steaming body, arms behind head, parted lips, blue hair, white background
96 tokens: masterpiece, blue bandeau, mechanical halo, short dress, necktie between breasts, mask, tiara, detached collar, blue necktie, black gloves, armored boots, from above, heavy breathing, thighs, skindentation, simple background, bare shoulders, cleavage, solo, blush, underboob, 1girl, sweat, alsace (azur lane), huge breasts, armpits, steaming body, arms behind head, parted lips, blue hair, white background
We also know that tokens at the front absorb more information than the others and are emphasised more during training, so adding more than 30 tokens at the front weakens everything else. The creator of Pony is aware of the problem and will try to fix it in PonyV7.
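To put the two captions above in numbers, here is a small calculation using only the token counts quoted in this article (132 and 96); the variable names are just for illustration.

```python
with_scores = 132  # token count of the caption with the score prefix
with_single = 96   # token count of the same caption with one quality tag

saved = with_scores - with_single
share = saved / with_scores
print(f"{saved} tokens saved ({share:.0%} of the original caption)")
```

In this example the score prefix costs 36 tokens more than the single quality tag, a bit more than a quarter of the whole caption, and all of those tokens sit at the front.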
We started making this checkpoint with multiple goals in mind, one of them being to make training LORAs on PONY easier.
When you train a LORA you face the following dilemma. You either:
use the score tags as intended (score_9, score_8_up, etc ... )
use one score tag and hope the user uses that same one, or your LORA becomes unusable
ignore the scores completely
If you don't use the scores, then during image generation the scores (especially for users who use them extensively) will tend to overpower the LORA and may make it look under-cooked when in reality it is properly trained. If you use all the scores, your trigger tags will be weakened, and in more difficult LORAs your smaller trigger tags will disappear. Using one score is a compromise, but it turns that score into a trigger tag, which is not ideal either.
General tags issues
SDXL uses OpenCLIP to convert text (prompt or caption) to numbers via the tokenizer; these numbers are then sent to the text encoder, which influences the generation.
The tokenizer is flawed and biased. It uses a table of small text elements (example: on, floor, table, ",", xxxxx, ...), each associated with a unique number. There are multiple aspects to this, but it boils down to the following: common English words, long or short, are one token, the rest is subdivided, and spaces (as well as hyphens and underscores) act as delimiters:
"explorecanada ": 1 token: 36600(explorecanada)
"explore canada ": 2 tokens: 5147(explore), 2698(canada)
"explorefrance ": 2 tokens: 10405(explore), 3552(france)
In these examples, the model cannot tell that explorecanada, explore canada and explorefrance have something in common, because they don't share any token.
We can all see the problems coming: NSFW words that are not anatomical are not understood as one token, and Japanese and other non-English words are not one token either, so they are simply a combination of smaller tokens.
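The explorecanada example can be reproduced with a toy tokenizer. This is only a sketch: it uses greedy longest-match over a five-entry vocabulary instead of the real BPE merge procedure, the IDs are the ones quoted above, and the choice of which pieces carry the end-of-word marker "</w>" is an assumption made so that the example matches those IDs.

```python
# Toy vocabulary. "</w>" marks a piece that ends a word (assumed layout).
VOCAB = {
    "explorecanada</w>": 36600,
    "explore</w>": 5147,
    "canada</w>": 2698,
    "explore": 10405,
    "france</w>": 3552,
}


def tokenize(text):
    """Split each whitespace-separated word into the longest known pieces."""
    ids = []
    for word in text.split():
        rest = word + "</w>"
        while rest:
            # Pick the longest vocabulary piece that prefixes the remaining text.
            piece = max((p for p in VOCAB if rest.startswith(p)), key=len, default=None)
            if piece is None:
                raise ValueError(f"no vocabulary piece matches {rest!r}")
            ids.append(VOCAB[piece])
            rest = rest[len(piece):]
    return ids


for prompt in ("explorecanada", "explore canada", "explorefrance"):
    print(prompt, tokenize(prompt))
```

The three ID lists have no element in common, so nothing in the input tells the model that the three prompts are related.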
Why does it matter? This number is the source of the word's identity, so merging and blending tokens causes problems for tags that have tokens in common. (Did you try prompting fushiguro in PonyV6 XL base? Spoiler: you can't, despite there being more than enough images on boorus and rule34.)
There are plenty of cases of prompt bleeding:
jet-black skin: it generates characters with black skin and jets
fushiguro: shares tokens with guro, so the image is unusable
... (I will add more when I remember more)
Other tags that are not one token and may bleed into a lot of concepts: most characters, fellatio (2 tokens), cunnilingus (4 tokens), futanari (3 tokens), anal (2 tokens), ahegao (2 tokens), yukata (2 tokens), serafuku (4 tokens), ...
With that in mind, when we came up with new tags we aimed for a few things that help training:
small token length
avoid bleeding into unrelated tags
proper bleeding with related tags (example: red leotard bleeds with blue leotard and red shirt)
Example: face of the people who sank all their money into the fx (meme) -> fx face
Custom Tags
Disclaimer: Not all tags used or presented in the article work in the current version of our model, we simply decided to add them in case we have enough images to support them in the future.
Quality tags
We wanted quality tags that are one token long and only express quality, so we avoided words such as bad that can be used for other things. We came up with:
masterpiece: 5088 images
best: 6228 images
great: 12023 images
good: 18615 images
average: 16993 images
worse: 2690 images
worst: 2025 images
We aimed for a high number of middle-ground images and a relatively low number of low-quality images, so most images sit below masterpiece and best. It's important to include lower-quality images to improve the model's understanding of what makes a good image.
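The distribution described above can be checked directly from the listed counts. The snippet below only uses the numbers from this article; the grouping into a "middle" band (great, good, average) is our reading of "middle ground" and is an assumption.

```python
quality_counts = {
    "masterpiece": 5088,
    "best": 6228,
    "great": 12023,
    "good": 18615,
    "average": 16993,
    "worse": 2690,
    "worst": 2025,
}

total = sum(quality_counts.values())
for tag, count in quality_counts.items():
    print(f"{tag:<12}{count:>6}  {count / total:6.1%}")

# Assumed grouping: great/good/average form the "middle ground".
middle_share = sum(quality_counts[t] for t in ("great", "good", "average")) / total
low_share = sum(quality_counts[t] for t in ("worse", "worst")) / total
print(f"total={total} middle={middle_share:.1%} low={low_share:.1%}")
```

The total is in line with the roughly 63k training images mentioned in the training section, with about three quarters of them in the middle band.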
Composition tags
One of our pet peeves in the image generation world is the lack of big models that support wild lighting styles (this applies to all SD1.5 models; for SDXL we didn't delve deep into realistic models), so we made sure that our model supports more advanced lighting to improve control during image generation. We used the composition tag (1 token) to gather lighting-related information:
dark composition: No light sources and really darkened image
dun composition: The image is fully visible but the light is really diffuse, no glowing light sources
ambient composition: The image is fully visible and there is a glowing source of light (artificial, glowing, focus,...)
dim composition: The image is fully visible and there are multiple sources of light
bright composition: Images that use normal~bright colours with very strong highlights (white highlights)
vibrant composition: Images that heavily use high intensity colours for every part of the image (not only the highlights)
contrast composition: when there are both dark and light parts in the image (ex: heaven & hell, left (dark) & right (bright), etc); all parts of the image need to be part of the contrast
Style tags
Flat color
This style is defined as having flat colours, no shadows or highlights.
Illustration Style
This style is defined as an anime image with basic shading (little to no gradients used for shade).
Anime coloring
This style is defined as close to a screenshot style of an anime.
Western style
This style is defined as any western images that don't match the base style.
3Ds
We have separated 3D into multiple styles that are hard to properly define:
3D blender: common in overwatch porn
3D koikatsu: images that mimic the style of the game
3D filmmaker: images coming from the source engine video animation software (mostly low quality 3D)
3D mmd: images that have a miku-miku-dance style
Realistic & Photorealistic
Realistic are mostly 3d images but not restricted to them. Photorealistic images are restricted to real photography.
Artists
In order to refine the base style of the model, we needed to separate out images that had a strong style that didn't really belong to our desired style but still held precious concepts. We therefore created tags that let us separate artists from other tags. We didn't want to use the artist names directly, since we merge artists that are similar in style. We decided to use the artist token, not followed by a space, plus a sequence of characters that has no known meaning but still forms one token: artistooo artistppp artistdd artistssss artistjj artistzz artistxx artistsss artistzzzz artistnn artistmmmm artistmm artistcc
Dense & Intricate
Warning: dense & intricate should be used together with tags that call for dense or intricate settings; without them, random things will happen and corrupt the output.
Intricate: the details on the objects/subject are tightly packed and the design is not simple (ex: lingerie, complex dress, designed armor trims, multiple accessories, etc).
Dense: the image has multiple objects/subjects that make it more densely packed.
Other
Bold lines
Sketch
...
Translucent, Transparent, See-through and X-ray tags
In most models, there is no clear distinction between translucent, transparent, see-through, x-ray and cross-section, and the definitions on the websites that host the tags are not necessarily great for training a model. So we are making the following changes:
Cross-section: only for when a zoom-up is made in an internal action not directly on top of the organ
X-ray: when you can see through a body part, object, wall, floor, ..., but the character inside the image can't, and you wouldn't be able to see through the object in real life
In this way, you need to append a specific tag to make the specific part see-through: x-ray uterus, x-ray wall, x-ray stomach, x-ray body, x-ray ass, x-ray throat, x-ray chair, ...
See-through: for objects and clothes that are see-through/wet, like a white dress or sheer fabric
Transparent: for objects/characters that are transparent, a glass-like material: transparent raincoat, transparent chair, ...
Translucent: for opaque but see-through characters/objects: translucent skin, translucent chair, ...
For more horror settings, like a missing skull for zombies, the tag would be exposed brain. We don't recommend prompting for gore or these exposed tags, as the results are HEAVILY disturbing.
Altered tags
Some tags were changed for various reasons, including merging tags from different sources (rule34 and danbooru):
bunny earrings, bunny mask, bunny ornament, ... -> rabbit earrings, rabbit mask, rabbit ornament, ...
minor emblems (bc freedom, sakura empire, ...) -> emblem
hardhat -> hard hat
head mounted display -> head-mounted display
heart lock (kantai collection) -> heart-shaped lock
captive bead rings -> bulb nipple rings
ofuda on nipples -> ofuda pasties
ofuda on pussy -> ofuda maebari
talisman -> ofuda
stump cover -> covered stub
falling petals -> falling leaves
peeping -> peeking
rear naked chock -> rear choke hold
fig gesture, fig sign -> segg gesture
toast (gesture) -> toasting (gesture)
bat -> bat (animal)
school of fish -> multiple fish
dancer (fire emblem: three houses), dancer's costume (dq), dancer (three houses) -> dancer
otoko no ko -> femboy: femboy is better in terms of token than otoko no ko
housewife -> mature female
operator -> military operator
newhalf -> futanari: futanari is better because pony understands it better
biker clothes -> bikesuit
labcoat -> lab coat
fortified suit -> pilot suit
gothic fashion, gothic -> goth fashion
kabuto (helmet) -> kabuto
japan ground self-defense force, japan maritime self-defense force -> japan self-defense force
micro panties -> micropanties
nightwear, sleep wear -> sleepwear
f-22 raptor, f-15 eagle, f-14 tomcat -> fighter jet
spacesuit -> space suit
topless male, topless female, male swimwear challenge -> topless
telnyashka -> white shirt, striped shirt
assisted exposure, wardrobe error, unlikely accident -> accidental exposure
bralines -> bra through clothes
breasts outside -> breasts out
josou seme -> crossdressing
dissolving clothing -> dissolving clothes
jolly roger -> skull and crossbones
qing guanmao -> qingdai guanmao
chest sarashi -> sarashi
pussy cutout -> crotch cutout
tail slit clothes -> tail cutout
panafrican bikini, pan-african bikini -> paf bikini
swim briefs -> swim trunks
beltbra -> belt bra
bunny suit -> playboy bunny leotard
x pasties -> cross pasties
power suit (metroid) -> power suit
3boys,4boys,5boys,6+boys -> multiple boys: we decided to remove the higher counts since they don't work in most cases anyway
3girls,4girls,5girls,6+girls -> multiple girls: we decided to remove the higher counts since they don't work in most cases anyway
submissive female -> femsub
submissive male -> malesub
face of the people who sank all their money into the fx (meme), face of the people who sank all their money into the fx -> fx face
bar (place) -> bar
intravenous drip -> iv drip
speed lines -> motion lines
cobweb, cobwebs, spider webs, spiderweb, spiderwebs -> spider web
yukkuri shiteitte ne -> yukkuri
cola -> coca-cola
shaved head -> bald
poke ball (basic) -> poke ball
one-eye closed -> one eye closed
pointed teeth,pointy teeth, spiked teeth -> sharp teeth
track mark, track marks -> drug mark
meat lines, cut lines, meat marks -> livestock mark
glory wall,stuck in wall -> through wall
partially visible anus -> anus peek
clothes , nude -> clothed , nude
cum while penetrated, ejaculating while penetrated -> forced ejaculation
gaping vagina -> gaping pussy
mind control -> hypnosis
netorare -> ntr
cleft of venus, partially visible vulva, clitoris slip -> pussy peek
egg vibrator -> remote control vibrator
artificial vagina -> onahole
ass smack -> slapping
cervical penetration -> cervix penetration
urethral -> urethra penetration, urethra insertion
flashing -> exhibitionism
masturbation through clothing -> masturbation through clothes
prostration -> dogeza
female ejaculation -> squirting
stiletto heels, pumps,strappy heels,lace-up heels,gladiator heels -> high heels
panzerkampfwagen iv, pzkpfw iv -> panzer iv
doujin cover, magazine cover, cover image -> cover page
1980s (style), 1990s (style) -> retro artstyle
crack of light -> sliver of light
sample watermark,character watermark,commission watermark,photo date watermark -> watermark
all signatures, qrcodes, patreon username, etc ... are converted to signature only
guns, weapons, planes and tanks were merged to remove most show specific or country specific variations
Specific uniforms (maid, school, swimsuits, tracksuit, ...) were merged to a simplified name
Some character variations were merged
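Applying this kind of alias table to a caption can be sketched as follows. This is not the HWtagger implementation: the ALIASES dictionary holds only a handful of the mappings listed above, and remap_caption is a hypothetical helper name. Duplicates created by merging are removed while the original tag order is kept.

```python
# A small excerpt of the mappings listed above (old tag -> new tag).
ALIASES = {
    "hardhat": "hard hat",
    "labcoat": "lab coat",
    "talisman": "ofuda",
    "cobweb": "spider web",
    "spiderweb": "spider web",
    "3girls": "multiple girls",
    "4girls": "multiple girls",
}


def remap_caption(caption):
    """Replace aliased tags in a comma-separated caption and drop duplicates."""
    seen = set()
    remapped = []
    for raw in caption.split(","):
        tag = raw.strip()
        if not tag:
            continue
        tag = ALIASES.get(tag, tag)
        if tag not in seen:
            seen.add(tag)
            remapped.append(tag)
    return ", ".join(remapped)


print(remap_caption("3girls, labcoat, talisman, ofuda, cobweb, spiderweb"))
```

Here talisman and ofuda collapse into a single ofuda tag, and the two spider web spellings collapse into one spider web tag.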
Added Tags
Tentacles
We currently have only a small dedicated dataset for tentacles. When we start releasing LORAs for them, it means that the next iteration of the model will be better at them:
living wall: when a wall is made of fleshy tentacles
living floor: when a floor is made of fleshy tentacles
Condoms
Condom packet strip: a band of condom wrappers
Condom wrapper
Opened condom wrapper: when the condom wrapper is torn/opened
Holding condom wrapper: when you hold the condom wrapper (replacing holding condom in this case)
Holding condom packet strip: when you hold the condom packet strip
Tattoos
spade tattoo
ass spade tattoo
breast spade tattoo
vine tattoo
coiled tattoo
Ofuda & Talisman
We have tweaked the ofuda tags as follows:
chinese ofuda/japanese ofuda: for ofuda that are akin to the jiangshi or miko
ofuda maebari: ofuda that covers the pussy
ofuda pasties: ofuda that cover the nipples
ofuda on clothes/head/legs...
floating ofuda: ofuda that floats around
ofuda between fingers: ofuda that are held between fingers
ofuda between breasts
burning ofuda
hanging ofuda: ofuda that are badly attached
ofuda panties, ofuda bra, ofuda slingshot swimsuit: ofuda that are attached as strings instead of glued to the body
POV
pov hands: when both hands are visible
pov one hand: when one hand is visible
viewer holding {bottle, box, condom, ...}
viewer grabbing breast
Textures
We added some generic textures for objects, which we will expand on in the future. Don't hesitate to suggest new ones:
warty: surface covered by small bumps
bumpy: surface that has medium to big bumps, spaced apart from each other
spiked: surface with spikes
Dildos
bumpy dildo
ribbed dildo
spiked dildo
gross dildo
warty dildo
Gloves
warty gloves
Tentacles
warty tentacles
bumpy tentacles
Training Info
We used batch 10 on 63k images for a total of 10 epochs. Here are the resolutions trained on:
We also implemented a custom dropout function to regulate the tags that were bad for generations: we didn't drop signature, watermark, ..., but we did drop masterpiece and best more often. We shuffled the quality tags within the first 25% of the tags so they weren't always the first tags during training. We will add more information about the training if necessary.
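The following is a minimal sketch of what such a caption pre-processing step could look like; it is not the training code that was actually used. The drop probabilities (DROP_PROB, DEFAULT_DROP) are invented values, process_caption is a hypothetical name, and the placement rule (a random slot within the first quarter of the caption) is one possible reading of the shuffling described above.

```python
import random

QUALITY_TAGS = {"masterpiece", "best", "great", "good", "average", "worse", "worst"}
NEVER_DROP = {"signature", "watermark"}
# Hypothetical probabilities: the real values are not given in the article.
DROP_PROB = {"masterpiece": 0.3, "best": 0.3}
DEFAULT_DROP = 0.1


def process_caption(tags, rng):
    """Drop tags at random, then place quality tags in the first 25% of the caption."""
    kept = []
    for tag in tags:
        if tag not in NEVER_DROP and rng.random() < DROP_PROB.get(tag, DEFAULT_DROP):
            continue  # tag dropped for this training step
        kept.append(tag)

    quality = [t for t in kept if t in QUALITY_TAGS]
    others = [t for t in kept if t not in QUALITY_TAGS]
    for tag in quality:
        # Allowed slots: the first quarter of the caption once the tag is inserted.
        window = max(1, (len(others) + 1) // 4)
        others.insert(rng.randrange(window), tag)
    return others


caption = ["great", "1girl", "solo", "blue hair", "white background",
           "signature", "from above", "thighs", "blush"]
print(process_caption(caption, random.Random(0)))
```

With a single quality tag per caption, the tag always lands in the first quarter but not always at index 0, while signature and watermark are never removed.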
Prompting Recommendation
General prompting
Hecatonchirea and Wasabiya ran slightly different tests, but we recommend the following settings and base prompts:
Steps: 24~36
Resolution: 832x1216
Highres: 12 steps, with a 1.5 upscale and a 0.35 denoise using the upscaler of your choice
Positive prompt: masterpiece, best, (great:0.5),
Negative prompt: worst, worse, average, signature, watermark, film grain, blurry
NSFW prompts:
positive: masterpiece, best, (great:0.5), uncensored,
SFW prompts:
negative: nude, nipples, pussy, penis, anus, uncensored, mosaic censoring, bar censor
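For reference, the highres settings recommended above (832x1216 base resolution, 1.5 upscale) work out to the following final image size. Only the numbers from the list are used; the variable names are illustrative.

```python
base_width, base_height = 832, 1216  # recommended base resolution
upscale = 1.5                        # recommended highres upscale factor

final_width = int(base_width * upscale)
final_height = int(base_height * upscale)
print(f"{final_width}x{final_height}")
```

This gives a 1248x1824 image after the highres pass.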
General Tips
Don't hesitate to combine styles and composition tags to get a more exotic result
Don't hesitate to add styles but lower their weights: (anime coloring:0.3) outputs great images
Since we heavily tagged our images, we recommend always including a camera position (cowboy shot, full body, etc, ...), hair colors, eye colors, etc, ...
Use the tags.csv that we share: it contains all the tags that we used for training, and sticking to them will improve your outputs
Specific Tips
Furry
We have furry images but they are not our main concern. If you want furry, we recommend adding furry female/furry male and western style, and putting 3d in the negatives. snout, paws, fewer digits and fur colors can also help make it more western.
Pony
We don't have any pony images in our dataset, but the model should still know them from base PONYV6.
We will expand specific tips when we experiment with the model more.
LORA Training Recommendation
We recommend using the Derrian distro to train LORAs with the following config:
[[subsets]]
caption_extension = ".txt"
image_dir = "/path/to/images"
name = "GOOD"
num_repeats = 1
shuffle_caption = true
[train_mode]
train_mode = "lora"
[general_args.args]
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
pretrained_model_name_or_path = "path/to/checkpoint.safetensors"
vae = "path/to/sdxl_vae.safetensors"
sdxl = true
no_half_vae = true
full_bf16 = true
mixed_precision = "bf16"
gradient_checkpointing = true
seed = 42
max_token_length = 225
prior_loss_weight = 1.0
xformers = true
cache_latents = true
cache_latents_to_disk = true
training_comment = "by trainer"
max_train_epochs = 30
# (1000/NUM_OF_IMAGES)*BATCH_SIZE
[general_args.dataset_args]
resolution = [ 1024, 1024,]
batch_size = 2
# for a RTX3060
[network_args.args]
network_dim = 16
network_alpha = 8.0
min_timestep = 0
max_timestep = 996
[optimizer_args.args]
optimizer_type = "Prodigy"
lr_scheduler = "cosine"
loss_type = "l2"
learning_rate = 1.0
max_grad_norm = 1.0
min_snr_gamma = 5
zero_terminal_snr = true
[saving_args.args]
output_dir = "C:/XL_LORA"
save_precision = "fp16"
save_model_as = "safetensors"
save_last_n_epochs = 1
save_toml = true
save_toml_location = "C:/XL_LORA"
output_name = "lora_name"
[noise_args.args]
multires_noise_iterations = 6
multires_noise_discount = 0.3
[bucket_args.dataset_args]
enable_bucket = true
min_bucket_reso = 512
max_bucket_reso = 2048
bucket_reso_steps = 64
[network_args.args.network_args]
conv_dim = 16
conv_alpha = 8.0
[optimizer_args.args.optimizer_args]
weight_decay = "0.01"
betas = "0.9,0.99"
decouple = "True"
use_bias_correction = "True"
Using Prodigy with these settings is foolproof because Prodigy automatically adjusts the learning rate. Never change the number of repeats or the learning rate with this config; instead, adjust the number of epochs to train for 2000 steps in total (real steps, not what's shown). There is no need to change anything in the optimizer or network arguments. You can train for more epochs, but with these settings you don't need to.
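The comment next to max_train_epochs in the config gives the formula (1000/NUM_OF_IMAGES)*BATCH_SIZE. A small helper can compute it for a given dataset. The function name and the rounding up to a whole epoch are assumptions, as is the reading that "real" steps means the number of images seen (epochs times images, with num_repeats = 1).

```python
import math


def epochs_for_lora(num_images, batch_size=2):
    """max_train_epochs from the config comment: (1000 / NUM_OF_IMAGES) * BATCH_SIZE."""
    # Rounded up to a whole epoch (assumption); never less than one epoch.
    return max(1, math.ceil(1000 * batch_size / num_images))


for num_images in (40, 67, 100, 250):
    epochs = epochs_for_lora(num_images)
    images_seen = epochs * num_images
    print(f"{num_images} images -> {epochs} epochs ({images_seen} images seen)")
```

With the batch size of 2 from the config, the number of images seen stays at or just above 2000, which appears to match the 2000-step target mentioned above.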