Images generated with SDXL and DALL-E 3 are logged in <a target="_blank" rel="ugc" href="https://docs.google.com/document/d/16_F83I5sJR4gbCbD4t1wCWUOT-NnMY14X6YblUB76Pc/edit?usp=sharing">this Google Doc</a>. I might add more in the future. Training anime models with Booru tags is common practice, but tag-based captions are inherently ambiguous, which can be troublesome at times and calls for a richer language. I asked GPT-4V (Bing) to caption a few handpicked images, then used those same captions to <a target="_blank" rel="ugc" href="https://docs.google.com/document/d/16_F83I5sJR4gbCbD4t1wCWUOT-NnMY14X6YblUB76Pc/edit?usp=sharing">generate images</a> with SDXL (Clipdrop) and DALL-E 3 (Bing Image Creator). Some thoughts/observations:<ul><li>Though not quite at the same level as DALL-E 3, SDXL can go quite far with natural-language captions in some cases. I'm surprised at how accurate the gens are for the hanging lights image.</li><li>SDXL struggles with more complex prompts. It's still better than random, though.</li><li>DALL-E 3's safety checker is way too aggressive, frequently blocking innocuous prompts.</li><li>GPT-4V's captioning seems to follow a certain template.</li><li>Like most language models, GPT-4V hallucinates.</li><li>Bing blurring faces is annoying.</li></ul>This suggests that training anime models with natural language is a reasonable option (<a rel="ugc" href="https://civitai.com/models/128351/anime-natural-language-xl">as some seem to have already done</a>). Hope we get more base models following this direction.


Comparing Natural Captions with SDXL vs DALL-E 3: Not All That Bad


