A trained adapter that allows Gemma-3-1b to be used as the text encoder for Rouwei 0.8
What is it:
A drop-in replacement for the SDXL text encoders that uses the power of an LLM for prompt understanding and conditioning creation.
Similar in spirit to ELLA, SDXL-T5, and likely others, but this one is focused on anime models and advanced knowledge without censorship.
Why is it important:
SDXL has proven to be a good and flexible model that can generate results with great aesthetics and variety at relatively low compute cost and high speed. But its prompt adherence is significantly limited by the use of CLIPs. Also, dealing with prompts longer than 75 tokens requires splitting them, which can distort the original meaning.
Replacing the CLIPs with something newer can potentially improve SDXL's understanding of complex prompts significantly and allow more control, while maintaining its existing benefits. Extra inputs such as images, coordinates, poses from openpose, individual prompts for each character, etc. can also be implemented and work in synergy with the main prompt.
How does it work:
The text prompt is processed by the LLM; the hidden states from its last layer are then processed by the adapter, which compensates for causal attention and reshapes them into conditions for the SDXL unet.
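The data flow can be sketched roughly as follows. The real adapter architecture is not published, so the layer shapes here (Gemma-3-1b's 1152-dim hidden states, SDXL's 2048-dim cross-attention conditions, and 1280-dim pooled vector) are illustrative assumptions, not the actual implementation:

```python
import torch
import torch.nn as nn

class LLMAdapter(nn.Module):
    """Hypothetical sketch of the adapter's data flow: project LLM hidden
    states into per-token conditions for the SDXL unet plus a pooled vector.
    The real adapter (which also compensates for causal attention) is more
    involved than two linear layers."""

    def __init__(self, llm_dim=1152, cond_dim=2048, pooled_dim=1280):
        super().__init__()
        self.proj = nn.Linear(llm_dim, cond_dim)    # per-token cross-attention conditions
        self.pool = nn.Linear(llm_dim, pooled_dim)  # pooled vector (stands in for CLIP's pooled output)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, llm_dim) from the LLM's last layer
        cond = self.proj(hidden_states)
        pooled = self.pool(hidden_states.mean(dim=1))
        return cond, pooled

# Toy usage with random states standing in for Gemma's last hidden layer:
adapter = LLMAdapter()
fake_hidden = torch.randn(1, 77, 1152)
cond, pooled = adapter(fake_hidden)
print(cond.shape, pooled.shape)  # torch.Size([1, 77, 2048]) torch.Size([1, 1280])
```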
Why gemma-3:
Simply because it is a reasonably capable and small model for experiments. It will likely be replaced with qwen-vl or some other model during further development.
And do not worry, there is no censoring or refusals of the kind that can appear during LLM inference. The scheme uses only the hidden states, which represent the model's "understanding".
What it can do (now):
First of all: in its current state this is more a proof of concept than a finished tool. Considering the training budget, it is a miracle that it works at all.
Processing of booru tags, just as you are used to prompting
Processing of natural-language prompts, including very short ones and very long ones up to 512 tokens (Gemma tokenizer)
Structured prompts using markdown, XML, JSON, or any other formatting to specify what goes where
Any combination of these
No tag bleeding, as long as it understands what you are describing
So it can work as a standard text encoder, but provides much deeper understanding of long expressions and can hold more conditions without them dissolving into each other.
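For example, a structured prompt could look like the following (the section names are arbitrary and purely illustrative; any consistent formatting should work):

```
## Character
1girl, silver hair, red eyes, black dress

## Background
night city street, rain, neon lights

## Style
watercolor, soft shading
```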
What it can't do (yet):
May struggle with very complex prompts
Knowledge is inconsistent: it can recognize some very rare characters yet mess up some more popular ones
The same goes for styles
Using some artist styles may negatively affect prompt understanding, leading to parts of the prompt being ignored
Cannot yet generate decent-quality text
Emphasis (tag weights:1.1) and typical spells are not supported
All of these will be solved with further training. The first requires unet training; 2-4 require training the LLM, because it simply does not know these concepts, so such words produce too weak a reaction; 5 just requires more training (and a corresponding dataset) and will be solved soon; 6 requires improvements to the custom nodes and will be added soon.
How to run:
1. Install custom nodes for Comfy
1a. Make sure you have updated Transformers to a version that supports gemma-3
2. Download the adapter (the uploaded checkpoint g3-1b_27k, also available on HF) and put it into /models/llm_adapters
3. Download gemma-3-1b-it (non-gated mirror) and put it into /models/LLM/gemma-3-1b-it
3a. You need all the files from the original model, not only the .safetensors (the easiest way to download them is via HF-Hub)
4. Download a Rouwei checkpoint (vpred, epsilon, or base) if you don't have one yet
5. Use this workflow as a reference, feel free to experiment
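Step 3a matters because Transformers loads the tokenizer and config files alongside the weights. A quick sanity check like the following can catch an incomplete download; note the file list here is a typical one, not an official manifest, and may differ between mirrors:

```python
import os
import tempfile

# Typical files shipped with gemma-3-1b-it; the exact list may vary by mirror,
# so treat this as an illustrative check, not an authoritative manifest.
REQUIRED = [
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "model.safetensors",
]

def missing_files(model_dir):
    """Return which of the expected files are absent from model_dir."""
    present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
    return [name for name in REQUIRED if name not in present]

# Example: a folder containing only the weights is incomplete.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "model.safetensors"), "w").close()
    print(missing_files(d))  # config.json, generation_config.json and tokenizer files
```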
Custom nodes paths
The custom nodes have been updated; currently there shouldn't be any issues related to model paths.
Prompting:
Use it as you usually do, with just a few points in mind:
As already mentioned, using artist tags might not give you decent results and can also affect prompt understanding, causing some parts to be ignored. This is partially mitigated by putting them at the very end of your prompt
Move the most important and complex parts to the beginning if you suspect they are being ignored
Feel free to use extra descriptions for poses, actions, and objects
At the same time, it is better to reduce the number of "filler tags" and phrases that repeat what is already clear
Unlike with CLIP, any misspelling here will lead to the word being completely ignored or to wrong results
The current custom nodes do not support prompt weights or the standard spells. Brackets should be left as-is; there is no need to escape them with \
Other settings and recommendations are the same as for the original RouWei
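Putting these points together, a prompt following the recommendations might look like this (the scene and the artist tag are placeholders; note the important parts come first, the artist tag goes last, and there are no weights or escaped brackets):

```
2girls sitting back to back on a rooftop at sunset,
the girl on the left holds a sketchbook, the girl on the right points at the sky,
city skyline, warm lighting, wind,
by artist_name
```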
Quality tags:
Positive:
masterpiece
or best quality
You can use both, but it is unlikely to give an improvement; you can also just omit them. Keep the prompt clean and avoid extra "magic combinations", since they will likely have a negative effect. Quality tags can be placed at the very end.
Negative:
worst quality
or low quality
Same as for the positive prompt: better to keep it clean, only adding things you specifically don't want to appear in this image, rather than things you dislike in general.
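A minimal example following these recommendations (the specific content tags are illustrative):

```
Positive: 1girl, reading a book in a library, sunlight through windows, masterpiece
Negative: worst quality, text
```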
Knowledge:
It knows popular characters, can mimic artist styles, and understands concepts and other things. However, this is limited by the LLM, which needs to be trained at later stages to learn everything. Some more general things are also limited by the current dataset, which consists of anime pictures, and by the unet's abilities.
Compatibility:
Designed to work with Rouwei; it should also work with its merges and tunes. It may have limited compatibility with Illustrious models, Noobai, and other SDXL checkpoints.
Near future plans:
More training, including the LLM and unet
Further development of the custom nodes' functionality
Training code
...
Training budget:
3 liters of beer, 0.5 liters of coffee, and a few days on a 3x5090 rig.
Want to help or cooperate:
Join the Discord server, where you can share your thoughts, make proposals, requests, etc. You can also write to me directly here, or DM me on Discord.
Donations:
BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
Thanks:
NeuroSenko (code), Rimuru (idea, discussions)
Also many thanks to those who supported me before:
A number of anonymous persons, Bakariso, dga, Fi., ello, K., LOL2024, NeuroSenko, OpenRoot-Compute, rred, Soviet Cat, Sv1., T., TekeshiX