
How the T5 text encoder's embedded censorship affects Flux image generation

1) Some Background

After making a Reddit post (https://www.reddit.com/r/StableDiffusion/comments/1iqogg3/while_testing_t5_on_sdxl_some_questions_about_the/) sharing my accidental discovery of T5 censorship while working on merging T5 and clip_g for SDXL, I saw another post mentioning Pile T5, which was trained on a different dataset and is uncensored.

So I became curious and decided to port Pile T5 to the T5 text encoder. Since Pile T5 was not only trained on a different dataset but also uses a different tokenizer, completely replacing the existing T5 text encoder with it wasn't possible without substantial fine-tuning. Instead, I merged Pile T5 into T5 using SVD.
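
For illustration, here is a minimal sketch of what an SVD-based merge of two same-shaped weight matrices could look like in PyTorch. The exact recipe isn't spelled out here, so the interpolation scheme, the optional rank truncation, and the file paths are all assumptions.

```python
import torch

def svd_merge(w_base: torch.Tensor, w_donor: torch.Tensor,
              alpha: float = 0.5, rank=None) -> torch.Tensor:
    """Merge a donor weight matrix into a base one in SVD space.

    Decomposes the donor-minus-base difference, optionally truncates it
    to the strongest singular directions, and adds a scaled portion back
    onto the base. One plausible recipe, not necessarily the exact one used.
    """
    delta = w_donor - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    if rank is not None:  # keep only the top-rank singular directions
        u, s, vh = u[:, :rank], s[:rank], vh[:rank, :]
    return w_base + alpha * (u @ torch.diag(s) @ vh)

# Apply layer by layer, skipping tensors whose shapes don't line up
# (Pile T5 uses a different tokenizer, so the embedding tables differ).
base = torch.load("t5xxl_state_dict.pt")        # hypothetical paths
donor = torch.load("pile_t5_state_dict.pt")
merged = {
    k: svd_merge(v, donor[k]) if k in donor and donor[k].shape == v.shape else v
    for k, v in base.items()
}
torch.save(merged, "pile_t5_merged_state_dict.pt")
```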


2) Initial Testing

I didn't have high expectations, given the massive difference in training data and tokenization between T5 and Pile T5. To my surprise, the merge worked well. It also revealed some interesting aspects of what the Flux Unet didn't learn or understand.

At first, I wasn't sure whether the merged text encoder would work at all, so I started with fairly simple prompts.

After several comparison generations (see the sketch after this list), I noticed the following differences:
a) Female form factor differences

b) Skin tone and complexion differences

c) Depth of field differences
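
For anyone who wants to reproduce this kind of A/B comparison, below is a rough sketch using the Hugging Face diffusers library, where the T5 component is exposed as text_encoder_2 on FluxPipeline. The seed is fixed so that only the text conditioning differs between the two images; the merged-encoder path is hypothetical.

```python
import torch
from diffusers import FluxPipeline
from transformers import T5EncoderModel

prompt = "a portrait of a woman standing by a window, soft light"
seed = 42

# Baseline pipeline with the stock T5-XXL encoder.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
baseline = pipe(
    prompt, generator=torch.Generator("cuda").manual_seed(seed)
).images[0]

# Swap in the merged encoder (hypothetical path) and regenerate with
# the same seed, so any difference comes from the text conditioning.
merged_t5 = T5EncoderModel.from_pretrained(
    "./pile-t5-merged", torch_dtype=torch.bfloat16
).to("cuda")
pipe.text_encoder_2 = merged_t5
merged = pipe(
    prompt, generator=torch.Generator("cuda").manual_seed(seed)
).images[0]

baseline.save("baseline.png")
merged.save("merged.png")
```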


3) Pushing the Boundaries

Since the merged text encoder was functioning as intended, I began pushing the prompts toward the point where the built-in censorship would kick in and affect the generated image.

Sure enough, differences began to emerge, and I made some interesting findings about the Flux Unet:
a) It knows the bodyline flow, or contour, of the human body.

b) In certain parts of the body, it struggles to fill the area and often falls back on a solid-color texture.

c) If a prompt is pushed into territory where the built-in censorship kicks in, image generation with the regular T5 text encoder degrades noticeably.


Another interesting observation: certain words, such as 'girl', when combined with censored words, are treated differently by the two text encoders.
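
One way to quantify "treated differently" is to encode the same prompt with both encoders and compare the resulting embeddings. Below is a rough sketch using mean-pooled cosine similarity; the model names and merged-encoder path are assumptions, and note that Flux actually consumes the full token-level hidden states rather than a pooled vector, so this is only a coarse probe.

```python
import torch
import torch.nn.functional as F
from transformers import T5Tokenizer, T5EncoderModel

def embed(model, tokenizer, text):
    """Mean-pooled last-hidden-state embedding for a prompt."""
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

tok = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
stock = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")
merged = T5EncoderModel.from_pretrained("./pile-t5-merged")  # hypothetical

# A low similarity for one phrasing but not the other would suggest the
# encoders diverge on that particular word combination.
for prompt in ["a girl on a beach", "a woman on a beach"]:
    sim = F.cosine_similarity(
        embed(stock, tok, prompt), embed(merged, tok, prompt), dim=0
    )
    print(f"{prompt!r}: cosine similarity between encoders = {sim.item():.4f}")
```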


Before this, I had never imagined the extent to which a censored text encoder impacts image generation. Keep in mind that this test was done with a text encoder component alien to Flux, one that should be far inferior to the native text encoder Flux was trained with.
