On the left hand? My left? His left? Use the right description for Flux + A script to fix autocap

So, Flux is great with prompt adherence, right? Right…

But writing directions can be tricky for the model. How would Flux interpret “A full body man with a watch on his right wrist”? It will most likely output a man in front view with the watch on his LEFT wrist, which sits on the RIGHT side of the image. That’s not what we want.

"Full body shot of a man with a watch on his right wrist" 0 out of 2 here

Sometimes Flux gets it right, but often it doesn’t. And that’s mostly because of how we write our prompts.

A warning first: this is in no way perfect. Based on my experimentation, it helps, but it won’t be 100%.

Describing body parts from the character’s perspective (like “his left”) leads to confusion. Instead, it’s better to use the image’s perspective. For example, say “on the left side” instead of “his left.” Adding “side” helps the model a lot. You can also reference specific areas of the image, like “on the left bottom corner”, “on the top-left corner”, “on the center”, or “on the bottom” of the image.

"Full body shot of a man with a watch on his wrist on the left side" 0.5 out of 2, getting there

NEVER use “his right X” (where X is a body part). “On the left” is already way better than “on his left”, but it still generates a lot of wrong perspectives. More recently I have been experimenting with removing “his/her” from the prompt completely, and I think it works even better.

"Full body shot of a man with a watch on the wrist on the left side" 1 out of 2, better.

Another example would be:

"A warrior man from behind, climbing stepping up a stone. The leg on the left side is extended down, the leg on the right is bent at the knee. He is wearing a magical glowing green bracelet on the hand on the left side. The hand on the right side is holding the sword vertically upward. The background is the entrance of a magical dark cave, with multiple glowing red neon lights on the top-right side corner inside the cave resembling eyes."

Not everything is correct, but it's more consistent.

For side views, when both body parts are on the same side, you can use foreground and background to clarify:

A photo of man in side view wearing an orange tank top and green shorts. He is touching a brick wall arching, leaning forward to the left side. His hand on the background is up touching the wall on the left side. His hand in the foreground is hanging down on the left side.

A photo of man in side view wearing an orange tank top and green shorts. He is touching a brick wall arching, leaning forw... This is way more inconsistent. It's a hit-and-miss most of the time.

Using these strategies, Flux performs better at inference. But what about training with auto captions like Joy Caption? There has been a trend of saying the model doesn’t need captions, but I still don’t buy it. For simple objects or faces, trigger words might be enough, but for complex poses or anatomy, captions still seem important. I haven't tested enough, though, so I could be wrong.

With the help of ChatGPT I created a script that updates all text files in a folder to the format I mentioned. It’s not perfect, but you can tweak it or ask ChatGPT for more body part examples (I also just recently added "to" instead of only "on").
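For reference, here is a minimal sketch of the folder loop such a script needs. It is not my actual script: the function name `rewrite_folder` and the `transform` callback are made up for illustration, and it assumes your captions are UTF-8 `.txt` files sitting directly in one folder.

```python
from pathlib import Path
from typing import Callable, List


def rewrite_folder(folder: str, transform: Callable[[str], str]) -> List[str]:
    """Apply `transform` to every .txt caption in `folder`.

    Files are only rewritten when the text actually changes.
    Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_text(encoding="utf-8")
        updated = transform(original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

You would call it with whatever rewrite function you end up with, for example `rewrite_folder("captions", my_rewrite)`. Back up the folder first, since the files are overwritten in place.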

A simpler and faster option would be to just add “side” after “right/left”. But it would still be ambiguous. For example, “her left side arm” might mean her side, not the image’s side. So you need to handle the whole phrase, preposition included: “on the left leg” > “on the leg on the left side”, “on his left X” > “on his X on the left side”, etc.
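The rewrite rule above can be sketched with a regular expression. This is a rough illustration, not the exact script: the `BODY_PARTS` list and the function name `to_image_side` are assumptions, and you would want to extend the list for your dataset. Note that it only moves the words around. It does not decide whether "left" was correct in the first place.

```python
import re

# Hypothetical list of body parts; extend it for your own captions.
BODY_PARTS = ["hand", "arm", "wrist", "leg", "foot", "shoulder", "knee", "eye", "ear"]

# Matches phrases like "on his left wrist", "on the right leg", "to her left hand".
PATTERN = re.compile(
    r"\b(on|to)\s+(his|her|their|the)\s+(left|right)\s+("
    + "|".join(BODY_PARTS)
    + r")\b",
    re.IGNORECASE,
)


def to_image_side(caption: str) -> str:
    """Rewrite 'on his left wrist' as 'on his wrist on the left side'."""

    def repl(match: re.Match) -> str:
        prep, det, side, part = match.groups()
        return f"{prep} {det} {part} on the {side.lower()} side"

    return PATTERN.sub(repl, caption)
```

For example, `to_image_side("A man with a watch on his left wrist.")` gives `"A man with a watch on his wrist on the left side."`. Plurals like "hands" are not matched by this sketch.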

But another big problem is that Joy Caption and all the other auto captioners are very inconsistent. They often get left and right wrong, probably because of the perspective problem I mentioned. So it’s kind of essential to check manually. That’s why I add <!###-----------###> after each substitution, so I can easily find and check them. Once you are done, you can remove that string with search and replace in Taggui, Notepad++ or another tool.
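If you prefer to handle the marker in Python instead of a text editor, something like this should work. It is only a sketch: the helper names are made up, and it assumes the marker was appended directly after each substitution in UTF-8 `.txt` files.

```python
from pathlib import Path
from typing import List

# The review marker described above.
MARKER = "<!###-----------###>"


def files_needing_review(folder: str) -> List[str]:
    """Return the names of caption files that still contain the marker."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.txt")
        if MARKER in path.read_text(encoding="utf-8")
    )


def strip_markers(text: str) -> str:
    """Remove every marker from a caption once it has been checked."""
    return text.replace(MARKER, "")
```

`files_needing_review` gives you the list to go through by hand, and `strip_markers` cleans a caption after you have verified it.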

But manually switching left and right can be tedious. So I built another tool to make it easier: a floating box for fast text swaps. I organize my window so I can manually check each text file, spot substitutions, and easily swap “left side” and “right side.”

Using the preview panel, I organize my window like this:

I manually click on every txt file. In the preview panel I can easily spot any file that has a substitution by looking for the <!###-----------###> marker, then check whether it is correct. If not, I drag the txt file and easily swap “left side” <> “right side”.
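The swap itself is simple to reproduce in code. Below is a minimal sketch (the function name `swap_sides` is made up, and this is not the floating box tool itself). It swaps in a single pass, because two chained `replace` calls would turn everything into the same side.

```python
import re

_OPPOSITE = {"left": "right", "right": "left"}


def swap_sides(text: str) -> str:
    """Swap every 'left side' with 'right side' and vice versa."""

    def repl(match: re.Match) -> str:
        word = match.group(1)
        swapped = _OPPOSITE[word.lower()]
        # Keep a leading capital if the original word had one.
        if word[0].isupper():
            swapped = swapped.capitalize()
        return swapped + match.group(2)

    return re.sub(r"\b(left|right)(\s+side)\b", repl, text, flags=re.IGNORECASE)
```

Running it twice on the same text gives back the original, so a wrong swap is easy to undo. A bare “left” without “side” is left alone.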