There is much discussion about the importance of, and pain involved in, captioning images for Stable Diffusion models. All of us who fine-tune models know well that current auto-tagging systems like WD14 and BLIP are not very useful (though somewhat helpful) and are often too repetitive and generic. With the recent launch of OpenAI's ChatGPT-4 multimodality, we quickly began experimenting with ways to ease the pain and process of captioning. We hope you find this helpful and that it eases the task. Please share any improvements you make so that the SD community can benefit. Thank you. (www.digitalaudrey.com)
PS: We want to credit the author of the instructions below, which were found in a Reddit post. Unfortunately, the author's name has been deleted and we don't know who they are. We thank this person for the contribution; if anyone identifies the author, please let us know so we can provide proper credit.
Now, create a new session and paste the following text into your ChatGPT-4 prompt (subscribers only):
Read the following instructions for captioning datasets for Stable Diffusion training purposes.
Make sure to follow the format recommended in the instructions. The command /new will signal you to start a new project; follow it with a Q&A asking for the keyword for the Globals parameters, and then caption every uploaded image according to the learned instructions.
Instructions for Captioning Image Datasets for Stable Diffusion Training Purposes
Captioning – General
- Describe in as much detail as you can everything that isn't the concept you are trying to implicitly teach. In other words, describe everything that you want to become a variable. Example: if you are teaching a specific face but want to be able to change the hair color, describe the hair color in each image so that "hair color" becomes one of your variables.
- Don't describe anything (beyond a class-level description) that you want to be implicitly taught. The thing you are trying to teach shouldn't become a variable. Example: if you are teaching a specific face, you should not describe that it has a big nose. You don't want the nose size to be a variable, because then it isn't that specific face anymore. However, you can still caption "face" if you want to, which provides context to the model you are training. This has some implications, described in the following point.
- Using generic class tags biases that entire class towards your training data, and provides context to the learning process. Conceptually, it is easier to learn what a "face" is when the model already has a reasonable approximation of "face". If you want to bias the entire class of your model towards your training images, use broad class tags rather than specific tags. Example: if you want to teach your model that every man should look like Brad Pitt, your captions should contain the tag "man" but should not be more specific than that. This influences your model to produce a Brad Pitt-looking man whenever you use the word "man" in your prompt, and allows your model to draw on and leverage what it already knows about the concept of "man" while it is training. If you want to reduce the impact of your training on the entire class, include specific tags and de-emphasize class tags.
- Example: if you want to teach your model that only "ohwxman" should look like Brad Pitt, and you don't want every "man" to look like Brad Pitt, you would not use "man" as a tag, tagging it only with "ohwxman". This reduces the impact of your training images on the tag "man" and strongly associates your training images with "ohwxman". Your model will draw on what it knows about "ohwxman", which is practically nothing (see note), thus building up knowledge almost solely from your training images, which creates a very strong association.
- Try to avoid repetition wherever possible. As with prompting, repeating words increases the weighting of those words. As an example, I often find myself repeating the word "background" too much. I might have three tags that say "background" (example: simple background, white background, lamp in background). Even though I want the background to have low weight, I've unintentionally increased its weighting quite a bit. It would be better to combine or reword these (example: simple white background with a lamp).
- Order matters for the relative weighting of tags. Having a specific structure/order that you generally use for captions can help you maintain relative weightings of tags between images in your dataset, which should be beneficial to the training process. A standardized ordering also makes the whole captioning process faster as you become familiar with captioning in that structure.
- Use descriptive words, but avoid words that are too obscure or niche, since you likely can't leverage much of the existing knowledge. Example: you could say "sarcastic" or you could say "mordacious". Stable Diffusion has some idea of what "sarcastic" conveys, but it likely has no clue what "mordacious" is.

General format
<Globals> <Type/Perspective/"Of a..."> <Action Words> <Subject Descriptions> <Notable Details> <Background/Location> <Loose Associations>

Globals
This is where I would stick a rare token (e.g. "ohwx") that I want heavily associated with the concept I am training, or anything that is both important to the training and uniform across the dataset. Examples: man, woman, anime

Type/Perspective/"Of a..."
Broad descriptions of the image to supply context. I usually do this in "layers".
- What is it? Examples: photograph, illustration, drawing, portrait, render, anime
- Of a...? Examples: woman, man, mountain, trees, forest, fantasy scene, cityscape
- What type of X is it (X = the choice above)? Examples: full body, close up, cowboy shot, cropped, filtered, black and white, landscape, 80s style
- What perspective is X from? Examples: from above, from below, from front, from behind, from side, forced perspective, tilt-shifted, depth of field

Action Words
Descriptions of what the main subject is doing or what is happening to the main subject, or general verbs applicable to the concept in the image. Describe in as much detail as possible, with a combination of as many verbs as you want. The goal is to turn all the actions, poses, and whatever else is actively happening into variables (as described in "Captioning – General" above) so that, hopefully, SD is better able to learn the main concept in a general sense rather than only learning the main concept doing specific actions.
Using a person as an example: standing, sitting, leaning, arms above head, walking, running, jumping, one arm up, one leg out, elbows bent, posing, kneeling, stretching, arms in front, knee bent, lying down, looking away, looking up, looking at viewer
Using a flower as an example: wilting, growing, blooming, decaying, blossoming

Subject Descriptions
As much description about the subject as possible, without describing the main concept you are trying to teach. Once again, think of this as picking out everything that you want to be a variable in your prompt.
Using a person as an example: white hat, blue shirt, silver necklace, sunglasses, pink shoes, blonde hair, silver bracelet, green jacket, large backpack
Using a flower as an example: pink petals, green leaves, tall, straight, thorny, round leaves

Notable Details
I use this as a sort of catch-all for anything that I don't think is quite "background" (or something that is background but that I want to emphasize) but also isn't the main subject. Normally the part of the caption going in this spot is unique to one or just a few training images. I predominantly use short Danbooru-style captions, but if I need to describe something more complex I put it here. For example, in a photo at a beach I might put "yellow and blue striped umbrella partially open in foreground". In a portrait I might put "he is holding a cellphone to his ear".

Background / Location
Fairly self-explanatory. Be as descriptive as possible about what is happening in the image's background. I tend to do this in a few "layers" as well, narrowing down to specifics, which helps when captioning several photos. For example, for a beach photo I might put (separated by the three "layers"):
- outdoors, beach, sand, water, shore, sunset
- small waves, ships out at sea, sandcastle, towels
- the ships are red and white, the sandcastle has a moat around it, the towels are red with yellow stripes

Loose Associations
This is where I put any final loose associations I have with the image. This could be anything that pops into my head, usually "feelings" I get when looking at the image or concepts I feel are portrayed; really, anything goes here as long as it exists in the image. Keep in mind this is for loose associations. If the image very obviously portrays some feeling, you may want it tagged closer to the start of the caption for higher weighting. For example: happy, sad, joyous, hopeful, lonely, sombre
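The recommended tag ordering above can be sketched as a small Python helper. This is an illustrative sketch, not part of the original guide; the field names and example tags are placeholders chosen to mirror the sections described here:

```python
# Fixed section order from the general format, so relative tag
# weighting stays consistent across every caption in the dataset.
CAPTION_ORDER = [
    "globals",
    "type_perspective",
    "action_words",
    "subject_descriptions",
    "notable_details",
    "background_location",
    "loose_associations",
]

def build_caption(fields: dict) -> str:
    """Join per-section tag lists into one comma-separated caption,
    preserving the fixed section order. Missing sections are skipped."""
    parts = []
    for section in CAPTION_ORDER:
        parts.extend(fields.get(section, []))
    return ", ".join(parts)

caption = build_caption({
    "globals": ["ohwx", "woman"],
    "type_perspective": ["photograph", "full body", "from side"],
    "action_words": ["standing", "arms above head"],
    "subject_descriptions": ["blue shirt", "blonde hair"],
    "background_location": ["outdoors", "beach", "sunset"],
})
print(caption)
# ohwx, woman, photograph, full body, from side, standing, arms above head, blue shirt, blonde hair, outdoors, beach, sunset
```

Keeping the order in one list (rather than hand-typing each caption) is one way to guarantee the standardized ordering the guide recommends.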
FULL EXAMPLE OF A SINGLE IMAGE
This is an example of how I would caption a single image I picked off of safebooru. We will assume that I want to train the style of this image and associate it with the tag "ohwxStyle", and we will assume that I have many images in this style within my dataset.
Globals: ohwxStyle
Type/Perspective/"Of a...": anime, drawing, of a young woman, full body shot, from side
Action words: sitting, looking at viewer, smiling, head tilt, holding a phone, eyes closed
Subject description: short brown hair, pale pink dress with dark edges, stuffed animal in lap, brown slippers
Notable details: sunlight through windows as lighting source
Background/location: brown couch, red patterned fabric on couch, wooden floor, white water-stained paint on walls, refrigerator in background, coffee machine sitting on a countertop, table in front of couch, bananas and coffee pot on table, white board on wall, clock on wall, stuffed animal chicken on floor
Loose associations: dreary environment

All together: ohwxStyle, anime, drawing, of a young woman, full body shot, from side, sitting, looking at viewer, smiling, head tilt, holding a phone, eyes closed, short brown hair, pale pink dress with dark edges, stuffed animal in lap, brown slippers, sunlight through windows as lighting source, brown couch, red patterned fabric on couch, wooden floor, white water-stained paint on walls, refrigerator in background, coffee machine sitting on a countertop, table in front of couch, bananas and coffee pot on table, white board on wall, clock on wall, stuffed animal chicken on floor, dreary environment

The best part is that I can set all of those "global" tags in BDTM to apply to all of my images. All of those tags are then ready just a double-click away, so if my next image is also a full body shot, from the side, sitting... I just double-click them. Much easier than typing it all out again.