This pipeline works well if you have a general idea of what you want to create. This is an intermediate explanation of how I make images, and it works well for me. It successfully gets what I am trying to create most of the time. Of course, like everyone, we are bound by the limits of the training sets of the models we are using. This isn’t some magic that is going to make SD spit out exactly what you are looking for. But it helps me by adding structure to what I am doing, so I can get as close to the image in my head as possible.
To follow this guide accurately, order your prompt as shown by each step: (Step 1 prompt), (Step 2 prompt), etc.
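The step ordering above can be sketched in code. This is a hypothetical illustration (the helper and step names are mine, not part of any SD tool): each step's tags stay grouped, so you can always trace which step introduced what.

```python
# Hypothetical sketch: assemble a prompt in the order the steps are added.
def build_prompt(*steps):
    """Join each step's tags, keeping step order intact."""
    return ", ".join(tag for step in steps for tag in step)

step1 = ["1girl", "witch", "long black dress"]  # base composition (Step 1)
step2 = ["glowing eyes", "swirling magic"]      # details (Step 2)

prompt = build_prompt(step1, step2)
print(prompt)
# 1girl, witch, long black dress, glowing eyes, swirling magic
```

Keeping the steps as separate lists also makes it easy to roll back a step if it destabilizes the image.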
For this pipeline you want to start only using the base model. For this you want to prompt in the basic composition first before you get into the details. Think of what you want in the image and then break it down to its most basic components. Ignore the details at this step. If you are trying to make something outside of conventional norms, start with the most conventional equivalent (Something that looks similar in shape to what you are trying to make)
For this example, I am making a witch, and I want swirling “magical” energy around her and a simple black background. So, for that I start with the broad strokes. At its most basic level the image needs: a girl, in a black dress, a hat, and a black background. For the prompt you may think <1girl, black background, pointy hat, black dress>, but this can be simplified further. A large number of witches already have pointy hats, so that prompt is not needed and may even hurt the prompt, because (almost) all witches have pointy hats but not all pointy hats are witch hats. So simply prompting <1girl, witch, long black dress> has, in my experience, given better results. Of course, your familiarity with the checkpoint is going to play a big part in how you prompt, so don’t take my word for it.
This gives you a stable starting point; for this method to work, it should give good results 99% of the time. A good way to test is to generate many images and make sure the prompt is stable. What you don’t want at this stage is any major weird anatomy or corruption (anything taking up a large amount of the image).
After you have the basic composition, think about the details of the image: what makes a witch a witch? Magic, books, odd trinkets? Perhaps glowing eyes? These are the questions to ask about the thing you are trying to make. This is where the rubber meets the road, so to speak. You want to add words to your prompt that correlate with images containing the details you want. You can think of this in many different ways, but the three that work best for me are:
1. Direct prompting (prompting the exact things you want). This works well for things the checkpoint has a lot of data on, but fails when trying to prompt things the checkpoint does not have much training data on.
2. Indirect prompting (prompting things where a portion of the idea is relevant to the image I am making). This is a bit more difficult to get right and can lead to instability in the generated images, especially if the concepts the prompt represents are too far apart visually. This can be really cool, because you can get images that are wildly far from what the words mean.
3. Negative prompting (prompting things you don’t want to see). When you add general prompts – hat, dress, hair – these kinds of prompts include concepts you may not want in your image. For example, the training data for a particular hairstyle may include a hair tie you don’t want to see. You can negative-prompt to counteract it. Keep in mind that something as specific as <hair tie> may not be in the training data, so you might need to get creative with some indirect negative prompting. Things like <hair accessory> or <scrunchie> may work better.
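The three strategies above map onto the two prompt fields most SD front ends expose: direct and indirect tags go in the positive prompt, and the counteracting tags go in the negative prompt. A minimal sketch, with made-up example tags:

```python
# Hypothetical sketch: organizing the three strategies into the positive
# and negative prompt fields that most SD UIs expose.

direct   = ["witch", "spell book"]          # things the checkpoint knows well
indirect = ["aurora", "ink in water"]       # borrowed concepts for the magic look
negative = ["hair accessory", "scrunchie"]  # indirect negatives for the hair tie

positive_prompt = ", ".join(direct + indirect)
negative_prompt = ", ".join(negative)

print(positive_prompt)  # witch, spell book, aurora, ink in water
print(negative_prompt)  # hair accessory, scrunchie
```

Keeping the three groups separate makes it obvious which strategy each tag belongs to when you later hunt for the source of corruption.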
Keep in mind that the training data carries the same biases and lingo as the person who created it. One person may call a concept one thing while another person calls it something else. If you can figure out what community that person is a part of, you can work out what they would call something based on the lingo used in those circles.
The most extreme example I can think of is photographers and furries. A photographer may tag their checkpoint with photography techniques, the type of camera used, the names of poses, and levels of detail, while furries may tag theirs with art styles, emotions, and their own names for poses.
Breaking this down even further, the process goes something like this:
Person looks at image > matches words with concepts based on their lingo and biases > Stable Diffusion uses that data to make an image. So by “getting in the head” of the creator, you gain insight into how they made the checkpoint, which lets you more accurately prompt the concepts you want. Reading the description and looking at examples of the checkpoint creator’s generated art is a good way to understand how they think. If you still find yourself struggling to prompt the things you want, it may be a cultural disconnect. Try to understand the culture and community that person is a part of – read posts and get familiar with the lingo – or move to a model that better matches your understanding of word-concepts.
In this example, the model I am using (Rev Animated) does not have a lot of magic training data, so I looked for a LoRA that had the style of magic I was looking for. For your image, try to add things that visually push it closer to your “ideal image”. You can add celebrities at different weights to get different faces, add LoRAs for concepts your checkpoint has no data on, use OpenPose to get the pose you want, or any number of things. Go wild, but remember that you are working with the base you made in step 1, and getting too far from that base conceptually will add instability to the generated images.
Finding the right seed
Once you have the details right, go seed hunting. I like to generate images in large batches until I find one I like, then proceed to the next step.
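Seed hunting is just a loop over random seeds with the rest of the parameters frozen. A rough sketch of the bookkeeping (the `generate()` call is a stand-in, not a real API; only the seed handling is the point):

```python
import random

# Hypothetical sketch: pick random seeds for a batch, note the keepers,
# then reuse a keeper's seed for the cleanup steps that follow.

def seed_batch(n, rng=None):
    """Return n random 32-bit seeds, one per image in the batch."""
    rng = rng or random.Random()
    return [rng.randrange(2**32) for _ in range(n)]

seeds = seed_batch(8)
# for seed in seeds:
#     image = generate(prompt, seed=seed)  # stand-in for your UI/API
keepers = seeds[:1]  # the seeds whose images you liked
```

The important habit is writing the keeper seeds down: every later step (img2img, Hires fix, upscaler comparisons) only works as a controlled experiment if the seed stays fixed.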
When you have the details right, you may still have a fair bit of corruption, especially when you have prompted images with little conceptual overlap. There are many ways of cleaning up the image, and this is by no means an exhaustive guide, but some things that work well for me are:
Running the image through img2img – at <0.4–0.6 denoising strength, resize by 2, and the same prompt> this will change ALL the details of the image (the amount is controlled by denoising strength). Adjust denoising until only the details you want changed are fixed. Changing the prompt and the weights within the prompt is also a useful tool for adjusting the fine details.
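Finding the right denoising strength is a sweep over that 0.4–0.6 range with everything else held constant. A small sketch of generating the settings to try (the parameter names here mirror common UI fields but are my own labels, not a real API):

```python
# Hypothetical sketch: evenly spaced denoising strengths across the
# 0.4-0.6 range, to find the point where only broken details get redrawn.

def denoise_sweep(lo=0.4, hi=0.6, steps=5):
    """Evenly spaced denoising strengths from lo to hi, inclusive."""
    return [round(lo + (hi - lo) * i / (steps - 1), 3) for i in range(steps)]

settings = [{"denoising_strength": d, "resize": 2} for d in denoise_sweep()]
print([s["denoising_strength"] for s in settings])
# [0.4, 0.45, 0.5, 0.55, 0.6]
```

Run the same seed at each strength and keep the lowest value that fixes the corruption; going higher than necessary redraws details you wanted to keep.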
Using Hires fix – I just started using this, and from my experience it would be good to have it on from step 1, as it significantly impacts the final image. Using a fixed seed and running Hires fix with different upscalers is a pretty cool way to play with the final image.
Upscaling – if you have very little corruption you can skip straight to this step. It’s pretty simple: experiment with different upscalers to get the image that looks best to you.
Things of note:
If your image becomes corrupted, you can always get the generation data from a stable prompt and retrace your steps. Once you get good at matching concepts to images, you will be able to figure out which prompts are causing the corruption.
A good exercise for beginners is to copy prompts from images made with wildly different models and compare the image the prompt came from with the one your checkpoint created.
If a word is not in the training data for the checkpoint, it may omit part of or the whole prompt. For example: the prompt is <red dress>, and the checkpoint does not have "red dress" in its training data.
Outcome 1 - dress is in the training data so it only generates a dress
Outcome 2 - red is in the training data but not dress so it makes other things red
Outcome 3 - neither red nor dress is in the training data so it ignores the prompt
This can get complex when you use descriptors (e.g. torn, fluffy, velvet, regal), as the outcome can be heavily influenced by bias and the subjective nature of describing something.
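The three outcomes above can be modeled as a toy coverage check. This is a simplification of how SD actually handles unknown words, but it captures the mental model (the vocabulary here is made up for illustration):

```python
# Toy sketch of the three outcomes: which parts of a prompt the model
# can act on, given a made-up set of concepts it was trained on.

def prompt_coverage(prompt_tags, known_concepts):
    known = [t for t in prompt_tags if t in known_concepts]
    if len(known) == len(prompt_tags):
        return "fully understood"
    if known:
        return f"partial: only {known} has effect"  # outcomes 1 and 2
    return "ignored"                                # outcome 3

vocab = {"dress", "hat", "blue"}
print(prompt_coverage(["red", "dress"], vocab))   # partial: only ['dress'] has effect
print(prompt_coverage(["red", "velvet"], vocab))  # ignored
```

In practice you can't inspect a checkpoint's "vocabulary" directly; you discover coverage the way the guide describes, by generating batches and watching which tags actually change the image.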
A resource you are using may be inherently unstable: a prompt may correlate to many subjectively different images, so the output of two images with the same parameters but different seeds can be wildly different. The stability of the image can be thought of as how close the image, concept, and prompt are to each other. When all three are working together, you get very stable generation.
This instability is worst for things that are either:
1. hard to describe (aliens/monsters),
2. lacking the words to describe them (how do you describe the visual difference between a grasshopper and a bee?), or
3. a prompt with large variation in images (a mustang is a car, but it is also a horse).
Why this is important: let's say you want to make a car that looks like a bee. You can take a direct approach by just prompting <car, bee>, but this may not create what you are looking for. Breaking it down and taking an indirect approach would be better, but if the checkpoint does not break the bee and car down into small enough concepts to put together, and there is no LoRA for it, then you are out of luck.
I would want to break it down into its smallest parts – wheels, black and yellow stripes, windshield, mustangcar, etc. – but the resources I am using may not get specific enough to be able to put these parts together...
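The decomposition above can be sketched as data: group the target's smallest visual parts and join them into an indirect prompt. The part names here are made-up examples, not tags known to work in any particular checkpoint:

```python
# Hypothetical sketch of the indirect approach: decompose "a car that
# looks like a bee" into smaller visual parts and prompt those instead.

parts = {
    "shape":  ["car", "rounded body"],
    "colors": ["black and yellow stripes"],
    "detail": ["large round headlights"],  # stand-in for compound eyes
}
indirect_prompt = ", ".join(tag for group in parts.values() for tag in group)
print(indirect_prompt)
# car, rounded body, black and yellow stripes, large round headlights
```

Whether this works still depends on the checkpoint having each part as a concept; the grouping just makes it easy to swap out the part that isn't landing.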