
Mona Lisa Overdrive - How To Tailor Your Images To What You Want For Beginners

I got into stable diffusion about 2 months ago, and I'm by no means an expert; I just enjoy tinkering with tech even if I don't fully understand everything. I'm a programmer/web developer by trade, and the most important thing to understand when using a new piece of technology is how to think abstractly: deal with problems at higher abstraction levels and only dig into the finer details if absolutely necessary, so you don't get bogged down by them.

I recently posted this 20 image post featuring the character Trinity from The Matrix. My goal was to transform her character into the highest quality (in my eyes) anime style, to see if I could outdo what was originally done in the movie The Animatrix some 20-odd years ago. I think I came close. So how did I start?

I started with an image that I really liked on Civitai, and one of my favorite starting points is the GLSHS lora by blacksnowskill, one of the several styles listed in their BSS Styles pack, found here:
https://civitai.com/models/550871/bss-styles-for-pony

My starting point (this image is posted directly under the showcase and can be remixed right now):

xXx_11318_.png
I then add one of my favorite anime style loras, the kenva lora from Pony Custom Styles by alexclerick, found here: https://civitai.com/models/366990/pony-custom-styles?modelVersionId=454703

I also added the base prompt "drill hair" to the positive prompt just to make sure everything was working.

Great, it now has the style that I like. I love the lighting from BSS Styles as well as the lens effects, and I also love the raw, gritty, 80s halftone retro feel of the kenva anime style, as well as some of the really interesting architecture it can add.
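If you're scripting outside of ComfyUI or Civitai, the same kind of lora stacking can be sketched with the diffusers library in Python. This is just a rough sketch of the idea, not my actual setup: the checkpoint and lora filenames, adapter weights, and prompt are placeholder assumptions.

```python
# Minimal sketch: stacking two style loras on top of a Pony/SDXL checkpoint
# with diffusers. Filenames and weights below are illustrative assumptions.
# (Multi-adapter loading requires the `peft` package to be installed.)
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "ponyDiffusionV6XL.safetensors",  # assumed local checkpoint file
    torch_dtype=torch.float16,
).to("cuda")

# Load both style loras under their own adapter names.
pipe.load_lora_weights("GLSHS.safetensors", adapter_name="glshs")  # assumed filename
pipe.load_lora_weights("knva.safetensors", adapter_name="knva")    # assumed filename

# Blend them; nudge the weights until neither style drowns out the other.
pipe.set_adapters(["glshs", "knva"], adapter_weights=[0.8, 0.7])

image = pipe(
    prompt="score_9, score_8_up, GLSHS, knva, 1girl, drill hair, portrait",
    negative_prompt="score_4, score_3, low quality",
    num_inference_steps=30,
).images[0]
image.save("styled_portrait.png")
```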

But we've hit a problem. This is still 2B and not the original character Trinity I wished to replicate. I massively alter the prompt, removing things like "mask, drill hair, white hair, dress" and replacing them with tags that better describe Trinity, such as "black hair, full black latex bodysuit, black sunglasses, slicked back hair". I'm left with this:


This is extremely close, and most sane people might stop here, but I'm a masochist and said no, damnit, this isn't close enough. Especially for a portrait showing Trinity up close, her hair is still too long.

Here's where you really, really have to think outside the box. This issue will come up if you dive deep enough into the rabbit hole of trying to recreate specific characters, and in this case that's exactly what I was doing. While Trinity isn't an unknown character, she isn't as popular, probably due to how old the Matrix series is at this point. In the Pony model, if you ask for Trinity (The Matrix) it will draw some random woman who doesn't look like her in the slightest; however, if you ask for a better-known character, say 2B (YoRHa), it does know who she is. This comes down to what the base model was trained on: the original creator obviously put in some tags for 2B, but unfortunately not for Trinity.


I retried searching for "slick back" while still filtering under the Pony model, because I wasn't giving up the two loras I love to incorporate into the style, nor the base checkpoint. I'm not sure how Civitai ranks model results by default, but I do know that it's changeable in the filters section, where you can weight results toward relevance to your search query or toward plain popularity, where you'll see the more popular results first.

Regardless, I went about 50 or so loras down the page searching. Doing this sort of thing has a few consequences. One glaring concern is that, if you're filtering by popularity, these are lesser-known loras, and they can have some dire effects. I'm not a master at lora creation, as I've only ever created a single lora myself, but I do know that a lot can go wrong when applying loras to an image if they aren't well crafted. Their art style can bleed and affect the overall presentation of your image even if you didn't want it to. It all depends on how the model was trained. My point for you is to be dubious of less popular loras, as they can often have dire impacts on the presentation of your images.

Anyway, I had time to kill and was open to trying anything at this point. I could not find any sort of style lora to guide Trinity's hair, so what could I possibly do in this instance?

Think a bit outside the box here: why do I need a hairstyle lora when another character might already have her hairstyle?

I found this lesser-known model for some character I've never even heard of: https://civitai.com/models/601389/lydia-dorfman-mobile-suit-gundam-silhouette-formula-91

The main thing I was concerned about with this lesser-known lora was, as I said: please, please, please only affect the style of her hair, and not the presentation, the rest of her body, or anything else! Going into this I was 99% sure I was wasting my time and would see exactly that happening; however, when I hit generate, this is what popped up:

Holy shit. That's actually Trinity from the Matrix, with extremely subtle kenva hairstyle undertones that I'm willing to accept!

So there you have it: my Trinity combination of loras, some of which had absolutely nothing to do with the character, working together to produce something I wanted.

Let's take this further down the rabbit hole. Let's put her into an action sequence. Should be easy enough through prompting, right? I added some further detail-enhancing loras, as well as a gun helper lora, and changed the prompt to reflect that I no longer wanted a portrait pose but an action sequence. Here's how far I got:

https://civitai.com/images/23548366
ComfyUI_00025_.png

Not bad. I think this is acceptable and up to my standards to post. So I posted it. However, I needed more. I wanted her in action sequences directly engaging with another thing or entity.

This can get extremely difficult for SD to comprehend, for several reasons. The primary one is the concept we'll just call "context". If you ask for two subjects, such as "a man wearing a red shirt with black pants, a woman wearing a white shirt with red pants", you'll very rarely actually get what you wanted. A typical result is something like two men wearing red shirts and red pants. Not what you wanted in the slightest.

The new Flux model is much better at this; however, because Flux is so young, it doesn't have the support for stylistic options that the older SD1 and SDXL models have through loras. If I ask for a "kenva artstyle" in Flux, it has no freaking idea what that is, unfortunately. If you check the Flux images, about 1 out of 50 is someone memeing about Pony support (or a Pony-like modification to the checkpoint, or a Pony-esque lora). For example: Typical Flux/Pony Meme. One can hope that'll be a reality one day, but for now we're stuck with SDXL.

So what do we do? Well, there are a few options (and probably more).

1) We can use a controlnet/inpainting to dictate where the people will be in a specific image; however, my results were pretty bad. Whenever I tried using a controlnet, especially with the Pony model and a massive lora stack, blurred nonsense would show up, or two characters would show up in their intended spots but not actually be interacting. Your results may vary, and I'm not super keen on advanced controlnet usage, but this is just what I witnessed after experimenting with a huge mound of cables in ComfyUI trying to use controlnets with ipadapters and masks.
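For reference, here is roughly what a bare-bones version of this option looks like if you script it with diffusers instead of wiring it up in ComfyUI. It's only a sketch: the model IDs and the pose image are assumptions, not my actual (mostly failed) setup.

```python
# Minimal sketch of option 1: an openpose ControlNet steering where the two
# characters stand. Model IDs are public ones used here as assumptions; the
# pose image filename is a placeholder for a pre-made skeleton/pose map.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "thibaud/controlnet-openpose-sdxl-1.0",      # assumed public openpose controlnet
    torch_dtype=torch.float16,
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # stand-in for the Pony checkpoint
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

pose_image = load_image("two_character_pose.png")  # placeholder pose map

image = pipe(
    prompt="two characters in a pistol shootout, dynamic poses",
    image=pose_image,                     # the controlnet conditioning image
    controlnet_conditioning_scale=0.8,    # how strongly the pose constrains the output
    num_inference_steps=30,
).images[0]
image.save("controlnet_attempt.png")
```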

2) We can use image-to-image. Image-to-image is very good if your goal is to closely reproduce a specific scene; the issue I have with it is that I didn't want to exactly replicate a scene, but maybe you do! In that case, I'll showcase an image-to-image sample I made within ComfyUI. I loaded up an image directly from the original movie where Trinity shoots an agent in the head on a rooftop with a helicopter in the background. In this example I'm using something like 35% denoise on the latent, and I got this:


Not the most terrible thing in the world, but the quality was lacking, and my loras obviously weren't in full force because of the low denoise. It no longer even really looks like anime, but rather a shoddy mix of reality and anime that isn't the best work I can put out. If I increase the denoise, it starts dramatically changing the image to the point where the scene is no longer recognizable. Upping the denoise even slightly, from 0.35 to say 0.4, will drastically change subtle things, for example the agent falling back after getting shot: with even that small an increase (0.05) he's no longer in a falling motion, which defeats the whole point of the shot.
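If you want to try this image-to-image experiment outside of ComfyUI, here's a minimal diffusers sketch; there the denoise knob is called `strength`. The checkpoint, filenames, and prompt are placeholder assumptions, not my exact setup.

```python
# Minimal img2img sketch of the denoise comparison discussed above.
# In diffusers the equivalent parameter is `strength`: ~0.35 keeps the
# original composition, while 0.40+ starts rewriting small details like
# the agent's falling pose. Filenames are placeholders.
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_single_file(
    "ponyDiffusionV6XL.safetensors",   # assumed local checkpoint file
    torch_dtype=torch.float16,
).to("cuda")

init_image = load_image("rooftop_frame.png")  # placeholder frame from the source scene

for strength in (0.35, 0.40):
    image = pipe(
        prompt="knva, GLSHS, anime, Trinity shooting an agent on a rooftop, helicopter",
        image=init_image,
        strength=strength,             # fraction of the diffusion schedule re-run
        num_inference_steps=30,
    ).images[0]
    image.save(f"rooftop_denoise_{strength}.png")
```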

If I could recreate this scene with a 100% denoised empty latent, I'd piss myself, but unfortunately that's just not possible with my lack of understanding of elaborate controlnet setups and the many underlying systems. Also, we're beginners here, right? Using controlnets with ipadapters and inpainting is not the simplest thing in the world, and it's very easy to royally screw up an image into a spattering of colored blurry noise as your output, leaving you there going "hmm, let me check these 25 points of failure: is it the vae? the checkpoint? the adapter? the adapter plugin? the loras? the latent itself? the upscaler? the sdxl tuple? the hires fix ComfyUI node, Joe Blow's terrible python script with japanese comments from 12 months ago that has never been updated?" Yeah, I've been there.

So...


what's left?


3) Yolo mode. We hit the generate button, and hit the generate button again, and then hit it again. This isn't an ad for Civitai, as I personally generate locally when trying for complicated scenes and setups, but unfortunately this approach is what I'm stuck with.

What else can we do besides just hitting the generate button? There are a few tricks that may help you with this yolo approach.

3A) Change image dimensions. If your image is using something like an extreme cinematic portrait of, say, 768x1344, you're going to get wildly different results than with the base SDXL dimensions of 1024x1024. Why? Because the checkpoints and loras involved love trying to recreate compositions they have already seen at those specific dimensions. My point is: experiment, not only with the prompt, but also with things like image dimensions. Unfortunately there are only three dimension presets available directly within Civitai (perhaps that'll change in the future), but if you're generating externally you can experiment further with extremely wide or extremely tall shots that will influence the composition the checkpoint and loras ultimately spit out.
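If you're generating externally with a script, this experiment boils down to re-running the same prompt and seed at a few resolutions. Here's a minimal diffusers sketch of that idea; the checkpoint filename and prompt are placeholder assumptions.

```python
# Quick sketch of the dimension experiment: the same prompt and the same
# seed, rendered at square, portrait, and extreme-portrait resolutions.
# The composition the checkpoint/loras fall back on changes noticeably.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "ponyDiffusionV6XL.safetensors",   # assumed local checkpoint file
    torch_dtype=torch.float16,
).to("cuda")

prompt = "knva, GLSHS, Trinity, black latex bodysuit, action scene"

for width, height in [(1024, 1024), (896, 1152), (768, 1344)]:
    image = pipe(
        prompt=prompt,
        width=width,
        height=height,
        num_inference_steps=30,
        # Re-seed each run so only the dimensions change between images.
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]
    image.save(f"trinity_{width}x{height}.png")
```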

3B) BREAK statements can "somewhat" help when dealing with context swapping. Remember the multi-subject prompt from earlier?


"a man wearing a red shirt with black pants, a woman wearing a white shirt with red pants"

is actually less likely to give you what you want, in comparison to the following:

"a man wearing a red shirt with black pants, BREAK, a woman wearing a white shirt with red pants"

So with that knowledge, let's incorporate it into a prompt where I'm trying my best to guide the checkpoint and loras to produce something I wanted. Notice how I put all the lora trigger words at the top, then the rest of the things I want to see in the picture afterwards:

positive:

score_9, score_8_up, score_7_up, score_6_up, masterpiece, high quality, hires, unreal engine, absurd raytracing, reflections, neon, source_anime, concept art, realistic, knva, halftone, expressiveH, GLSHS, from behind, legs together, 
BREAK, 
pistol shootout featuring 2characters, action sequence, on the run, full sprint, 
BREAK, 
Agent Smith (The Matrix), male, receding hairline, black business suit with a white shirt and black tie, black sunglasses, 
BREAK, 
Trinity (The Matrix), female, extremely thin, full black latex bodysuit, black hair, short hair, forehead, hair slicked back, ((femme fatale)), solid black sunglasses, sunglasses cover eyes, (aiming pistol at agent smiths head), full body, 
BREAK, 
Agent Smith and Trinity are shooting pistols at one another, epic shootout, dynamic poses, action sequence,

negative:

3d, score_4, score_3, score_2, score_1, disfigured, nose, muscular, muscles, jacket, looking at viewer, lips, mouth, floor, loose hair, cape, dress, mouth, moon, planet, shoe, sneaker, bangs, wavey hair, voluminous hair, curly hair, 80s hair, flared hair, ponytail bun, hair band, facing viewer, embedding:negativeXL_D,

After about 50 generations, this popped up:

Okay, sick, that's a pretty nice photo. However, who the hell are the two guys on the inner sides? Oh, that's right, the AI context-swapped. It put Trinity's clothing description on the two guys in the back.

My solution to this: Learn GIMP/Photoshop.

This is a workable photo. How do I know it's workable? Trinity looks good, and two of the agents look good (even if they don't look like Smith, they do look like Agents); I just need to remove the two dunces in the middle. If you can get a shot of just your main two subjects interacting, the rest can be filtered out. EDIT: oh, almost forgot, there's an extra leg poking out of the agent on the right, and that too can be filtered out!

I'm unsure of the Photoshop equivalent, but GIMP has a heal plugin. It doesn't come with the native GIMP install, so you'll have to install it on top of GIMP. After drawing around the unwanted characters with the lasso tool and applying heal, I'm left with this image:

https://civitai.com/images/23549955
trinity5.png
Good enough for me. Ship it.

I know many people don't like having to use external tools, and I'm mostly in that camp because I'm not an artist, I'm a programmer, lol. However, when dealing with highly complicated scenes, whether it's characters, some insane perspective, architecture, or whatever, sometimes the best answer isn't to force the AI to do everything for you, but to work alongside it.

Without me guiding the AI through pretty much everything written in this guide, including editing the photo in post-processing, I wouldn't have been able to get an image as ridiculous and insane as the following:

https://civitai.com/images/23549924
trinity4.png

Let's go even further. I loved this last photo so much I decided to tinker with img2vid. I wasn't the most successful with it in ComfyUI natively, so I decided to check out a few other tools that are very simple for users to experiment with. I unfortunately can't post the video in this article as it's too big to embed directly, but I left the link:

https://civitai.com/images/23766389


Not perfect, but I'm beyond amazed at how easy it was to achieve something like this, especially as a beginner. This video was generated using the whole workflow discussed in this article: prompt fiddling, checkpoint fiddling, lora configurations, and image dimension swapping all within ComfyUI; exporting the image from there; editing the image in GIMP; taking the image to a service offered by Runway; taking that output and cropping the black bars in the basic Microsoft Clipchamp editor; and then uploading it here to Civitai. I don't know if my workflow is bad, but it's what I did to achieve what I wanted.

EDIT:

After speaking with a few people, I've included a workflow for achieving the original Trinity-entering-the-machine-city image. You can see how it adds a few additional people at the bottom that I removed, but here it is, attached as a ComfyUI workflow.json. My workflow is very basic, using mostly Efficiency nodes and little else.
