Intro
Modern models don't know that many human body poses. Some know more than others - like ponyDiffusion - but its knowledge is quite... specific. And if you have a particular pose in mind, you can't always describe it in full detail, including view angles and such. Even when you can describe it in a way a human would understand, the model won't always follow. Of course there are LoRAs, but even they aren't all-powerful. So there's a use for the IMG2IMG tool and reference images to feed it.
My "data preparation"
I personally keep a collection of snapshots that I take from everything you can screenshot - videogames, movies, TV shows, YouTube videos, TikToks, camgirl streams and of course classic porn. That's in addition to the usual pics you can grab from any corner of the internet.
I grab shots of battle scenes, where you can see people in interesting superheroic poses from unusual camera angles, shots of CGI effects, landscapes - basically anything interesting. Images take up almost no space on modern drives.
I store these snapshots sorted and resized, so when I cut a person out of a frame the resulting crop isn't very large - but thankfully SDv1.5 models can get a good grip on even a 400*300 px image.
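If you want to automate that "store resized" part, here's a minimal Pillow sketch. The folder names and the 1280 px cap are just my own choices for the example, not anything this workflow requires:

```python
# Minimal sketch: shrink stored snapshots so the longer side stays manageable.
# The 1280 px cap and the folder names are arbitrary choices for this example.
from pathlib import Path
from PIL import Image

SRC = Path("snapshots_raw")
DST = Path("snapshots_resized")
MAX_SIDE = 1280  # assumption: small enough to save space, crops still end up above ~400*300 px

DST.mkdir(exist_ok=True)
for path in SRC.glob("*.png"):
    img = Image.open(path)
    img.thumbnail((MAX_SIDE, MAX_SIDE), Image.LANCZOS)  # preserves aspect ratio
    img.save(DST / path.name)
```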
Workflow
So my pipeline is really simple.
1. I cut the person out of the frame. I always try to use standard aspect ratios: 2:3 (0.666), 3:4 (0.750) or 1:1 (1.0). With something odd like 1:2 (0.5) you can even get artifacts on the resulting picture. (The first sketch after this list shows a simple way to snap a crop to the nearest standard ratio.)
2. I go to the IMG2IMG tab in Automatic1111 and scale the original image to a resolution comfortable for SDv1.5: roughly 768 px on the longer side, so something like 600*800 px or close to it. Sometimes the source has to be upscaled, sometimes downscaled. I use a random seed and go from 0.30 denoising strength (DS) up to 0.60-0.70 with a step of 0.05. At DS over 0.55 the model usually adds too much of its own fantasy to the image - it changes physique, pose and shot angle, destroying the original idea. That's not always bad, though. The sweet spot for me is usually 0.42-0.54 DS. (The second sketch after this list shows roughly how this sweep looks through the Automatic1111 API.)
I don't write a sophisticated prompt, just a basic scene description - the model can pull enough information from the source image.
The usual approach is to generate a lot of variations and pick whichever looks best. Even on entry-level modern GPUs, generating one non-upscaled image with SDv1.5 models is very quick. On my RTX 4060 Ti 16GB it's 2-6 s per image with DPM++ 2M Karras at 34 steps, which is the preset I usually use.
3. When I've made enough variations, I take one of them and put it in place of the source image in the IMG2IMG tab. I don't change the prompt. I just upscale by x3-x4 - that's the maximum I can do with 16+16GB of shared VRAM using SDv1.5. The resulting size is about 2300*2300 px or 2000*3000 px. I set DS to 0.32-0.38 - this is quite important; at anything over 0.40 DS you get too many artifacts too often. I use the standard R-ESRGAN upscaler - I can't see any difference from the -anime6B option or 4x-UltraSharp.
4. Sometimes I iterate this workflow more than once - downscale the upscaled result back to 768 px on the longer side to push more style onto the new source picture, then get a better upscale from it.
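Here's the crop sketch mentioned in step 1. The helper name and the "trim to the nearest ratio, centered" strategy are my own choices for illustration; the ratio list is just the three from step 1:

```python
# Sketch: snap a crop box to the nearest "standard" aspect ratio from step 1.
# The helper name and the centered trim strategy are my own choices for this example.
from PIL import Image

STANDARD_RATIOS = [2 / 3, 3 / 4, 1.0]  # width / height: 2:3, 3:4, 1:1

def crop_to_standard_ratio(img: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop `box` (left, top, right, bottom), then trim it to the nearest standard ratio."""
    left, top, right, bottom = box
    w, h = right - left, bottom - top
    target = min(STANDARD_RATIOS, key=lambda r: abs(w / h - r))
    if w / h > target:           # too wide -> trim width, keep the crop centered
        new_w = int(h * target)
        left += (w - new_w) // 2
        right = left + new_w
    else:                        # too tall -> trim height
        new_h = int(w / target)
        top += (h - new_h) // 2
        bottom = top + new_h
    return img.crop((left, top, right, bottom))

# Example usage (file name and box coordinates are placeholders):
# frame = Image.open("snapshots_resized/fight_scene_017.png")
# person = crop_to_standard_ratio(frame, (420, 60, 780, 700))
# person.save("refs/pose_017.png")
```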
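And here's the sketch mentioned in steps 2-3. I do all of this in the web UI; this is only a rough approximation of the same two passes through the Automatic1111 API (start the UI with --api). The URL, file names, prompt and the "pick candidate 3" line are placeholders, and the simplified second pass just requests a larger output size directly instead of going through the R-ESRGAN upscaler selection from the UI:

```python
# Rough sketch of steps 2-3 via the Automatic1111 web API (UI launched with --api).
# URL, file names and prompt are placeholders; the second pass simply asks for a
# larger output size rather than using the R-ESRGAN upscaler picked in the UI.
import base64, requests

URL = "http://127.0.0.1:7860/sdapi/v1/img2img"

def img2img(image_b64: str, denoise: float, width: int, height: int) -> str:
    payload = {
        "init_images": [image_b64],
        "prompt": "a woman in an acrobatic pose, outdoors",  # basic scene description only
        "denoising_strength": denoise,
        "sampler_name": "DPM++ 2M Karras",
        "steps": 34,
        "seed": -1,          # random seed
        "width": width,
        "height": height,
    }
    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["images"][0]  # base64-encoded PNG

source = base64.b64encode(open("refs/pose_017.png", "rb").read()).decode()

# Pass 1: sweep DS 0.30 -> 0.70 in 0.05 steps at ~768 px on the longer side.
candidates = []
for i in range(9):
    ds = round(0.30 + 0.05 * i, 2)
    candidates.append((ds, img2img(source, ds, 600, 800)))
for ds, img in candidates:
    open(f"out/variation_ds{ds:.2f}.png", "wb").write(base64.b64decode(img))

# Pass 2: feed the best variation back in at low DS and roughly 3x the size.
best = candidates[3][1]  # e.g. the DS 0.45 one; in practice you choose by eye
final = img2img(best, 0.35, 1800, 2400)
open("out/final_upscaled.png", "wb").write(base64.b64decode(final))
```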
Examples
And that's it. The posts below have examples.
https://civitai.com/posts/2572012
https://civitai.com/posts/2555236