A manual for virtual photoshoots.

Hello everyone!

In this article, I describe my experience with virtual photoshoots, which as of 2026 are no worse than real ones in quality, surpass them in creativity, and, most importantly, are easier to produce. There will be a lot of technical detail on how, in my opinion, it is best to train a LoRA, how to select training photos for a character, and which generative models are convenient to use.

Unfortunately, this information is quite scattered, and I have not yet come across a single complete guide like this on the internet, so I will try to fix that.

First, the task I am solving. My goal is to create a set of beautiful, interesting photos of a specific real person, as in a real photoshoot. Both the resemblance to the person and the realism of the photos should be close to 100%. The resulting photos should not give the impression of "just another AI caricature that can now be obtained in 5 minutes on every second website on the internet."

My goal is to do this on my computer, without censorship, without an internet connection, and without paying third-party services.

At the same time, the task should be solved quickly and should not require much effort from me or from the person I am photographing. I chose a process that fits these conditions, and I give my reasoning for each choice below. Perhaps I am not entirely right somewhere, but this is the best I could come up with, and it definitely works.

So:

1. Choosing a generative model.

For good photorealism, SDXL, Qwen Image 2512, Chroma, Z-Image Turbo, and Z-Image Base are all suitable. Flux, unfortunately, looks too plastic. SDXL is lightweight but does not follow prompts very well by modern standards. Chroma is not very stable, especially when the character has to be 100% recognizable. Z-Image Turbo is stable when generating people but not very creative, and it can slightly alter the facial features of LoRA characters; I still haven't found a way to fix that. Qwen Image is good but heavy, and training a LoRA for it on a home computer takes quite a long time for most users. That leaves Z-Image Base, which meets all the conditions: it is very photorealistic, fairly easy to train, keeps characters very stable, and generates quite fast with the new 5-step distillation LoRA for acceleration.

2. Choosing training software.

I tried Kohya, Ostris, and OneTrainer. Kohya is old; I'm not sure if it supports Z-Image. Ostris is beautiful and pleasant to use, and it can create not only LoRA but also LoKr files, which are 10-20 times smaller. However, it is not as precise or convenient to configure as OneTrainer, has fewer VRAM optimization options, and on my system trains slower and less stably. So I chose OneTrainer. It definitely works.

3. Choosing a computer and graphics card.

The most critical factor is the amount of video memory (VRAM). To work without terrible slowdowns, you need at least 16GB. You can manage with 8GB if you use an SDXL model, but we chose Z-Image Base, which requires at least 16GB for LoRA training. I highly recommend 24GB: it is the current standard, and everything will definitely run fine, both LoRA training (even for Qwen Image) and generation in ComfyUI with all add-ons and workflows. Personally, I have an old RTX 3090 with 24GB; its price is comparable to, or even lower than, that of newer 40XX and 50XX graphics cards with 16GB.
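
The VRAM thresholds above can be written down as a small lookup. This is only a sketch of my own rule of thumb: the function name and the table are made up for illustration, and the numbers are rough guidance rather than hard limits.

```python
def trainable_models(vram_gb: float) -> list[str]:
    """Return the models from this guide whose LoRA training should fit in vram_gb."""
    # (minimum VRAM in GB, model) pairs taken from the text above; approximate.
    thresholds = [
        (8, "SDXL"),
        (16, "Z-Image Base"),
        (24, "Qwen Image 2512"),
    ]
    return [name for min_gb, name in thresholds if vram_gb >= min_gb]


print(trainable_models(16))
print(trainable_models(24))
```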

The second most critical factor is a fast SSD. The tens of gigabytes of model files are constantly moved in and out of memory, and if your drive is slow, you will suffer. The amount of RAM is less critical; personally, I have 16GB and it is quite enough for comfortable work. During LoRA training, neither 16GB nor 32GB of RAM will be enough anyway and the system will swap to the SSD, but with a fast SSD that is not a problem and everything still runs quickly.

The CPU is not critical at all; the graphics card does all the work.

4. Selecting a training photo set for the character. Very important.

Okay, here is where the magic begins, learned through experience. The minimum set for full LoRA training is six photos: three full-body shots for the figure (front, side, and back) and three close-up headshots (front, side, and back). This is the minimum for the model to later draw exactly your character, not something approximate. More photos are better, but the more training photos there are, the longer the LoRA takes to train. So think carefully before adding a photo to the dataset, each time asking yourself "what new thing about the character will the LoRA learn from this photo?"

The ideal set of training photos is as follows:

- One vertical full-body photo from the front, against a studio/wall background or a neutral outdoor background without distracting elements, in tight-fitting clothing, preferably a swimsuit. Height 1024 pixels, width 512..800.

- One vertical full-body photo from the side, on a different background, preferably in clothing of a different color and style, but also tight-fitting/open. 1024x(512..800).

- One vertical full-body photo from the back, on a third background, following the same principle as the first two.

- One horizontal full-body photo from the front, also with a neutral background that differs from the previous photos, also in tight-fitting/open clothing, but now with a width of 1024 pixels and a height of 512..800. Without it, the model may later get confused about the person's body width when generating horizontal or square photos.

- 2..4 photos of the person's entire figure in various poses characteristic of them (sitting/lying), in different characteristic clothing. Preferably 768 pixels on the long side and different aspect ratios (square/horizontal/vertical). The LoRA will learn these characteristic poses, which adds resemblance later during generation.

- 2-3 square (or almost square) photos of the person's head, with different hairstyles and different facial expressions characteristic of this person. 1024 pixels on the long side.

- 2 head photos from two different sides and one photo from the back, with a non-voluminous hairstyle, so that the LoRA learns what shape the person's head is and how their facial features look from the side. 1024 pixels on the long side.

- Another 2-3 photos of the person's head, with different hairstyles and different facial expressions characteristic of this person. 768 pixels on the long side.

- And another 2-3 waist-up photos, in different characteristic clothing and poses for the person, 768 pixels on the long side.

In total, you get 15..20 photos with different angles, backgrounds, hairstyles, and facial expressions. No more are needed; these photos already contain all the information necessary for LoRA training. All photos must be sharp and have correct color balance. Otherwise, the LoRA will learn the style of the photos and push it into generated images, which limits creative potential and is simply annoying. Also avoid photos with strange, complex poses: they are harder to learn, and training on them causes glitches in the model's understanding of body proportions.
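
To prepare the sizes listed above, it helps to know what a source photo becomes when its long side is brought to 1024 or 768 pixels. The helper below is a minimal sketch (the function name is mine); it only does the arithmetic and leaves the actual resizing to whatever image editor or library you prefer.

```python
def fit_long_side(width: int, height: int, target: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals target, keeping the aspect ratio."""
    if width <= 0 or height <= 0 or target <= 0:
        raise ValueError("dimensions must be positive")
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 3000x4000 portrait shot becomes 768x1024, inside the 512..800 width range.
print(fit_long_side(3000, 4000, 1024))
# A square head shot reduced to 768 on the long side.
print(fit_long_side(2000, 2000, 768))
```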

Such a set learns to perfection in about 2500-3000 steps; on an RTX 3090, this is two to three hours. Optimal.

5. Creating descriptions for training photos. Also important.

The general principle for captions is this: given the caption, the base model should draw approximately the same image as the training one, except for the features of the character we are actually teaching. Training then consists of learning exactly those specific traits rather than everything indiscriminately, which is both faster and more accurate.

Describing each image in full by hand is long and tedious. Use any sufficiently smart Img2Text model and ask it to describe the image completely, as a prompt from which such a photo could be drawn.
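
Once the captions are generated, it is worth checking that no photo was left without one. The sketch below assumes that each caption is stored as a .txt file with the same base name as its image, which is a common convention but still an assumption here; check how your concept is configured to read prompts. The function name and the suffix list are mine.

```python
from pathlib import PurePath

# Image extensions to look for; extend if your dataset uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def images_missing_captions(file_names: list[str]) -> list[str]:
    """Return image file names that have no same-named .txt caption in the list."""
    captioned = {
        PurePath(name).stem
        for name in file_names
        if PurePath(name).suffix.lower() == ".txt"
    }
    return sorted(
        name
        for name in file_names
        if PurePath(name).suffix.lower() in IMAGE_SUFFIXES
        and PurePath(name).stem not in captioned
    )


print(images_missing_captions(["front.jpg", "front.txt", "side.png"]))
```

For a real folder, pass it the result of os.listdir() for your dataset directory.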

It is better not to specify eye color, hair color, and body type unless you plan to change them in generated photos later. If you do specify them, they will be easier to change during generation, but to preserve the resemblance you will then have to describe the character in full detail every time; otherwise the result will not be very similar.
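
Img2Text models like to mention exactly these traits, so I find it useful to flag such captions for manual review. Below is a rough sketch: the word lists are hypothetical and far from complete, and a match only means "look at this caption", not "this caption is wrong".

```python
import re

# Hypothetical, incomplete word lists; extend them for your own captions.
TRAIT_PATTERNS = {
    "eye color": r"\b(?:blue|green|brown|gr[ae]y|hazel)(?:[- ]eyed| eyes)\b",
    "hair color": (
        r"\b(?:blonde?|brunette|redhead)\b"
        r"|\b(?:black|brown|red|gr[ae]y|dark|light)[- ]hair(?:ed)?\b"
    ),
    "body type": r"\b(?:slim|slender|skinny|athletic|curvy|plump|muscular|petite)\b",
}


def flag_traits(caption: str) -> list[str]:
    """Return the names of the trait groups that the caption appears to mention."""
    return [
        name
        for name, pattern in TRAIT_PATTERNS.items()
        if re.search(pattern, caption, flags=re.IGNORECASE)
    ]


print(flag_traits("A slim woman with blue eyes and long brown hair sits on a bench"))
print(flag_traits("A woman in a red dress stands by a window"))
```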

Whether to assign a unique tag to each character is an open question. In theory, if all captions start with a unique character tag, the LoRA will remember it, and later you can simply mention the tag to have the character drawn. This works to some extent, but the LoRA learns the character's appearance just as well from the plain class word "man" or "woman". And since no current model can combine two character LoRAs in one frame without blending their features, I see no point in tags, only more work writing them. If we want 100% resemblance, a tag won't help us anyway. I don't use them.

6. Configuring OneTrainer.

I am describing these settings for a graphics card with 24GB of memory so that everything works at maximum speed.

- General window.

Select "Z-Image" and "LoRA" in the top right.

Enter "cuda" as the value for Temp Device. If you leave it as "cpu", less VRAM will be used, but there will be delays during backup creation and sampling related to transferring model parts back and forth.

- Model window.

In Base Model, you need to specify the path to the local Z-Image model in diffusers format. Before that, of course, find and download it.

Model Output Destination can be set to anything, because we will not use it in 99% of cases, but will instead use one of the intermediate LoRA versions.

You can enable Compile Transformer Blocks; it will train slightly faster. If disabled, about 3GB of VRAM will be freed up.

Set Transformer Data Type to bfloat16; this gives maximum quality, requires no quantization, and fits in VRAM. If memory is low, set it to float8 or even int8.

SVDQuant: disabled, if the previous item is bfloat16. If the previous item is int8, try enabling it by setting quantization to bfloat16 and rank 16. This will take a couple of gigabytes of VRAM, but training will be more accurate.

Set Text Encoder 1 Data Type to nfloat4, since you will not be training it and it cannot be completely disabled and unloaded from memory. At least, unloading doesn't work for me; maybe it will for you.

Output Data Type: bfloat16 is quite sufficient for a character LoRA.

- Data window.

Enable everything. Cache clearing can be skipped, but the savings will only be about 10 seconds at startup, so it doesn't matter.

- Concepts window.

As usual, add a Concept with the character's photos. All photos once, no need to flip or double the dataset (otherwise the slight asymmetry of your character's face will disappear and they will become less similar to the real person). No need to drop tags either (all our photos are described in normal human language). Just add the path to the folder with photos, name the concept as you like, and leave other values by default.

- Training window.

Optimizer = PRODIGY. Yes, PRODIGY, not ADAMW.

Learning Rate Scheduler = COSINE. Otherwise, the step size that PRODIGY estimates will only grow, and the LoRA will overtrain.

Learning Rate = 1.0. Yes, exactly 1.0.

Learning Rate Warmup Steps = 0

Epochs = 250 if there are about 10 photos in the dataset, or 200 if about 20.

Local Batch Size = 1. Yes, 1, not 2 as everyone advises. A batch of 2 is needed to compensate for dataset inaccuracies, but our dataset is very well prepared and very accurate, so each photo can be trained on individually, extracting maximum information from it.

Accumulation Steps = 1. The reason is the same, see previous item.

Gradient Checkpointing = ON, otherwise there won't be enough VRAM.

Train Data Type = bfloat16

Resolution = "1024, 768, 512". Yes, three values separated by commas, exactly like that.

Leave the rest at the defaults. Offset Noise Weight and Perturbation Noise Weight also stay at 0.0, because Z-Image training does not converge slowly; it converges in steps and jumps.
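
To see what these numbers mean together, here is a small sketch of the schedule arithmetic. It assumes one pass over every photo per epoch and a textbook cosine decay from 1.0 to 0.0; the function names are mine, and OneTrainer's own scheduler may differ in details (I also do not account for how the multi-resolution setting affects steps per epoch).

```python
import math


def total_steps(num_photos: int, epochs: int, batch_size: int = 1, accumulation: int = 1) -> int:
    """Optimizer steps in a run: batches per epoch times epochs, divided by accumulation."""
    batches_per_epoch = math.ceil(num_photos / batch_size)
    return math.ceil(batches_per_epoch * epochs / accumulation)


def cosine_multiplier(step: int, total: int) -> float:
    """Textbook cosine decay factor: 1.0 at step 0, 0.0 at the final step."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total))


steps = total_steps(num_photos=15, epochs=200)
print(steps)  # 3000
for s in (0, steps // 2, steps):
    print(s, round(cosine_multiplier(s, steps), 3))
```

With Learning Rate = 1.0, this factor is what scales Prodigy's own step-size estimate, so the updates shrink toward the end of the run even though the estimate itself never decreases.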

- Sampling window.

Sample after 50 STEPS, Skip first 500.

Add two samples, both with seed 41, cfg=4, and steps=20: 768x768 with the prompt "woman head photo", and 1024x1024 with the prompt "woman in underwear dancing in dynamic pose".

Thus, every 50 steps you will see how well the character's face has been learned and how accurate or distorted the figure has become.

- Backup window.

Backup After 50 STEPS

Rolling Backup - Yes, Rolling Backup Count = 2

Save Every 50 STEPS, Skip first 500.

Thus, an intermediate LoRA will be saved every 50 steps. And later, based on the samples, you can choose the one where both face and figure turned out best simultaneously.
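
To get a feeling for how many intermediate files you will have to look through, the save points can be listed in advance. This is a sketch under my own assumption that saves land on multiples of the interval and that the first one happens exactly at the "skip first" step; the actual behaviour at that boundary may differ by one interval.

```python
def save_steps(total: int, every: int = 50, skip_first: int = 500) -> list[int]:
    """Steps at which an intermediate LoRA is expected (boundary handling is assumed)."""
    # First multiple of `every` that is not below `skip_first`, and at least `every`.
    first = max(every, ((skip_first + every - 1) // every) * every)
    return list(range(first, total + 1, every))


steps = save_steps(total=3000)
print(len(steps), steps[0], steps[-1])
```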

- LoRA window

LoRA Rank = 32

LoRA Alpha = 1.0

The rest: default. Probably, you can experiment by setting Rank=16, and that will also be enough, but 32 works for all characters, for all datasets, and practically for all models, so I see no point in playing with it.
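
For reference, here is what Rank and Alpha mean in the usual LoRA formulation: the low-rank update is multiplied by alpha / rank, and each adapted linear layer gets rank * (in + out) extra parameters. I am assuming OneTrainer follows this convention, and the 4096x4096 layer size below is a made-up example, not the real size of any Z-Image layer.

```python
def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update in the usual LoRA formulation."""
    return alpha / rank


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Extra parameters for one linear layer: matrices of shape (rank, in) and (out, rank)."""
    return rank * (in_features + out_features)


print(lora_scale(alpha=1.0, rank=32))  # 0.03125
# Hypothetical 4096x4096 layer: halving the rank halves the added parameters.
print(lora_param_count(4096, 4096, rank=32))
print(lora_param_count(4096, 4096, rank=16))
```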

With this, the configuration is complete and you can launch OneTrainer.

I will also describe the settings for Qwen Image 2512 just in case.

I warn you again: it is heavy, the LoRA trains slowly, and an RTX 3090 is the absolute minimum for it to work at all. I think comfortable use is only possible starting with an RTX 5090 with 32GB of VRAM.

I have selected two configuration options for Qwen - one is higher quality but slower, the second is faster but lower quality. The high-quality option works with transformer quantization at float8 and does not fit entirely in video memory. The fast one uses nfloat4 quantization, fits entirely, but character resemblance may turn out unsatisfactory. If you decide to train with nfloat4, almost all settings can be left as for Z-Image, except for Optimizer, Learning Rate Scheduler, and Learning Rate settings. And it's also better to enable SVDQuant = bfloat16 with rank = 16 for nfloat4; it should turn out slightly higher quality. Below I will describe the settings for float8.

The settings are as follows:

- General window.

Select "QwenImage" and "LoRA" in the top right.

Temp Device = "cpu"

- Model window.

In Base Model, you need to specify the path to the local Qwen Image 2512 model in diffusers format. Find and download it beforehand.

Compile Transformer Blocks = Off.

Transformer Data Type = float8.

SVDQuant = disabled.

Text Encoder 1 Data Type = nfloat4.

Output Data Type = bfloat16.

- Data window.

Enable everything. If you don't clear the cache, savings will be 2-3 minutes at startup.

- Concepts window.

The same as for Z-Image.

- Training window.

Optimizer = ADAMW_8BIT.

Learning Rate Scheduler = COSINE.

Learning Rate = 0.0003

Learning Rate Warmup Steps = 50

Epochs = 500 if there are about 10 photos in the dataset, or 300 if about 20.

Local Batch Size = 1.

Accumulation Steps = 1.

Gradient Checkpointing = CPU_OFFLOADED.

Layer offload fraction = 0.06. Training a LoRA for Qwen Image 2512 uses approximately 26.5 GB of video memory and we only have 24 GB, so a small part still has to be offloaded. The value 0.06 is the minimum I found empirically: anything less gives CUDA out of memory, anything more makes training longer.

Train Data Type = bfloat16

Resolution = "1024, 768, 512".

Leave the rest by default.

- Sampling window.

Sampling - NEVER.

It will be much faster to manually review all intermediate LoRAs later. One sampling cycle of two examples with model loading/unloading takes about 20 minutes; there is no point in waiting that long.

- Backup window.

Backup After 300 STEPS

Rolling Backup - Yes

Save - every 300 steps, skip first 900.

- LoRA window

LoRA Rank = 16

LoRA Alpha = 1.0

LoRA Weight Data Type = float32.

Training will take a long time, approximately 6 seconds per iteration on a 3090. Sampling is especially slow, which is why we disable it. The entire process takes about 9-10 hours, compared with 2-3 hours for Z-Image. You can quite safely stop at step 4500. Usually by step 1100 it is already roughly clear how the LoRA will turn out, but the character is not yet completely accurate.
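
The duration follows directly from the epoch count and the time per iteration. A quick sketch of that arithmetic, assuming batch size 1, no accumulation, and a constant 6 seconds per step (the function name is mine):

```python
def training_hours(num_photos: int, epochs: int, seconds_per_step: float) -> float:
    """Estimated run time in hours with batch size 1 and no gradient accumulation."""
    steps = num_photos * epochs
    return steps * seconds_per_step / 3600


print(training_hours(20, 300, 6.0))  # 10.0
print(round(training_hours(10, 500, 6.0), 1))
# Stopping early at step 4500:
print(4500 * 6.0 / 3600)  # 7.5
```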

If you have an RTX 5090, you are lucky; you will be able to set Gradient Checkpointing = ON and everything will be much faster, I think about 3 hours for everything, including sampling.

7. Choosing a good LoRA from several.

After training, you will likely have several good LoRAs. A good one shows clear fingers, light tones, and correct anatomy. Take the best one from those closer to the end of training and test it in ComfyUI for how similar and stable the character is. If you are not satisfied, try the previous one, and so on.

8. Generating the photoshoot itself using the character LoRA.

Of course, you can come up with your own prompts, find prompts online, or run other people's photos that you like through an LLM to get descriptions, as when preparing the dataset. But there is an easier way.

Take any Img2img workflow suitable for Z-Image. Find and insert a 5-step distillation LoRA so that everything generates quickly, and insert your character's LoRA. In KSampler, set denoise=0.72 and steps=5; or, in Advanced KSampler, set 7 steps and start step = 2. Then take a photo you like, feed it to the KSampler input, and run generation with your character based on it, writing simply "woman" or "man" in the prompt. The result resembles the composition of the example while being very characteristic of your character and quite photorealistic.
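
The two sampler variants are roughly the same thing: starting at step 2 of a 7-step schedule leaves 5 steps to run, which is about 0.71 of the schedule and close to denoise=0.72. The sketch below is plain arithmetic with names of my own; it does not reproduce how ComfyUI rounds steps internally, so treat the match as approximate.

```python
def executed_steps(total_steps: int, start_step: int) -> int:
    """Number of sampling steps actually run when starting at start_step."""
    if not 0 <= start_step < total_steps:
        raise ValueError("start_step must be in [0, total_steps)")
    return total_steps - start_step


def denoise_fraction(total_steps: int, start_step: int) -> float:
    """Share of the schedule that is re-noised and regenerated."""
    return executed_steps(total_steps, start_step) / total_steps


print(executed_steps(7, 2))              # 5
print(round(denoise_fraction(7, 2), 3))  # 0.714
```

A lower fraction keeps more of the source photo's composition; a higher one gives the model more freedom.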

For Qwen Image 2512, generation is slightly more complex. Besides everything described for Z-Image, you also need to add a module that describes the source photo using a smarter LLM and feed the resulting description into the prompt input. Do not forget to instruct the LLM not to describe eye color, body type, age, or hairstyle, so that they don't interfere with our character's appearance.

There is more work with Qwen Image, but the result can be even better than with Z-Image if you want to thoughtfully create beautiful artistic images.

The Z-Image workflow is excellent for mass production: within 4-5 hours you can prepare a dataset, train a LoRA, generate 200 photos based on examples, select the best 50, and deliver a result to the client that is no worse than a regular photoshoot.

That's all, good generations to you :)