
Virtual photoshoots manual.

Hello everyone!

In this article, I’ll describe my experience with virtual photoshoots, which at this point (2026) are no longer inferior to real photoshoots in quality and surpass them in creativity and, most importantly, ease of creation. There will be a lot of technical details about what I believe is the best way to train a LoRA, how to select training photos for a character, and which generative models are convenient to use.

Unfortunately, this information is usually quite fragmented, and I have yet to find a single comprehensive guide on the subject online. I’ll try to fix that.

First, let me describe the problem I’m solving. My goal is to create a collection of beautiful and interesting photos for a specific real person, similar to what you would get from a professional photoshoot. The resemblance to the real person should be as close to 100% as possible, and the realism of the photos should also approach 100%.

When looking at the resulting images, people should not think: "Oh, it’s just another AI caricature that can be generated in five minutes on every other website."

My goal is to do all of this on my own computer, without censorship, without internet access, and without paying for third-party services.

Ideally, the process should also be fast and require minimal effort both from me and from the person being photographed. Those requirements heavily influenced my choice of workflow. I’ll explain my reasoning throughout the article. I may not be right about everything, but this is the best workflow I’ve come up with so far, and it definitely works.

So, let's begin.

1. Choosing a Generative Model

For high-quality photorealism, the following models are suitable: SDXL, Qwen Image 2512, Chroma, Z-Image Turbo, and Z-Image Base. Unfortunately, Flux looks too plastic. SDXL is lightweight but, by modern standards, does not understand prompts particularly well. Qwen Image is excellent but very heavy. Training LoRAs for it on a home computer is nearly impossible for most users. Chroma is not particularly stable, especially when creating characters that need to remain 100% consistent. Z-Image Turbo is stable for generating people but is not especially creative. Facial features of LoRA-trained characters also tend to drift slightly, and I never found a reliable way to eliminate that issue.

That leaves Z-Image Base. It satisfies all requirements: highly photorealistic, relatively easy to train, very stable character consistency, and reasonably fast generation when combined with the new 5-step acceleration LoRA.

2. Choosing Training Software

I have tried Kohya, Ostris (AI Toolkit), and OneTrainer. Kohya is quite old, and I’m not even sure whether it properly supports Z-Image. Ostris has an attractive interface and supports not only LoRA but also LoKr models, which are 10–20 times smaller. However, it is not as precise or configurable as OneTrainer. It offers fewer VRAM and RAM optimization options, and in my experience it trains more slowly and less consistently.

Therefore, I chose OneTrainer. It definitely works.

3. Choosing a Computer and GPU

The most critical factor is GPU memory.

To work without unbearable slowdowns, you need at least 16 GB of VRAM. You can get by with 8 GB if you use SDXL, but we chose Z-Image Base, and training LoRAs for it requires at least 16 GB. I strongly recommend 24 GB because it has effectively become the standard. With 24 GB, LoRA training, ComfyUI generation, complex workflows, and various extensions will run without issues. Personally, I use an older RTX 3090 with 24 GB of VRAM. It is often comparable in price to modern 16 GB cards from the 40XX and 50XX series, and sometimes even cheaper.

The second most important component is a fast SSD.

These model files occupy tens of gigabytes and are constantly being moved between storage and memory. If your drive is slow, you will suffer.

RAM capacity is actually less important. I personally have 16 GB and it is perfectly adequate. During LoRA training, memory demand exceeds 16 GB, and would exceed 32 GB as well, so the system starts swapping to the SSD either way. As long as the SSD is fast, this is not a problem and performance remains acceptable.

CPU performance is almost irrelevant. The GPU does virtually all the work.

4. Building the Training Dataset (Very Important)

This is where the real magic begins.

The absolute minimum dataset for proper LoRA training is six photos:

  • Three full-body photos: front, side, and back.

  • Three close-up head shots: front, side, and back.

That is the minimum.

More photos are better, but additional photos increase training time. For every image you add, ask yourself: "What new information about the character will the LoRA learn from this photo?"

The ideal dataset looks like this:

  • One vertical full-body front-facing photo against a studio backdrop, wall, or neutral natural background without distracting elements. Tight-fitting clothing is preferred; a swimsuit is ideal. Height 1024 pixels, width 512–800 pixels.

  • One vertical full-body side-view photo on a different background, preferably with different clothing color and style, but still tight-fitting or revealing. Height 1024 pixels, width 512–800 pixels.

  • One vertical full-body rear-view photo on a third background, following the same principles.

  • One additional front-facing full-body photo on a different neutral background, also with tight-fitting or revealing clothing, but this time with a width of 1024 pixels and height between 512–800 pixels. Without this image, the model may become confused about body proportions when generating horizontal or square images.

  • 2–4 full-body photos in characteristic poses (sitting, lying down, etc.) and characteristic clothing. Preferably 768 pixels on the long side, using a mix of square, horizontal, and vertical formats. The LoRA will learn these characteristic poses and improve resemblance during generation.

  • 2–3 square (or nearly square) headshots featuring different hairstyles and facial expressions characteristic of the person. 1024 pixels on the long side.

  • Two side-profile headshots from different sides and one rear headshot with a non-voluminous hairstyle so the LoRA learns head shape and side facial features. 1024 pixels on the long side.

  • Another 2–3 headshots with different hairstyles and facial expressions characteristic of the person. 768 pixels on the long side.

  • 2–3 waist-up photos featuring characteristic clothing and poses. 768 pixels on the long side.

This results in approximately 15–20 photos with varied angles, backgrounds, hairstyles, and facial expressions. You do not need more. At that point, the dataset already contains essentially all the information required for LoRA training.
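The target sizes in the list above are all expressed as a length for the long side. As a minimal sketch (the function names and the example source dimensions are hypothetical, not from any tool mentioned here), the resize target for an arbitrary source photo can be computed like this before exporting it from an image editor:

```python
def fit_long_side(width: int, height: int, long_side: int) -> tuple[int, int]:
    """Scale (width, height) so the longer edge equals long_side, keeping the aspect ratio."""
    if width <= 0 or height <= 0 or long_side <= 0:
        raise ValueError("dimensions must be positive")
    scale = long_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def orientation(width: int, height: int) -> str:
    """Classify a photo as square, horizontal, or vertical."""
    if width == height:
        return "square"
    return "horizontal" if width > height else "vertical"


# Hypothetical source photos: (label, width, height, target long side in pixels)
photos = [
    ("full-body front", 3000, 4000, 1024),
    ("characteristic pose", 6000, 4000, 768),
    ("headshot", 2000, 2000, 1024),
]

for label, w, h, target in photos:
    print(label, orientation(w, h), fit_long_side(w, h, target))
```

A 3000×4000 source scaled to a 1024-pixel long side comes out at 768×1024, which falls inside the 512–800 width range suggested for vertical full-body shots.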

All photos should be sharp and have proper color balance. Otherwise, the LoRA will learn the photographic style itself and inject it into generated images, which limits creative flexibility and quickly becomes annoying.

Avoid photos with strange or highly complex poses. Such poses are harder to learn and tend to cause body proportion artifacts during training.

A dataset like this typically reaches optimal quality after roughly 2500–3000 training steps. On an RTX 3090, that takes about two to three hours.

5. Writing Captions for Training Images (Also Important)

The general principle is simple: given the caption alone, the base model should be able to generate an image very similar to the training image, except for the specific identity of the person being trained. In that case, the LoRA's job becomes learning only the unique characteristics of the person rather than everything else. This makes training both faster and more accurate.

Manually describing every image is tedious. Use any reasonably capable image-captioning model. Ask it to describe the image in full detail and generate a prompt that could reproduce the same photo. Then prepend a unique token and character classifier, for example: "my_g123rl woman". The result looks something like this: "my_g123rl woman close-up headshot against a muted gray background. She has dark, curly hair that frames..."
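Prepending the token by hand to 15–20 captions is easy to get wrong, so here is a minimal sketch of automating it. It assumes each caption is stored as a .txt file next to its image; that layout, the function names, and the folder path in the usage comment are assumptions for illustration, not something the training software requires in exactly this form.

```python
from pathlib import Path


def build_caption(description: str, token: str, class_word: str) -> str:
    """Prepend '<token> <class_word>' to a description, unless it is already there."""
    prefix = f"{token} {class_word}"
    text = " ".join(description.split())  # collapse newlines and repeated spaces
    if text.startswith(prefix):
        return text
    return f"{prefix} {text}".strip()


def tag_caption_files(folder: Path, token: str, class_word: str) -> int:
    """Rewrite every .txt caption in folder with the prefix; return how many files changed."""
    changed = 0
    for txt in sorted(folder.glob("*.txt")):
        original = txt.read_text(encoding="utf-8")
        updated = build_caption(original, token, class_word)
        if updated != original:
            txt.write_text(updated, encoding="utf-8")
            changed += 1
    return changed


print(build_caption(
    "close-up headshot against a muted gray background.",
    "my_g123rl",
    "woman",
))

# Usage on a real folder (hypothetical path):
# tag_caption_files(Path("dataset/my_g123rl"), "my_g123rl", "woman")
```

Running it twice is harmless, because a caption that already starts with the token is left as it is.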

Eye color and hairstyle do not necessarily need to be included if you have no intention of changing them later. However, including them is also fine. The model will learn them anyway, and explicit descriptions may make later prompt control easier. I personally include them because I’m too lazy to remove them from every caption, and in practice they cause very few problems.

6. Configuring OneTrainer

The following settings are intended for a GPU with 24 GB of VRAM and are optimized for maximum speed.

  • General Tab

Select "Z-Image" and "LoRA" in the top-right corner.

Set Temp Device to: cuda. If left on CPU, VRAM usage decreases slightly, but backups and sampling become slower due to model transfers.

  • Model Tab

Base Model should point to a local Diffusers-format Z-Image model. Download it first, obviously.

Model Output Destination can be anything, since in 99% of cases we will use one of the intermediate LoRA checkpoints instead.

Compile Transformer Blocks: enabled. Training becomes slightly faster. Disabling it frees roughly 3 GB of VRAM.

Transformer Data Type: bfloat16. Highest quality, no quantization required, and fits into VRAM. If memory is limited, use float8 or int8.

SVDQuant: Disabled if using bfloat16. If using int8, try enabling SVDQuant with bfloat16 quantization and rank 16. This consumes a couple of extra gigabytes but improves training accuracy.

Text Encoder 1 Data Type: nfloat4. We are not training it anyway, and I have not found a way to unload it completely from memory.

Output Data Type: bfloat16. More than sufficient for character LoRAs.
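To see why the data type choices above matter for VRAM, it helps to estimate the size of the weights alone: parameter count times bytes per parameter. The sketch below is a rough back-of-the-envelope calculation. The parameter count is an illustrative placeholder (check the model card for the real figure), and the estimate ignores activations, optimizer state, and quantization overhead.

```python
# Approximate storage cost per parameter for each data type, in bytes.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "float8": 1.0,
    "int8": 1.0,
    "nfloat4": 0.5,  # 4 bits per weight; real usage is slightly higher
}


def weights_gb(param_count: float, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes (10**9 bytes)."""
    return param_count * BYTES_PER_PARAM[dtype] / 1e9


# ASSUMPTION: illustrative parameter count, not taken from this article.
ASSUMED_TRANSFORMER_PARAMS = 6e9

for dtype in ("bfloat16", "float8", "int8"):
    print(dtype, weights_gb(ASSUMED_TRANSFORMER_PARAMS, dtype), "GB")
```

Under that assumption, halving the bytes per parameter halves the weight footprint, which is the trade-off behind falling back to float8 or int8 when memory is limited.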

  • Data Tab

Enable everything. Cache cleaning is optional: the startup time difference is only about ten seconds.

  • Concepts Tab

Create a concept using the character photos. Use each photo exactly once. Do not mirror images, and do not artificially duplicate the dataset. Mirroring in particular erases subtle facial asymmetries, making the character less similar to the real person.

Do not use tag dropout. Our captions are already written in proper natural language.

Simply specify the image folder and leave everything else at default settings.
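Since each photo should appear exactly once, a quick duplicate check on the image folder can catch accidental copies before training. This is a minimal sketch with a hypothetical function name. It only detects byte-identical files, so a mirrored, resized, or re-encoded copy will not be flagged.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_duplicate_images(folder: Path) -> list[list[str]]:
    """Group image files in folder that have identical contents (same SHA-256 digest)."""
    by_hash: dict[str, list[str]] = defaultdict(list)
    for path in sorted(folder.iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path.name)
    # Keep only the groups that contain more than one file.
    return [names for names in by_hash.values() if len(names) > 1]


# Usage on a real folder (hypothetical path):
# for group in find_duplicate_images(Path("dataset/my_g123rl")):
#     print("identical files:", group)
```

An empty result means no two image files in the folder are exact copies of each other.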

  • Training Tab

Optimizer = PRODIGY. Yes, PRODIGY, not ADAMW.

Learning Rate Scheduler = COSINE. Otherwise the step size that PRODIGY estimates tends to keep increasing and may cause overtraining.

Learning Rate = 1.0. Yes, exactly 1.0.

Learning Rate Warmup Steps = 0

Epochs = 250 for datasets with around 10 photos, or 200 for datasets with around 20 photos.

Local Batch Size = 1. Yes, 1, not 2. Batch size 2 is often used to compensate for dataset inaccuracies. Our dataset is highly curated and precise, so every image can be trained individually to extract maximum information.

Accumulation Steps = 1. Same reasoning.

Gradient Checkpointing = ON. Otherwise VRAM will not be sufficient.

Train Data Type = bfloat16

Resolution: "1024, 768, 512". Exactly like that, three comma-separated values.

Offset Noise Weight = 0.0

Perturbation Noise Weight = 0.0

Leave everything else at default values.

Z-Image training tends to converge in jumps rather than gradually.
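With batch size 1 and no gradient accumulation, the total step count follows directly from the epoch and photo counts. The sketch below works through that arithmetic and also shows how a plain cosine schedule scales the step size over the run. It assumes each photo is seen exactly once per epoch and that the scheduler decays from 1 to 0 over the whole run with no warmup. Both are assumptions about the trainer's behavior, and the function names are hypothetical.

```python
import math


def total_steps(num_images: int, epochs: int, batch_size: int = 1,
                accumulation_steps: int = 1) -> int:
    """Optimizer steps for a run, assuming every image is seen once per epoch."""
    batches_per_epoch = math.ceil(num_images / batch_size)
    return epochs * math.ceil(batches_per_epoch / accumulation_steps)


def cosine_factor(step: int, total: int) -> float:
    """Multiplier in [0, 1] from a cosine decay without warmup."""
    progress = min(max(step / total, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_steps(10, 250))  # 2500
print(total_steps(20, 200))  # 4000

run = total_steps(10, 250)
for step in (0, run // 2, run):
    print(step, round(cosine_factor(step, run), 3))
```

Under these assumptions, a 20-photo dataset at 200 epochs runs for 4000 steps, so the best result would come from an intermediate checkpoint rather than the final one.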

  • Sampling Tab

Sample every 50 steps. Skip first 500 steps.

Add two samples:

Sample 1: 768x768, seed 41, prompt: "<character token> person head photo", cfg=4, steps=20

Sample 2: 1024x1024, seed 41, prompt: "<character token> person in underwear dancing in dynamic pose", cfg=4, steps=20

This provides continuous monitoring of both facial quality and body quality during training.

  • Backup Tab

Backup After = 50 steps

Rolling Backup = Yes

Rolling Backup Count = 2

Save Every = 50 steps

Save Filename Prefix: my_g123rl_

This way, a checkpoint is saved every 50 steps. Later, you can select the version that achieves the best balance between face quality and body quality.

  • LoRA Tab

LoRA Rank = 32

LoRA Alpha = 1.0

Everything else can remain at default settings.

You could experiment with Rank 16, which may also work. However, Rank 32 works reliably across virtually all characters, datasets, and models, so I see little reason to change it.

Training setup is now complete. Launch OneTrainer.

7. Selecting the Best LoRA

Training will usually produce several good checkpoints. Look for clean fingers, tones that are not overburned, and correct anatomy.

Choose the best checkpoint from the later stages of training. Test it in ComfyUI. Evaluate character similarity and consistency. If you are not satisfied, try an earlier checkpoint. Repeat until you find the best one.

8. Generating the Actual Photoshoot

You can write your own prompts, find prompts online, or use an LLM to describe existing photos just as you did during dataset preparation.

However, there is an easier approach. Use any Img2Img workflow compatible with Z-Image. Add the 5-step distillation LoRA for fast generation and load your character LoRA. In KSampler, set Denoise = 0.72 and Steps = 5. Alternatively, in Advanced KSampler, set Total Steps = 7 and Start Step = 2.

Take a photo you like. Feed it into KSampler. Use a very simple prompt: "woman" or "man".

The result will resemble the reference image while naturally adopting the appearance and characteristics of your trained character.
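The two sampler settings above describe roughly the same thing: only the last part of the noise schedule is run, so the reference photo keeps its composition. As a minimal sketch of the arithmetic (the function name is hypothetical, and this is a simplification rather than a description of the sampler's internals), the fraction of the schedule that is actually run is:

```python
def denoise_fraction(total_steps: int, start_step: int) -> float:
    """Fraction of the schedule that is executed when sampling starts at start_step."""
    if not 0 <= start_step < total_steps:
        raise ValueError("start_step must be in [0, total_steps)")
    return (total_steps - start_step) / total_steps


# Total Steps = 7, Start Step = 2 leaves 5 executed steps.
print(round(denoise_fraction(7, 2), 3))  # 0.714
```

Five of seven steps is about 0.71, close to the Denoise = 0.72 used with the plain KSampler. Lower values keep more of the reference photo, and higher values give the character LoRA more freedom.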

That's it.

Happy generating :)
