Dataset Prep for Character LoRAs: How the 0.33 Identity Ratio Actually Works

Quick follow-up to the settings article — a few people asked about the dataset side specifically (thanks @edwenind220 for the comment that prompted this). This post breaks out the dataset prep step in more detail: what goes into the training set, how we structure the identity_ratio: 0.33, and the failure modes you'll see when the ratio is wrong.

Two image categories — that's the whole architecture

Every image in a character LoRA's training set falls into one of two roles:

Identity-emphasis (~33% of the set): close-up portraits, head-and-shoulders shots, clean backgrounds. The character is the entire frame. The LoRA learns "who is this person" from these.
In-context (~67% of the set): full-body shots across varied environments, poses, outfits, and scenes. The character is doing something somewhere. The LoRA learns "how this person renders in the wild" from these.

That 33/67 split is what identity_ratio: 0.33 in our spec encodes. The exact ratio isn't magic: anywhere from 0.25 to 0.4 works for most characters. What matters is that both categories are present and meaningfully sized.
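To make the arithmetic concrete, here is a minimal sketch of how a ratio turns into image counts. The function names are ours for illustration and are not part of any trainer; the 0.25 to 0.4 bounds are just the working range mentioned above.

```python
def split_counts(total_images: int, identity_ratio: float = 0.33) -> tuple[int, int]:
    """Return (identity_emphasis, in_context) image counts for a dataset size."""
    if not 0.0 <= identity_ratio <= 1.0:
        raise ValueError("identity_ratio must be between 0 and 1")
    identity = round(total_images * identity_ratio)
    return identity, total_images - identity


def ratio_in_working_range(identity: int, in_context: int,
                           low: float = 0.25, high: float = 0.40) -> bool:
    """Check whether an actual split lands in the 0.25-0.4 working range."""
    total = identity + in_context
    if total == 0:
        return False
    return low <= identity / total <= high


if __name__ == "__main__":
    for n in (30, 45, 60):
        ident, ctx = split_counts(n)
        print(f"{n} images -> {ident} identity-emphasis, {ctx} in-context")
```

For a 30-image set this gives 10 identity-emphasis and 20 in-context images.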

Why the ratio matters

A LoRA trained almost entirely on identity-emphasis shots (95% closeups, 5% full-body) develops a clear pathology: it generates great faces but loses the character's body type, outfit consistency, and pose flexibility. The LoRA "knows the face" but treats everything below the neck as a stranger. You see this in a lot of "selfie-style" character LoRAs published on Civitai: beautiful headshots, weird everything else.

The inverse — 95% in-context, 5% identity — gives you a LoRA that holds outfit/body but the face drifts toward generic anime when the prompt zooms in. Identity loss on closeups.

The 33/67 (or close to it) split tells the LoRA that the face is one set of features, the body+context is another, and both are equally important and learnable. Caption structure reinforces the separation (see below).

Captions per category

Identity-emphasis captions: heavy on face and hair tokens, light on environment and pose. Example for our character Yuki:

```
1girl, solo, yuki_nk, white_hair, very_long_hair, straight_hair,
center_parting, ice_blue_eyes, pale_skin, sharp_features,
close-up, looking_at_viewer
```

Plus the negative tokens for off-spec features (black_hair, blonde_hair, brown_eyes, green_eyes).

In-context captions: include environment, pose, outfit, but still ground in the character's identity tokens:

```
1girl, solo, yuki_nk, white_hair, very_long_hair, ice_blue_eyes,
white turtleneck sweater, gray pleated skirt, white boots,
library, sitting, reading
```

Same identity tokens. Plus environment + clothing + action. The LoRA sees yuki_nk paired with white_hair across both sets — that's the signal that says "this character is always this person."
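A quick way to catch captions that lost an identity token during hand-editing is a consistency check like the sketch below. The helper names and the choice of which tokens count as "identity" are our assumptions for this example; the two captions are the ones shown above.

```python
def parse_caption(caption: str) -> list[str]:
    """Split a comma-separated caption into stripped, non-empty tokens."""
    return [t.strip() for t in caption.replace("\n", " ").split(",") if t.strip()]


def missing_identity_tokens(captions: dict[str, str],
                            identity_tokens: list[str]) -> dict[str, list[str]]:
    """Map each caption name to the identity tokens it lacks (empty dict = all consistent)."""
    problems = {}
    for name, caption in captions.items():
        present = set(parse_caption(caption))
        missing = [t for t in identity_tokens if t not in present]
        if missing:
            problems[name] = missing
    return problems


# Tokens shared by both example captions (assumed to be the identity set).
IDENTITY_TOKENS = ["yuki_nk", "white_hair", "very_long_hair", "ice_blue_eyes"]

CAPTIONS = {
    "closeup": """1girl, solo, yuki_nk, white_hair, very_long_hair, straight_hair,
center_parting, ice_blue_eyes, pale_skin, sharp_features,
close-up, looking_at_viewer""",
    "library": """1girl, solo, yuki_nk, white_hair, very_long_hair, ice_blue_eyes,
white turtleneck sweater, gray pleated skirt, white boots,
library, sitting, reading""",
}

if __name__ == "__main__":
    print(missing_identity_tokens(CAPTIONS, IDENTITY_TOKENS))
```

An empty result means every caption carries the full identity set; anything else lists the file and the tokens to add back.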

The caption_dropout: 0.3 piece

This is the part most character LoRA recipes skip. caption_dropout: 0.3 tells the trainer to randomly drop caption tokens at each training step, about 30% of them on average.

Why it matters: without dropout, the LoRA learns the whole caption as the trigger condition. Prompt "1girl, yuki_nk, white_hair, library, reading" at inference and it works, but prompt "1girl, yuki_nk, white_hair, beach, swimming" and the LoRA struggles, because "library + reading" was deeply entangled with the rest of the identity.

With 30% dropout, the LoRA is forced to learn each token's meaning somewhat independently. So at inference, yuki_nk alone activates the character even when paired with a brand new environment + pose combo the LoRA never saw in training. That's how you get a character LoRA that "generalizes" instead of overfitting to its training scenes.
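For intuition, here is a minimal sketch of per-token dropout as described above. It is an illustration, not the trainer's actual code: real trainers differ in whether they drop individual tags or whole captions, and the `keep` parameter (tokens that are never dropped) is our own addition for the example.

```python
import random
from typing import Optional


def drop_caption_tokens(caption: str, dropout: float = 0.3,
                        keep: frozenset = frozenset(),
                        rng: Optional[random.Random] = None) -> str:
    """Drop each comma-separated token independently with probability `dropout`.

    Tokens listed in `keep` always survive. Token order is preserved.
    """
    if rng is None:
        rng = random.Random()
    tokens = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for t in tokens if t in keep or rng.random() >= dropout]
    return ", ".join(kept)


if __name__ == "__main__":
    caption = "1girl, solo, yuki_nk, white_hair, library, sitting, reading"
    rng = random.Random(0)  # seeded only so repeated runs print the same thing
    for step in range(3):
        print(step, drop_caption_tokens(caption, keep=frozenset({"yuki_nk"}), rng=rng))
```

Each step sees a different subset of the caption, so no single scene token gets to ride along with yuki_nk every time.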

Failure modes (and what they tell you)

| Symptom | Likely cause | Fix |
|---|---|---|
| Face looks right at closeup but body proportions drift | Too little in-context, too much identity-emphasis | Add more full-body shots, lower identity_ratio toward 0.25 |
| Body consistent, face turns generic on closeups | Inverse — too little identity-emphasis | Add more closeups, raise identity_ratio toward 0.4 |
| Character only renders in training environments | Caption dropout too low (or zero) | Set caption_dropout to 0.3 — that's the strongest single fix for generalization |
| Hair color flips to off-spec in some scenes | Weak negative tokens or insufficient identity-emphasis | Add explicit negatives for off-spec colors, increase identity-emphasis count |
| Character "forgets" the LoRA at low weight (<0.5) | Rank too low, or training under-converged | Bump rank to 48-64, or train more epochs |
| Outfit/body type wrong | Body tokens missing from captions, or identity-emphasis was over-curated to face-only | Make sure identity-emphasis captions include body tokens such as medium_breasts, slender_waist, tall_female |

Practical workflow

1. Generate or collect ~30-60 reference images covering the two categories.
2. Auto-caption with WD14-tagger or BLIP-2 to seed the captions.
3. Hand-edit captions to ensure identity tokens are consistent across both categories.
4. Add negative tokens to captions for off-spec features (don't just rely on the negative prompt at inference).
5. Verify the 33/67 split visually: sort images by category and count.
6. Train at rank 48 / alpha 48 / caption_dropout 0.3 (see the settings article for the full config).
7. Test at weights 0.6, 0.8, 1.0 across 3-5 prompts that vary environment significantly.
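The final test step can be laid out as a small grid of weight and scene combinations. This is a sketch with hypothetical names: the first two scenes come from the examples earlier in this post, the third is made up for illustration, and how the weight is actually applied depends on your inference UI, so the grid only records it alongside the prompt.

```python
from itertools import product


def build_test_grid(trigger: str, identity_tokens: list[str], scenes: list[str],
                    weights: tuple = (0.6, 0.8, 1.0)) -> list[dict]:
    """Return one test case per (weight, scene) pair, each with a full prompt."""
    grid = []
    for weight, scene in product(weights, scenes):
        prompt = ", ".join(["1girl", "solo", trigger, *identity_tokens, scene])
        grid.append({"weight": weight, "scene": scene, "prompt": prompt})
    return grid


SCENES = [
    "library, sitting, reading",
    "beach, swimming",
    "city street, night, walking",  # hypothetical third environment
]

if __name__ == "__main__":
    for case in build_test_grid("yuki_nk", ["white_hair", "ice_blue_eyes"], SCENES):
        print(case["weight"], "|", case["prompt"])
```

Three weights across three scenes gives nine generations to review.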

If the LoRA holds identity at 0.7+ weight across 3+ environments, ship it. Don't over-optimize the first version — the engagement signal from real users is faster feedback than another iteration round in isolation.
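If you record pass/fail judgments while reviewing test generations, the ship rule can be written down as a tiny check. This is one possible reading of the rule (an environment counts only if identity held at every tested weight of 0.7 or above), and the function name and data layout are our assumptions.

```python
def ready_to_ship(results: dict, min_weight: float = 0.7,
                  min_environments: int = 3) -> bool:
    """results maps (weight, environment) -> True if identity held in that test.

    An environment passes only if identity held at every tested weight >= min_weight.
    """
    by_env: dict = {}
    for (weight, env), held in results.items():
        if weight >= min_weight:
            by_env.setdefault(env, []).append(held)
    passing = [env for env, checks in by_env.items() if all(checks)]
    return len(passing) >= min_environments


if __name__ == "__main__":
    # Made-up review results, for illustration only.
    results = {(w, env): True for w in (0.8, 1.0) for env in ("library", "beach", "street")}
    results[(0.6, "beach")] = False  # below 0.7, so it does not block shipping
    print(ready_to_ship(results))
```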

What's next

We're rolling out the next 10 LoRAs over the coming days (Echo, Misa, Akane, Luna, Sakura, Rin, Nova, Neko, Rei, Mika) — all on this same dataset architecture. Curious whether the more visually distinct ones (Neko with cat ears, Echo with holographic hair) hold identity better than the more conventional ones (Akane, Misa) at lower weights.

If you've been training character LoRAs and have a different split that works for you, I'm interested in hearing it — drop a comment.

Full character roster + stories at neonkisu.com.