Flux Character Caption Differences - Training Diary

I wanted to do more Flux training experiments, and I got some ⚡⚡⚡ buzz donated ⚡⚡⚡ from a user to run some character experiments, so run the experiments I did!

This training still focuses on different caption types, to learn and spread more knowledge about training LoRAs with Flux. Specifically character training.

Shadowheart Flux Character LoRA

I've uploaded each trained variant as a separate version of the model. In this article, similar to my previous Flux Training captioning diary, I will talk about the different settings and my observations.


The Dataset

Since this is a very popular character, it already has several models on CivitAI. I put together a dataset using generations from those models as the base, plus a few screenshots from the game and a few pieces of fan art found online. 30 images in total were used.

  • 11 images were in anime style

  • 10 images were in a semi-realistic (2.5D) style

  • 9 images were in 3D / in-game graphics style


JoyCaption-NoTrigger

Steps: 1050
Resolution: 512
Batch Size: 2
Unet LR: 0.0005
Network Dim: 2
Network Alpha: 16
Optimizer: AdamW8Bit

This version used the recommended settings from the CivitAI Flux Training Documentation.
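
As a rough back-of-the-envelope sketch, the step count can be turned into an approximate number of passes over each image. This assumes that one step processes one full batch and that every image is sampled equally often; the function name is made up for illustration, and the actual trainer may count steps differently (for example with bucketing or partial batches).

```python
def passes_per_image(steps: int, batch_size: int, num_images: int) -> float:
    """Approximate how many times each image is seen during training.

    Assumes one step == one full batch and uniform sampling of the dataset.
    """
    samples_seen = steps * batch_size
    return samples_seen / num_images


# Values from this article: 1050 steps, batch size 2, 30 images.
print(passes_per_image(steps=1050, batch_size=2, num_images=30))  # 70.0
```

Under those assumptions, each of the 30 images is seen about 70 times over the run.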

This was trained on complex captions with very long descriptions, without using a trigger word to activate the character.

Example caption

This is a highly detailed digital illustration in fantasy art style. The subject female elf with pointed ears, fair skin, and slender, athletic build. She has long, dark hair styled high ponytail single, thick braid. Her eyes are striking green, she wears an elaborate, ornate headpiece large, green gem the center. attire form-fitting, sleeveless outfit deep neckline that reveals significant amount of cleavage, accentuating her medium-sized breasts. top made dark, glossy material gold accents form V-shape on chest. also tight, black pants highlight curvaceous hips legs.

The background serene, twilight sky gradient transitioning from blue at to soft orange near horizon, suggesting either sunrise or sunset. There faint, distant mountains few floating stars, adding mystical atmosphere. lighting even, giving image smooth, polished appearance. overall mood one mystique, focus character's confident regal presence.

Model Results

This version is versatile, but you need to use a couple of keywords to trigger the character's look. You have to describe an elf-like appearance, pointy ears, or a fantasy character. Perhaps add a crown and armor, and you'll get the entire character.

The character is also very customizable. Gender can be swapped, different poses can be achieved as well as completely different clothes and appearances.

However, the appearance must come early in the prompt, before the Shadowheart character explanation.


WD14-NoTrigger

Steps: 1050
Resolution: 512
Batch Size: 2
Unet LR: 0.0005
Network Dim: 2
Network Alpha: 16
Optimizer: AdamW8Bit

This was trained on WD14-style tag captions without using a trigger word.

Example caption

1girl, solo, long hair, breasts, blush, looking at viewer, bangs, blue eyes, brown hair, large breasts, black hair, photoshop \(medium\), shirt, gloves, dress, long sleeves, original, medium breasts, jewelry, green eyes, closed mouth, standing, ponytail, braid, upper body, short sleeves, cowboy shot, sidelocks, earrings, sky, parted lips,

Model Results

This model did not capture the character so well if you prompt it with long and complex captions. This makes a lot of sense, since the captions are in comma-separated tag-based form.

If you instead prompt it with simpler tag-based words however, the model finds the character just fine.

I noticed that even if you don't prompt for it, the model adds a bit more anime style to the outputs. This is likely because the tag-style training captions match and merge with similar captions the base model already knows, which I would guess is how some of the core Flux model's anime data was captioned. It seems like a reasonable captioning strategy.

Example of simple generation prompt that captures enough of the character:

1girl, bangs, black hair, ornament, jewelry, standing, armor, green eyes, nose scar, makeup


NoCaption-Trigger

Steps: 1050
Resolution: 512
Batch Size: 2
Unet LR: 0.0005
Network Dim: 2
Network Alpha: 16
Optimizer: AdamW8Bit

This was trained on no captions, except for using a trigger word to activate the character. I'm using the trigger word sh4d0wh34rt to make sure the model does not have any existing knowledge of the Shadowheart character, and also to make sure it doesn't bleed in the words "shadow" and "heart".

Example caption

sh4d0wh34rt

Model Results

This works really well. I can use only the prompt A realistic photo of a Sh4d0wh34rt woman and it creates a reasonable version of the character. Then using descriptive words matching her outfit or design brings out even more of the trained data.

The model is flexible when it comes to costumes and actions, while still maintaining the character design. Gender-swapping is weaker than with the JoyCaption models though.


JoyCaption-Trigger

Steps: 1050
Resolution: 1024
Batch Size: 2
Unet LR: 0.001
Network Dim: 4
Network Alpha: 32
Optimizer: AdamW8Bit

This version was trained on complex captions with very long descriptions, as well as using the trigger word sh4d0wh34rt at the start of the caption.

Example caption

A sh4d0wh34rt female character. This image is a highly detailed digital illustration depicting fantasy elf-like character with pointed ears, fair skin, and long, dark hair. The subject young woman serene expression, her eyes closed lips slightly parted. She has delicate, feminine face hint of freckles on nose cheeks. Her skin smooth flushed, adding to the ethereal, otherworldly feel artwork.

She wearing headpiece that resembles crown intricate, metallic patterns, necklace adorned beads pendants add touch attire. background textured, blend deep blues purples, creating mystical atmosphere. lighting soft moody, casting shadows highlight contours textures hair clothing.

Her right hand gently touching cheek, fingers spread, subtle reflection light nails. overall style realistic focus fine details textures, making appear lifelike immersive. artist's signature visible side image, personal piece.

Model Results

This model surprised me at first. It did not produce good outputs at all. I do not think it has anything to do with the captioning, but rather with the fact that I increased the Learning Rate to 0.001 from the default of 0.0005. This is 2x the original LR, so it makes sense that it only picks up on the broad strokes and not the details.

The costume and key character designs are there. Prompting with A fantasy artwork of a Sh4d0wh34rt woman. gives you an armored female character with pointy ears, black or gray hair, green eyes, sometimes a diadem and sometimes the right nose and mouth for the character. So it has started learning, but not at a level where it manages to capture the details.

With additional prompting you can get more out of the model. The model could be used to generate imperfect versions of the character (like realistic people cosplaying as the character). But overall, the model is not useful compared to the other versions.


WD14-Trigger

Steps: 1050
Resolution: 1024
Batch Size: 2
Unet LR: 0.00025
Network Dim: 4
Network Alpha: 32
Optimizer: AdamW8Bit

This version was trained on WD14-style comma-separated tag captions, with the trigger word sh4d0wh34rt at the start of the caption. Worth noting is that I experimented with the Learning Rate here: I used 0.00025 instead of the default 0.0005, which is 0.5x the original LR.

Example caption

a sh4d0wh34rt female character, 1girl, solo, long hair, breasts, blush, looking at viewer, smile, bangs, black hair, hair ornament, photoshop \(medium\), shirt, long sleeves, original, jewelry, green eyes, closed mouth, ponytail, braid, upper body, hairband, sidelocks, earrings, parted lips, outdoors, day, pointy ears, blunt bangs, cape, water, armor, twin braids
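
The caption variants used across these runs follow a simple pattern, which can be sketched as a small helper. This is only an illustration: the function name and the style labels are hypothetical, and the prefixes are copied from the example captions in this article rather than from any trainer API.

```python
TRIGGER = "sh4d0wh34rt"


def build_caption(style: str, body: str = "") -> str:
    """Build a training caption for one of the strategies in this article.

    `style` labels are made up for this sketch:
      - "no_trigger":          caption text only (JoyCaption or WD14 output)
      - "trigger_only":        just the trigger word, no description
      - "joycaption_trigger":  trigger sentence followed by prose
      - "wd14_trigger":        trigger phrase followed by comma-separated tags
    """
    if style == "no_trigger":
        return body
    if style == "trigger_only":
        return TRIGGER
    if style == "joycaption_trigger":
        return f"A {TRIGGER} female character. {body}".strip()
    if style == "wd14_trigger":
        return f"a {TRIGGER} female character, {body}".strip()
    raise ValueError(f"unknown caption style: {style}")


print(build_caption("trigger_only"))
print(build_caption("wd14_trigger", "1girl, solo, long hair"))
```

Keeping the trigger at the very start of every caption, as in the examples, makes it easy to check that no file in the dataset was missed.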

Model Results

The halved learning rate really shows in the model. You can see that it captures the finer details, like facial features, the scar and freckles on her face. But the armor, diadem and bigger picture details are not there.

This could be somewhat useful if you want to modify the big picture of the character but keep the details. But similar to the JoyCaption-Trigger version of the model above, I think it's just a less useful model than the first three LR 0.0005 models.

Learning Rate Note

I did also change the resolution and network dimension and alpha on the 2 "failed" models. So it could also be those factors giving us trouble here. I do however think it's all about the LR in this case.
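
To make the confounding factors explicit, here is a small sketch that diffs the settings of the two "failed" runs against the first three models. The values are copied from the run summaries above; the dictionary keys and helper names are made up for illustration. The alpha/dim ratio is included because many LoRA implementations scale the learned update by alpha divided by dim; whether the onsite trainer does exactly that is an assumption here.

```python
# Settings copied from the run summaries in this article.
RUNS = {
    "baseline (first three models)": {
        "resolution": 512, "unet_lr": 0.0005, "network_dim": 2, "network_alpha": 16,
    },
    "JoyCaption-Trigger": {
        "resolution": 1024, "unet_lr": 0.001, "network_dim": 4, "network_alpha": 32,
    },
    "WD14-Trigger": {
        "resolution": 1024, "unet_lr": 0.00025, "network_dim": 4, "network_alpha": 32,
    },
}
BASELINE = RUNS["baseline (first three models)"]


def changed_settings(base: dict, run: dict) -> dict:
    """Return {setting: (baseline_value, run_value)} for settings that differ."""
    return {key: (base[key], run[key]) for key in base if base[key] != run[key]}


def alpha_over_dim(run: dict) -> float:
    """Ratio commonly used to scale the LoRA update (assumed convention)."""
    return run["network_alpha"] / run["network_dim"]


for name, run in RUNS.items():
    print(name, changed_settings(BASELINE, run), "alpha/dim =", alpha_over_dim(run))
```

Both changed runs differ from the baseline in four settings at once, while the alpha/dim ratio stays at 8 in every run. That is consistent with the learning rate being the main suspect, but a clean test would change only one setting per run.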


Conclusion

Learning Rate matters! A lot!

Relative to the "default" of 0.0005, the JoyCaption-Trigger run used 2x the learning rate (0.001) and the WD14-Trigger run used 0.5x (0.00025).

Using the CivitAI onsite trainer, you'll get some good defaults. Use them.


Recommended training

I have two favorites from these trained versions.

Flexibility

The NoCaption-Trigger version wins this category. This model gives you the character right away, and it's the easiest to transform into something else, such as other costumes and poses.

The drawback is that you need to prompt for her original outfit to get it.

Best Captured Likeness

The JoyCaption-NoTrigger-version was the best at reproducing the character in full.

Worth noting is that I believe that using JoyCaption with a trigger would result in an even stronger model, when training on the appropriate Learning Rate.

Did you say you got some sponsored buzz?

Yes! This is how we get cool articles like this. More ⚡⚡⚡ = more cool training like this!

I don't mind if you drop a handful of buzzes with the button right below here! Press it and watch the ⚡ go up!
