I want to share one of my go-to captioning workflows - a piece of practical know-how that has saved me a lot of pain when training LoRAs.
If you’re training a LoRA, captioning isn’t optional. It’s one of the biggest levers you have for controlling what the model actually learns.
Unless you’re training a style or a pose*, the quality of those captions is absolutely critical. If your descriptions are too generic, or worse, identical across many images, you’re basically feeding the model conflicting instructions.

For example, imagine you’re training a face with 20 photos and every single caption says: “Adam, man, profile, brown hair, green eyes.” That won’t be enough. The model sees 20 different images paired with the same text, so it has no way to know which details are consistent and which should change. On the first image it learns “this is what Adam, man, profile, brown hair, green eyes looks like” and adjusts its weights; then it gets the second image with the exact same caption and has to adjust again to fit that version. Repeat that 20 times and the model is constantly tugged in different directions, trying to map one caption to multiple slightly different realities. In most cases it won’t converge into a clean identity. In my experience and real tests, it just gets confused (with a few niche exceptions).
That’s why, ideally, you want captions that are specific and varied: a unique description for each image that reflects what’s actually different - angle, lighting, expression, background, accessories, hairstyle changes, camera distance, and so on.
Captions define which tokens the model should tie to specific visual features. Incorrect or generic captions add noise and push the model toward misleading associations, resulting in weaker control and less consistent outputs. If you don’t have captions you trust, it’s usually better to use none at all: no captions can still produce better results than training on misleading text.
With a tiny dataset (say up to ~10 images), you can usually get away with writing captions by hand. It’s quick, manageable, and you stay in full control.
But once your dataset grows, manual edits turn into total hell - at least for me. Suddenly you’re spending more time fixing captions than curating images. You tweak one thing, realize it should be consistent everywhere, and now you’re stuck doing the same change across 50–200 files.
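To show why this gets painful at scale, here is a minimal sketch of the kind of batch fix you end up needing. The folder layout and the one-caption-per-.txt-file convention are assumptions (though it is the layout most LoRA trainers expect); the function name is hypothetical.

```python
# Hypothetical batch caption fixer. Assumes one .txt caption file per image,
# sitting next to the images in dataset_dir.
from pathlib import Path

def batch_replace(dataset_dir: str, old: str, new: str) -> int:
    """Replace a tag/phrase in every .txt caption file; returns files changed."""
    changed = 0
    for txt in Path(dataset_dir).glob("*.txt"):
        text = txt.read_text(encoding="utf-8")
        if old in text:
            txt.write_text(text.replace(old, new), encoding="utf-8")
            changed += 1
    return changed

# Example: unify an inconsistent hair tag across the whole set
# batch_replace("dataset/adam", "brown-hair", "brown hair")
```

Doing this by hand across 50-200 files, for every inconsistency you notice, is exactly the grind described above.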
So… what do we do instead?
Before we jump into the workflow, let’s quickly cover what we’ll be using:
COMFYUI - we will use it for captioning
RAPID TAGGER - we will use it for caption management
So - we can make captioning a lot easier by auto-captioning our images with a vision model: for example WD14, or an LLM captioner such as JoyCaption, Qwen3-VL-8B-NSFW, or ToriiGate-v0.4-7B.
WD14 produces tags (a booru-style keyword list).
JoyCaption, Qwen3-VL-8B-NSFW, and ToriiGate-v0.4-7B are LLM-based models; they are more demanding, but can generate full sentences.
Here’s the catch: when training SDXL, you generally want to feed it natural language, sentence-style captions, not pure tag soup. SDXL tends to respond better to captions that read like something you’d actually type into a prompt.
But JoyCaption isn’t perfect either. In my experience it can struggle with some NSFW topics, and it may get unsure or vague exactly where you don’t want ambiguity.
Meanwhile, WD14 is often more accurate and consistent in what it recognizes - especially for concrete visual attributes - but it gives you that tag format that isn’t ideal for SDXL if you use it “as-is”.
Solution?
Use both.
So the practical approach is: use both tools for what they’re good at, and then shape the output into captions that match your training goal (SDXL-friendly sentences, but grounded in WD14’s reliable detection).
In the past I was using JoyCaption, but it turned out to use a lot of "empty words" like "there is", "there might be", or "looks like". I found ToriiGate can describe the scene much better, without those empty phrases.
My workflow is simple: let WD14 do what it’s best at first, then use that output as guidance for the LLM.
Run WD14 first → you get a solid, fairly accurate tag list (objects, clothing, composition, some style cues).
Feed those tags into the LLM / ToriiGate → treat them as a hint / anchor. The LLM suddenly has something concrete to “hold onto,” and from my experience it becomes more consistent, especially in tricky cases (including parts of NSFW where it might otherwise get vague or dodge details).
And in the final captions I use for training, I don’t pick one or the other - I combine both:
First: a tag list (cleaned, curated, consistent)
Then: a few sentence-style descriptions (what’s happening, who/what is present, key attributes)
This way you get the best of both worlds: WD14’s precision plus SDXL-friendly natural language.
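The combined format above can be sketched as a tiny helper. The trigger word, tag list, and sentences here are all made up for illustration; the only thing taken from the workflow is the order: tags first, sentences after.

```python
# Sketch of the final caption layout: trigger word + curated WD14 tags,
# followed by sentence-style descriptions. Names are illustrative.
def build_caption(trigger: str, tags: list[str], sentences: list[str]) -> str:
    """One training caption: trigger word, tag block, then sentence block."""
    tag_part = ", ".join([trigger] + tags)
    sentence_part = " ".join(sentences)
    return f"{tag_part}. {sentence_part}"

caption = build_caption(
    "adamxyz",
    ["1boy", "profile", "brown hair", "green eyes", "low angle"],
    ["A man stands in profile under warm indoor light.",
     "He wears a dark jacket and looks to the left."],
)
```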

Instruction for JOYCAPTION:
Shortly describe the [pose of the person] using simple sentences. Shortly describe what the viewer can see. Important: NEVER mention what is not visible, describe ONLY what is displayed and clear. Write three sentences. Omit phrases like "There is", "We can see" and similar ones. Tags from this photo for your reference: [here comes the list from WD14]

The key trick is replacing [pose of the person] with a phrase that’s specific to your dataset.
Instead of a generic “pose of the person,” you hardcode what you actually care about teaching:
“pose of the girl” (character/subject-focused sets)
“relative position of their bodies” (useful for NSFW where anatomy/interaction matters)
or literally any dataset-specific target, like hands, face expression, camera angle, outfit details, etc.
That one little substitution acts like a steering wheel: it tells the LLM what to prioritize in Sentence 1, so your captions become consistent across the set.
Same idea for the WD14 tag block: before you paste the WD14 list, you can inject a short dataset context hint - something the LLM would never infer reliably on its own.
For example, you can prepend a line like:
Dataset concept: “a turtle wearing socks with a colander instead of a shell”
Even if WD14 spits out random tags that don’t capture the concept well, the LLM now has a “theme anchor” and is far more likely to keep describing the images in the direction you want.
In practice, the flow looks like this:
Add a dataset-specific instruction (the “what to focus on” phrase)
Add a dataset context hint (the “what this set is about” sentence)
Paste WD14 tags as visual grounding
Force exactly three sentences, no guessing, no filler openers
Result: captions that are consistent, SDXL-friendly, and still grounded in WD14’s accuracy.
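The four steps above can be sketched as a prompt builder. The template wording mirrors the instruction shown earlier; the function name and placeholder parameters are assumptions for illustration.

```python
# Hypothetical builder for the LLM captioning prompt: focus phrase,
# dataset context hint, and WD14 tags as grounding.
def build_prompt(focus: str, concept: str, wd14_tags: list[str]) -> str:
    return (
        f"Shortly describe the {focus} using simple sentences. "
        "Shortly describe what the viewer can see. "
        "Important: NEVER mention what is not visible, describe ONLY what is "
        "displayed and clear. Write three sentences. "
        'Omit phrases like "There is", "We can see" and similar ones. '
        f"Dataset concept: {concept}. "
        f"Tags from this photo for your reference: {', '.join(wd14_tags)}"
    )

prompt = build_prompt(
    "pose of the girl",
    "a turtle wearing socks with a colander instead of a shell",
    ["turtle", "socks", "outdoors"],
)
```

Swapping the `focus` argument per dataset is the "steering wheel" substitution described above.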
Next comes Rapid Tagger - this is where the captions stop being “auto-generated” and start becoming training-grade.
In Rapid Tagger, I usually do a few key things:
Add my trigger word (or multiple triggers, depending on the LoRA setup).
Reorder the important tags so the most relevant concepts appear early.
Go through the list and sanity-check everything: fix mistakes, add missing details, remove noise.
A very common fix for me is adding information that captioning tools often skip or underemphasize, like:
“low angle” / “high angle” (camera/viewer perspective)
clear body pose tags like “kneeling”, “sitting”, “standing”
other “dataset-critical” tags that need to be consistent across images
Rapid Tagger is also great because it lets me work both ways:
Per-image edits (fine corrections for a single picture)
Batch edits (apply consistent changes across the whole dataset)
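If you prefer scripting some of this instead of (or alongside) Rapid Tagger, the two most common batch fixes - trigger word first, important tags pulled to the front - might look like this. This sketch assumes one-line, comma-separated tag captions; it would mangle sentence-style text, so run it on the tag block only.

```python
# Script-style fallback for two Rapid Tagger batch edits:
# prepend the trigger word and move dataset-critical tags to the front.
from pathlib import Path

def prioritize(dataset_dir: str, trigger: str, important: list[str]) -> None:
    for txt in Path(dataset_dir).glob("*.txt"):
        tags = [t.strip() for t in txt.read_text(encoding="utf-8").split(",")]
        tags = [t for t in tags if t and t != trigger]  # drop dupes of trigger
        front = [t for t in important if t in tags]      # keep stated priority order
        rest = [t for t in tags if t not in front]
        txt.write_text(", ".join([trigger] + front + rest), encoding="utf-8")

# prioritize("dataset/adam", "adamxyz", ["low angle", "kneeling"])
```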
So the pipeline becomes: auto-caption fast → post-process into one clean line → then use Rapid Tagger to polish and standardize the dataset until it’s consistent enough for reliable LoRA training.
And with that, the dataset is basically ready to feed into training.
For training tools, I personally rotate between OSTRIS AI Toolkit, OneTrainer, and Kohya — pick whatever fits your workflow and how much control you want.
If you’re just starting out, I’d recommend AI Toolkit as the easiest entry point. It’s more beginner-friendly, doesn’t overwhelm you with a million knobs, and you can get solid results without needing to understand every single training parameter on day one.
Once you feel comfortable and want deeper control (or you’re chasing very specific behavior), moving to OneTrainer or Kohya makes sense - but for many people, AI Toolkit is the fastest way to go from “I have images” to “I trained a usable LoRA.”
In my experience, AI Toolkit is about 3-4x slower than Kohya, and I haven’t been able to match Kohya’s results regardless of settings. I tested on the same dataset with the same captions and tried countless configurations - over 20 training runs across two weeks, each lasting 30-300 minutes. While the AI Toolkit results were usable and I would have been satisfied with them, Kohya consistently produced better results. This is just my personal experience.
I’d love to hear your thoughts in the comments - what do you think about combining WD14 + LLM for LoRA captioning?
Share your experience: what works for you, what doesn’t, and what you’ve found to be the most reliable approach. And if there’s anything you’d like me to cover next (settings, example captions, ComfyUI nodes, NSFW-specific wording, etc.), tell me and I’ll expand the guide.
* style - pose - character
For style or pose training you do not want extra captions; good captions matter mainly for character training. I will cover this in a separate article about training.
