The caption process of dAIversity Detailer


dAIversity Detailer Training Data Write-up

SDXL LoRA - Flux LoRA - SD1.5 LoRA - Flux Checkpoint

The images for the dataset were generated with Dreamshaper XL Lightning using some random prompt wildcards (https://civitai.com/models/1095990/200k-random-prompts-wildcard-flux-sdxl).

The captions were created in a long multi-step process.

Example image:

I used Moondream v1 with 21 different prompts per image to capture details about specific aspects of the image. The answers to these 21 queries were then combined into one long description, which a language model (I think Llama 3.0) compressed down to the essentials of the image to produce a long caption.

Example of the outputs of these queries:

{
  "Art Style": "The image is a digital art piece featuring a white chihuahua dog dressed in a knight's armor, complete with a cape and a shiny golden shield. The dog is posed in a way that makes it appear as if it is a character from a video game or a work of art.",
  "Background": "A dog is depicted in the image, wearing a golden armor and a red cape. The background features a wall with a pillar, and the dog is situated on a floor.",
  "Dominant Color": "The most dominant color in this image is white, as it appears in the dog's fur, the armor, and the background. The white color is also present in the dog's eyes, which adds to the overall visual effect of the scene.",
  "Colors": "The image features a white dog dressed in a golden armor, with a red cape. The dog is wearing a golden shield, and the armor is adorned with gold designs. The colors used in the image are predominantly white for the dog's coat and golden for the armor and shield, with red accents on the cape.",
  "Element Composition": "In the image, a white dog is dressed in a golden armor, resembling a knight. The dog is sitting on a stone floor, which serves as the background. The dog is positioned in the center of the image, with its eyes and ears facing forward. The armor covers the dog's body, emphasizing its regal appearance. The stone floor provides a solid and stable surface for the dog to sit on, while the background adds context to the scene.",
  "Composition": "The image is a white dog dressed in a golden armor, sitting on a floor and looking at the camera with a regal pose.",
  "Contrast": "In the image, there is a contrast between the dog's attire and the surrounding environment. The dog is dressed in a golden armor, which is a combination of gold and red colors, and it is sitting on a stone floor. The stone floor and the dog's outfit create a visually striking contrast against each other. Additionally, the dog's white fur stands out against the golden armor, further emphasizing the contrast between the dog and its surroundings.",
  "Cultural Indicators": "In the image, the small white dog is dressed in a golden armor, which could be a representation of a medieval or fantasy-inspired theme. The armor suggests that the image might be inspired by a specific culture or historical period. Additionally, the dog is wearing a red cape, which could be another cultural indicator or a decorative element to enhance the visual appeal of the scene. The presence of a fire in the background might also be a cultural element, such as a fireplace or a symbol of warmth and comfort in the depicted setting. Overall, the combination of the dog's attire and the fire in the background contribute to the cultural context of the image.",
  "Emotions": "The dog appears to be feeling proud and confident, as it is dressed in a golden armor and standing tall with its head held high.",
  "Focal Point": "The focal point of the image is a white dog dressed in a knight's armor, sitting on a stone floor.",
  "Description": "This image features a white dog dressed in a knight's armor, complete with a shiny suit and a cape. The dog is sitting on a stone floor, looking regal and majestic. The armor is made of a shiny material that reflects light, giving the dog an impressive appearance. The dog's attire and pose make it seem like a character from a story or a work of art.",
  "Lighting": "The image features a white dog dressed in a golden armor, sitting on a stone floor. The light source is positioned behind the dog, shining on it. The light creates a spotlight effect, focusing the light on the dog and highlighting its golden armor. This creates a visually striking and dramatic scene, drawing attention to the dog's attire and making it the focal point of the image.",
  "Main Objects": "The main objects present in this image are a dog and a dog suit.",
  "Mood": "The image of a small white chihuahua dressed in a knight's armor and cape conveys a whimsical, playful, and imaginative atmosphere. The dog is dressed in a regal and majestic costume, which is not typical for dogs, and it is sitting on a fire pit, adding a touch of humor and creativity to the scene. This image can evoke feelings of joy, amusement, and a sense of wonder, as it combines elements of a beloved pet with a fantastical or fairy tale theme.",
  "Perspective": "The image is taken from a close-up perspective, focusing on the dog's face and outfit.",
  "Pose - Short": "A dog in a golden armor posing for a picture.",
  "Pose": "A dog is wearing a golden armor and a red cape.",
  "Race": "The character in the image is a white chihuahua.",
  "Setting": "In the image, a white dog is dressed in a knight's armor and cape, standing on a floor with a fire behind it. The dog appears to be in a castle-like setting, with the armor and cape giving it a regal appearance. The fire behind the dog adds to the ambiance of the scene, creating a dramatic and majestic atmosphere.",
  "Time of Day": "In the image, a small white dog is dressed in a knight's armor and standing on a stone floor. The dog is wearing a cape and a shiny golden shield, giving it a regal appearance. Although it is difficult to determine the exact time of day from the image alone, the presence of the cape and the knight's attire suggest that the photo might have been taken during a time when people dressed in medieval-style clothing were more likely to be out and about, such as during a festive occasion or a historical reenactment. However, without additional context or visible elements in the image, it is not possible to pinpoint the exact time of day.",
  "Time Era": "The image is set in a medieval-like era, as evident by the dog's golden armor and the overall appearance of the dog dressed as a knight."
}
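
The query-and-combine step that produces output like the above can be sketched roughly as follows. This is only an illustration, not the actual script: the `ask` callable stands in for whatever Moondream call you use, `fake_ask` is a hypothetical placeholder so the snippet runs without a model, and the file name and the "Title: answer" layout of the combined text are assumptions. Only three of the 21 prompts are shown.

```python
from typing import Callable, Dict

# Three of the 21 prompts, using titles and texts from the examples in this article.
PROMPTS: Dict[str, str] = {
    "Pose - Short": "Short words describe the pose of this",
    "Focal Point": "Identify and describe the focal point of the image.",
    "Main Objects": "List the main objects present in this image.",
}


def query_image(
    image_path: str,
    ask: Callable[[str, str], str],
    prompts: Dict[str, str],
) -> Dict[str, str]:
    """Run every prompt against one image and key the answers by prompt title."""
    return {title: ask(image_path, text).strip() for title, text in prompts.items()}


def combine_answers(answers: Dict[str, str]) -> str:
    """Join the per-prompt answers into one long description.

    The combined text is what would be handed to a language model for compression.
    """
    return "\n".join(f"{title}: {answer}" for title, answer in answers.items())


def fake_ask(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for the vision model; swap in a real Moondream query."""
    return f"answer to '{prompt}'"


answers = query_image("knight_dog.png", fake_ask, PROMPTS)
long_description = combine_answers(answers)
print(long_description)
```

Keeping the answers keyed by prompt title makes it easy to drop or reorder aspects before the text goes to the summarizing model.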

This was then combined with some original WD14 tags for compatibility and to create a tag + caption based approach.

Example output:

armor, looking_at_viewer, shoulder_armor, red_cape, pauldrons, animal, cape, blurry_background, no_humans, blurry, sitting, solo, furry, indoors, animal_focus, breastplate, dog, white_fur, brown_eyes, A small white chihuahua dog is dressed in a knight's armor, complete with a golden shield, standing regally on a stone floor. The background is dark and blurry, with a hint of a fireplace in the distance, suggesting a medieval-like setting. The dog's armor is adorned with gold designs, contrasting with its white fur and creating a striking visual against the dark background. Its red cape flows behind it, adding a touch of color to the image. The dog looks directly at the viewer with a confident expression, as if it's ready to embark on a heroic quest at any moment.
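
Judging from the example, the final training caption is the comma-separated tag list followed by the long caption. A minimal sketch of that merge is shown here; the function name `build_training_caption` is hypothetical, and the ", " separator is an assumption based on the example output above.

```python
from typing import Iterable


def build_training_caption(tags: Iterable[str], caption: str) -> str:
    """Return 'tag1, tag2, ..., caption' as one training caption string."""
    # Drop empty tags and stray whitespace before joining.
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    return ", ".join(cleaned + [caption.strip()])


example = build_training_caption(
    ["armor", "red_cape", "dog", "white_fur"],
    "A small white chihuahua dog is dressed in a knight's armor.",
)
print(example)
# armor, red_cape, dog, white_fur, A small white chihuahua dog is dressed in a knight's armor.
```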

This is what was used to train the dAIversity Detailer LoRA for SDXL (https://civitai.com/models/477442) and Flux (https://civitai.com/models/700837), which was also merged into the Flux dAIversity Checkpoint (https://civitai.com/models/711900).

I also used the same dataset to create the paligemma-longprompt vision model (https://github.com/MNeMoNiCuZ/paligemma-longprompt), which then generates captions in this style from scratch.

Prompts

The prompts I used with Moondream to get these results are attached (prompts.zip).

I learned a few tricks to manipulate the outputs of Moondream along the way.

  • Use simple descriptions of what you want

  • No need for correct grammar; the model is pretty daft

  • Starting with the term "short words" makes it output cleaner, shorter results without babbling on about other details

Examples:

{
  "prompt-title": "Pose - Short",
  "prompt-category": "Pose, Character",
  "prompt-text": "Short words describe the pose of this"
}
{
  "prompt-title": "Lighting",
  "prompt-category": "Description, Atmosphere",
  "prompt-text": "Explain the lighting in the image. Where is the light source, and what effects does it create?"
}
{
  "prompt-title": "Focal Point",
  "prompt-category": "Composition, Detail",
  "prompt-text": "Identify and describe the focal point of the image."
}
{
  "prompt-title": "Main Objects",
  "prompt-category": "Objects, Detail",
  "prompt-text": "List the main objects present in this image."
}
{
  "prompt-title": "Race",
  "prompt-category": "Race, Character",
  "prompt-text": "What race or species for this character, summary only"
}
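
If you want to reuse prompt definitions like these in a script, one possible way to read them is sketched below. It assumes the prompts are stored as JSON objects written back to back, as in the examples above; the actual layout inside prompts.zip may differ, and the helper name `parse_prompt_objects` is made up for this illustration.

```python
import json
from typing import Dict, List


def parse_prompt_objects(text: str) -> List[Dict[str, str]]:
    """Parse several JSON objects that are written one after another."""
    decoder = json.JSONDecoder()
    objects: List[Dict[str, str]] = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        objects.append(obj)
    return objects


sample = """
{
  "prompt-title": "Pose - Short",
  "prompt-category": "Pose, Character",
  "prompt-text": "Short words describe the pose of this"
}
{
  "prompt-title": "Main Objects",
  "prompt-category": "Objects, Detail",
  "prompt-text": "List the main objects present in this image."
}
"""

prompts = parse_prompt_objects(sample)

# Prompts that use the "short words" trick to get terse answers.
short_prompts = [
    p for p in prompts if p["prompt-text"].lower().startswith("short words")
]
print([p["prompt-title"] for p in short_prompts])
```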

Dataset

The dataset contains 1000 images.

The file is too large to attach in articles, but you'll find it attached to the models:

SDXL LoRA - Flux LoRA - SD1.5 LoRA - Flux Checkpoint
