Introduction
I’m excited to share my journey of using Flux training on tiled images of a character with the Kohya-ss GUI. My interest was sparked after stumbling upon a Reddit discussion about Flux Sigma Vision Alpha 1 (see Reddit Post). I decided to apply its approach to my own dataset, and the results were nothing short of mind-blowing. The resemblance of the trained character is nearly perfect—every detail of the skin and face, from subtle lines around the eyes to natural textures and delicate freckles, was captured with astonishing clarity. I hope this guide will help others navigate similar projects.
Methodology
1. Data Preparation and Tiling
Why Tiling?
Tiling the images allows the model to focus on smaller segments, enhancing its ability to extract detailed features. By segmenting high-resolution character images into manageable tiles, the model can capture intricate details that might otherwise be lost.
The Process
· Script Development:
I wrote a custom script to automate the tiling process. It divides each high-resolution image into smaller sections with 50% overlap between neighbouring tiles and padding at the edges, so that detail near tile boundaries is not lost.
· Unified Captioning:
To handle a large dataset, I developed a unified captioning method that keeps captions consistent across all images. This standardization is crucial for reliable training; a minimal sketch of this step is shown just below.
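For illustration, a minimal version of this captioning pass might look like the following. The folder path and caption text are placeholders, not my exact setup; the only assumption is the standard Kohya-ss convention of one .txt caption file per image.

```python
from pathlib import Path

# Assumed layout: all tiles live in one folder and share a single caption.
TILE_DIR = Path("dataset/tiles")            # hypothetical path
UNIFIED_CAPTION = "photo of mychar woman"   # hypothetical trigger phrase

# Kohya-ss reads a caption from a .txt file with the same stem as each image.
for image_path in sorted(TILE_DIR.glob("*.png")):
    caption_path = image_path.with_suffix(".txt")
    caption_path.write_text(UNIFIED_CAPTION, encoding="utf-8")
    print(f"wrote caption for {image_path.name}")
```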
Additional Considerations:
· Image Resolution & Tile Calculation:
For example, if your training resolution is 1024, each tile should be approximately 1024x1024 pixels. Keep in mind that this method multiplies the number of images in your dataset, which will in turn increase training time; a short worked example of the tile-count math follows these considerations. If your images are not high resolution, consider reducing the number of tiles accordingly.
· Aspect Ratio:
Although the script aims to produce 1:1 tiles, some may deviate slightly from that ratio. This is generally not a problem for training, and you can skip the bucketing option in Kohya.
For context, my last training session used 697 tiled images over 80 epochs, achieving a throughput of 8.25 iterations per second on an RTX 3090. If your dataset contains more images, you might consider lowering the number of epochs.
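To make the tile-count math concrete, here is a small back-of-the-envelope helper. It assumes the setup described above (1024 px tiles, 50% overlap, edge tiles padded to full size); the actual count for your dataset will depend on your source resolutions.

```python
import math

def tiles_per_image(width, height, tile=1024, overlap=0.5):
    """Estimate how many tiles one image yields with the given overlap.

    The stride is the distance between tile origins; with 50% overlap it
    is half the tile size. Edge tiles are assumed to be padded to full size.
    """
    stride = int(tile * (1 - overlap))
    cols = max(1, math.ceil((width - tile) / stride) + 1)
    rows = max(1, math.ceil((height - tile) / stride) + 1)
    return cols * rows

# Example: a 4096x3072 source image at 1024 px tiles with 50% overlap
# yields 7 columns x 5 rows = 35 tiles, so even a modest set of source
# photos quickly grows into a dataset of several hundred tiles.
print(tiles_per_image(4096, 3072))  # 35
```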
2. Flux Training Configurations
There are two training configurations available:
· Flux Dev (fp16):
Best suited for training a single character, this configuration offers faster generation times without compromising quality.
· Flux De-distill:
While training on a single character shows little quality difference compared to Flux Dev, generation time increases by nearly 50%. However, when training on multiple characters, the results are significantly better—with minimal bleeding between characters. I batch-generated 100 prompts for multiple characters with excellent outcomes.
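My generation setup isn't covered in this guide, but as one way to batch-generate a large prompt list against a trained LoRA, here is a hedged sketch using the Hugging Face diffusers FluxPipeline. The model ID, LoRA path, prompt list, and sampler settings are assumptions for the example, not a record of my actual workflow; a de-distilled checkpoint would typically be run with different guidance settings, which lines up with the longer generation times mentioned above.

```python
import torch
from diffusers import FluxPipeline

# Assumed base model and LoRA location; swap in the de-distilled checkpoint as needed.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.load_lora_weights("output", weight_name="mychar_lora.safetensors")  # hypothetical path
pipe.enable_model_cpu_offload()  # helps fit a 24 GB card such as the RTX 3090

prompts = [f"photo of mychar, scene {i}" for i in range(100)]  # placeholder prompts

for i, prompt in enumerate(prompts):
    image = pipe(
        prompt,
        height=1024,
        width=1024,
        num_inference_steps=28,
        guidance_scale=3.5,
    ).images[0]
    image.save(f"gen_{i:03d}.png")
```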
3. Automating with a Custom Script
Creating a script to automate the tiling process is key to managing large volumes of data efficiently; a simplified sketch of the core loop appears after the lists below.
Script Highlights
Automated Processing: Automatically reads high-resolution images.
Tile Division: Divides images into tiles based on specified dimensions.
Consistent Naming: Saves output with standardized naming conventions for seamless integration into the training pipeline.
Benefits
Time-Saving: Removes the need for manual tiling.
Error Reduction: Minimizes human error.
Standardization: Ensures a consistent dataset for reliable training.
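The full script is in the repository linked below, but a simplified sketch of the core tiling loop, written with Pillow, might look like this. The folder names, PNG-only input, and padding strategy are illustrative assumptions, not an excerpt of the released script.

```python
from pathlib import Path
from PIL import Image

SRC_DIR = Path("dataset/source")   # hypothetical folder of high-resolution images
OUT_DIR = Path("dataset/tiles")    # hypothetical output folder for the tiles
TILE = 1024                        # tile size, matched to the training resolution
STRIDE = TILE // 2                 # 50% overlap between neighbouring tiles

OUT_DIR.mkdir(parents=True, exist_ok=True)

for src in sorted(SRC_DIR.glob("*.png")):  # assumes PNG sources for simplicity
    img = Image.open(src).convert("RGB")
    w, h = img.size

    # Pad the right/bottom edges so every tile is exactly TILE x TILE.
    pad_w = (-(w - TILE) % STRIDE) if w > TILE else TILE - w
    pad_h = (-(h - TILE) % STRIDE) if h > TILE else TILE - h
    padded = Image.new("RGB", (w + pad_w, h + pad_h))
    padded.paste(img, (0, 0))
    w, h = padded.size

    # Slide a TILE x TILE window across the image with 50% overlap.
    for row, top in enumerate(range(0, h - TILE + 1, STRIDE)):
        for col, left in enumerate(range(0, w - TILE + 1, STRIDE)):
            tile = padded.crop((left, top, left + TILE, top + TILE))
            # Consistent naming keeps every tile traceable to its source image.
            tile.save(OUT_DIR / f"{src.stem}_r{row:02d}_c{col:02d}.png")
```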
For more details or to download the script, visit the GitHub repository.
Mind-Blowing Results
After training, the improvements were remarkable:
Enhanced Detail: The tiled approach enabled the model to capture intricate features of the character.
Consistent Quality: Using a unified caption for the dataset resulted in smoother training and more coherent outputs.
Efficiency: Automation significantly reduced preprocessing time, making the entire training pipeline more efficient.
When compared to previous experiments without tiling, the differences are undeniable. The model produced sharper details, improved texture representation, and overall higher fidelity in the output images.
Conclusion
This guide detailed my experience fine-tuning Flux using tiled images, highlighting significant improvements in image quality and training efficiency. I hope it inspires you to experiment with your own datasets and share your findings with the community.
Note: Due to my busy work schedule and personal commitments, my online activity is currently very low. I might not be able to respond to replies promptly. However, if you have any suggestions or advice for future experiments, please feel free to share them—I value your input and look forward to learning from your experiences.
Happy training!