Note: This model is based on Schnell, but it requires a (distilled) guidance scale of 3 or 5, a CFG scale of 3 or higher (CFG is a separate setting from the guidance scale), and 20 or more steps. It must be used together with clip_l_sumeshi_f1s (the 234.74 MB file in the menu on the right).

My English is terrible, so I use translation tools.

This is an experimental anime model for verifying whether de-distilling the model makes CFG work. Negative prompts work to some extent. Because this model uses CFG, generation takes about twice as long as with a regular FLUX model, even at the same number of steps. Depending on the prompt, the output can be blurry and the style inconsistent, perhaps because the model has not been trained enough.
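The doubled generation time follows from how CFG is usually computed: each step runs the network once for the prompt and once for the negative prompt, then blends the two predictions. The following is a minimal sketch of that idea, not the sampler code of any particular UI; `toy_model`, `cfg_step`, and `cfg_combine` are hypothetical names introduced only for illustration.

```python
from typing import Callable, List

Vector = List[float]


def cfg_combine(cond: Vector, uncond: Vector, cfg_scale: float) -> Vector:
    """Classifier-free guidance: start from the unconditional prediction and
    move toward (and past) the conditional one by a factor of cfg_scale."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


def cfg_step(model: Callable[[Vector, str], Vector], latent: Vector,
             prompt: str, negative_prompt: str, cfg_scale: float) -> Vector:
    """One guided prediction. Note the two model calls per step."""
    cond = model(latent, prompt)             # pass 1: positive prompt
    uncond = model(latent, negative_prompt)  # pass 2: negative prompt
    return cfg_combine(cond, uncond, cfg_scale)


calls = 0


def toy_model(latent: Vector, prompt: str) -> Vector:
    """Stand-in for the real network (assumption); counts how often it runs."""
    global calls
    calls += 1
    offset = 1.0 if prompt else 0.0
    return [x + offset for x in latent]


guided = cfg_step(toy_model, [0.0, 2.0], "1girl, solo", "", cfg_scale=7.0)
print(guided, calls)
```

With a CFG scale of 1 the blend reduces to the conditional prediction alone, which is why a scale of 1 behaves like running without CFG.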

24/09/23 update

Added v004G. This is a test model that introduces guidance to reduce blurriness in low-step outputs (around 20 steps). Blurriness in both bright and dark outputs has been reduced. Because it was trained with aggressive parameters to save time, its response to prompts has worsened. The recommended parameters have been updated, so please refer to the Usage(v004G) section. Verification pointed to the following two factors as likely causes of the blurriness, so training was reinforced in these areas.

Guidance Parameter: In v002E the guidance_in tensors were filled with zeros. For v004G they were He-initialized and then trained to some extent with full fine-tuning and with the network_args "in_dims". This made the guidance scale function. For reasons that are unclear, outputs seem to be abnormal at guidance scales other than 3 and 5.
Timesteps Sampling: Previously, discrete_flow_shift 3.2 was used, but it was suspected of causing the poor response at low step counts. Verification showed that blurriness is reduced when shift is not used and sigmoid_scale is smaller. However, insufficient training then leads to noisy backgrounds, so further exploration of hyperparameters seems necessary.
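To make the two timestep-sampling knobs concrete, here is a minimal sketch of how sigmoid sampling and the discrete flow shift are commonly formulated. It assumes t = sigmoid(sigmoid_scale * z) with z drawn from a standard normal, followed by the remap t' = shift * t / (1 + (shift - 1) * t). This is an assumption about the trainer's behavior, written in plain Python rather than taken from sd-scripts, and the function names are illustrative.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def apply_shift(t: float, shift: float) -> float:
    """Remap t in [0, 1]; shift > 1 pushes t toward 1, shift == 1 is a no-op."""
    return (t * shift) / (1.0 + (shift - 1.0) * t)


def sample_timestep(rng: random.Random, sigmoid_scale: float = 1.0,
                    shift: float = 1.0) -> float:
    """Draw t = sigmoid(sigmoid_scale * z), z ~ N(0, 1), then apply the shift."""
    t = sigmoid(sigmoid_scale * rng.gauss(0.0, 1.0))
    return apply_shift(t, shift)


def mean(samples):
    return sum(samples) / len(samples)


def spread(samples):
    m = mean(samples)
    return math.sqrt(sum((s - m) ** 2 for s in samples) / len(samples))


rng = random.Random(0)
wide = [sample_timestep(rng, sigmoid_scale=2.0) for _ in range(10_000)]
narrow = [sample_timestep(rng, sigmoid_scale=0.3) for _ in range(10_000)]
shifted = [sample_timestep(rng, shift=3.2) for _ in range(10_000)]

print(f"sigmoid_scale 2.0: mean={mean(wide):.3f} std={spread(wide):.3f}")
print(f"sigmoid_scale 0.3: mean={mean(narrow):.3f} std={spread(narrow):.3f}")
print(f"shift 3.2:         mean={mean(shifted):.3f} std={spread(shifted):.3f}")
```

Under these assumptions a smaller sigmoid_scale concentrates training on timesteps near the middle of the range, while a shift above 1 moves the samples toward t = 1.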

Usage(v004G)

resolution: same as other Flux models
(distilled) guidance scale: 3 or 5
CFG scale: 6 ~ 9, 7 recommended (a scale of 1 does not generate decent outputs)
step: 20 ~ 30 (not around 4 steps)

Usage(v002E old)

resolution: same as other Flux models
CFG scale: 3.5 ~ 7 (a scale of 1 does not generate decent outputs)
step: 20 ~ 60 (not around 4 steps)
(distilled) guidance scale: 0 (has no effect because the model is Schnell-based)
sampler: Euler
scheduler: Simple, Beta
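The recommended ranges can be written down as a small lookup table and checked before generating. This is only an illustrative sketch: the numbers come from the Usage lists, while the dictionary layout and the `check_settings` helper are hypothetical and not part of any tool.

```python
from typing import Dict, List, Tuple

# Values copied from the Usage lists; the layout itself is an assumption.
RECOMMENDED: Dict[str, Dict[str, Tuple[float, ...]]] = {
    "v004G": {"guidance": (3.0, 5.0), "cfg_range": (6.0, 9.0), "step_range": (20, 30)},
    "v002E": {"guidance": (0.0,), "cfg_range": (3.5, 7.0), "step_range": (20, 60)},
}


def check_settings(version: str, guidance: float, cfg_scale: float, steps: int) -> List[str]:
    """Return a list of warnings; an empty list means the settings are in range."""
    rec = RECOMMENDED[version]
    warnings = []
    if guidance not in rec["guidance"]:
        warnings.append(f"guidance scale {guidance} is not one of {rec['guidance']}")
    cfg_lo, cfg_hi = rec["cfg_range"]
    if not cfg_lo <= cfg_scale <= cfg_hi:
        warnings.append(f"CFG scale {cfg_scale} is outside {cfg_lo} ~ {cfg_hi}")
    step_lo, step_hi = rec["step_range"]
    if not step_lo <= steps <= step_hi:
        warnings.append(f"{steps} steps is outside {step_lo} ~ {step_hi}")
    return warnings


print(check_settings("v004G", guidance=3.0, cfg_scale=7.0, steps=24))  # no warnings
print(check_settings("v004G", guidance=3.5, cfg_scale=1.0, steps=4))   # three warnings
```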

Prompt Format ( from Kohaku-XL-Epsilon )

<1girl/1boy/1other/...>, <character>, <series>, <artists>, <general tags>, <quality tags>, <year tags>, <meta tags>, <rating tags>

Because of the small amount of training, the <character>, <series>, and <artists> tags are almost non-functional. Training also focused on girls, so the model may not generate boys or other subjects well. Because the dataset was created with hakubooru, the prompt format is the same as the KohakuXL format. However, experiments suggest that it is not strictly necessary to follow this format, as the model interprets natural language to some extent.

Special Tags

Quality tags: masterpiece, best quality, great quality, good quality, normal quality, low quality, worst quality
Rating tags: safe, sensitive, nsfw, explicit
Date tags: newest, recent, mid, early, old
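As an illustration of the tag order in the prompt format, here is a small hypothetical helper that assembles a prompt and validates the special tags. It assumes the "year tags" slot in the format is where the date tags go; `build_prompt` and its parameters are invented for this sketch.

```python
from typing import Optional, Sequence

QUALITY_TAGS = ["masterpiece", "best quality", "great quality", "good quality",
                "normal quality", "low quality", "worst quality"]
RATING_TAGS = ["safe", "sensitive", "nsfw", "explicit"]
DATE_TAGS = ["newest", "recent", "mid", "early", "old"]


def build_prompt(subject: str, general_tags: Sequence[str],
                 quality: str = "masterpiece", date: str = "newest",
                 rating: str = "safe", character: Optional[str] = None,
                 series: Optional[str] = None, artists: Sequence[str] = (),
                 meta_tags: Sequence[str] = ()) -> str:
    """Join tags in the order: subject, character, series, artists,
    general, quality, date (year), meta, rating."""
    if quality not in QUALITY_TAGS:
        raise ValueError(f"unknown quality tag: {quality}")
    if date not in DATE_TAGS:
        raise ValueError(f"unknown date tag: {date}")
    if rating not in RATING_TAGS:
        raise ValueError(f"unknown rating tag: {rating}")
    parts = [subject]
    if character:
        parts.append(character)
    if series:
        parts.append(series)
    parts.extend(artists)
    parts.extend(general_tags)
    parts.append(quality)
    parts.append(date)
    parts.extend(meta_tags)
    parts.append(rating)
    return ", ".join(parts)


prompt = build_prompt("1girl", ["solo", "long hair", "smile", "outdoors"])
print(prompt)  # 1girl, solo, long hair, smile, outdoors, masterpiece, newest, safe
```

Low-quality tags such as "worst quality" are plausible candidates for the negative prompt, though that usage is an assumption and not something stated here.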

Training

Dataset preparation
I used custom scripts based on hakubooru.
exclude tags: traditional_media, photo_(medium), scan, animated, animated_gif, lowres, non-web_source, variant_set, tall_image, duplicate, pixel-perfect_duplicate
minimum post ID: 1,000,000
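The filtering rule can be sketched as follows. This is not the hakubooru API or the custom scripts that were actually used: the post dictionaries with `id` and `tags` fields are a hypothetical stand-in, and treating the minimum post ID as inclusive is an assumption.

```python
from typing import Dict, Iterable, List

EXCLUDE_TAGS = frozenset({
    "traditional_media", "photo_(medium)", "scan", "animated", "animated_gif",
    "lowres", "non-web_source", "variant_set",
    "tall_image",  # underscore spelling assumed, matching the other tags
    "duplicate", "pixel-perfect_duplicate",
})
MIN_POST_ID = 1_000_000  # assumed to be inclusive


def keep_post(post: Dict) -> bool:
    """Keep a post only if it is new enough and carries no excluded tag."""
    if post["id"] < MIN_POST_ID:
        return False
    return EXCLUDE_TAGS.isdisjoint(post["tags"])


def filter_posts(posts: Iterable[Dict]) -> List[Dict]:
    return [p for p in posts if keep_post(p)]


sample = [
    {"id": 999_999, "tags": ["1girl", "solo"]},      # below the minimum ID
    {"id": 1_500_000, "tags": ["1girl", "lowres"]},  # has an excluded tag
    {"id": 2_000_000, "tags": ["1girl", "smile"]},   # kept
]
kept_ids = [p["id"] for p in filter_posts(sample)]
print(kept_ids)
```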
Key addition
I added zero-filled tensors under the "guidance_in" keys to the Schnell model. Their shapes match the corresponding keys in Dev, as inferred from flux/src/flux/model.py. This was necessary because the trainer did not work properly when these keys were missing unless the model name included 'schnell'. Because the tensors are all zeros, my understanding is that guidance does not function, just as in the original Schnell model. My skills are limited and I added the keys rather forcefully, so I am not sure this was the correct approach.
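Why a zero-filled guidance_in has no effect can be seen with a toy version of the layer. The sketch below assumes guidance_in is a two-layer MLP (linear, SiLU, linear) fed with a sinusoidal embedding of the guidance value, which is how the reference model.py appears to define it. The dimensions here are tiny on purpose, and every name is illustrative.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def linear(weight: Matrix, bias: List[float], x: List[float]) -> List[float]:
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def embed_guidance(guidance: float, dim: int = 8) -> List[float]:
    """Toy sinusoidal embedding of the guidance value (assumption)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.cos(guidance * f) for f in freqs] + [math.sin(guidance * f) for f in freqs]


def zero_guidance_in(in_dim: int, hidden: int) -> Dict[str, list]:
    """Zero-filled parameters with the layout of a two-layer MLP embedder."""
    return {
        "in_layer.weight": [[0.0] * in_dim for _ in range(hidden)],
        "in_layer.bias": [0.0] * hidden,
        "out_layer.weight": [[0.0] * hidden for _ in range(hidden)],
        "out_layer.bias": [0.0] * hidden,
    }


def mlp_embedder(params: Dict[str, list], x: List[float]) -> List[float]:
    hidden = [silu(h) for h in linear(params["in_layer.weight"], params["in_layer.bias"], x)]
    return linear(params["out_layer.weight"], params["out_layer.bias"], hidden)


params = zero_guidance_in(in_dim=8, hidden=4)
out_low = mlp_embedder(params, embed_guidance(3.5))
out_high = mlp_embedder(params, embed_guidance(7.0))
print(out_low, out_high)  # both are all zeros, so the guidance value is ignored
```

The output is the same zero vector for any guidance value, so nothing guidance-dependent reaches the rest of the network.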
Training
The basic assumption is that the more the model is trained, the more the network is restructured, the more the distillation is undone, and the better CFG works.
I trained on a single RTX 4090, using LoRA and merging the results into the model.
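Merging a LoRA into the base weights is usually done per layer as W' = W + multiplier * (alpha / rank) * (up @ down). The sketch below shows that formula on tiny matrices in plain Python. It is a generic illustration under that assumption, not the merge script that was actually used, and the function names are invented.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def merge_lora(weight: Matrix, lora_down: Matrix, lora_up: Matrix,
               alpha: float, multiplier: float = 1.0) -> Matrix:
    """Fold a low-rank update into a dense weight matrix.

    lora_down has shape (rank, in_features), lora_up has shape (out_features, rank).
    """
    rank = len(lora_down)
    scale = multiplier * alpha / rank
    delta = matmul(lora_up, lora_down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 2.0]]    # rank 1, in_features 2
up = [[0.5], [0.0]]    # out_features 2, rank 1
merged = merge_lora(base, down, up, alpha=1.0)
print(merged)
```

With network_dim 32 and network_alpha 32, as in the base settings, alpha / rank is 1, so the learned update is added at full strength.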
sd-scripts was used for training. The basic settings are as follows (the guidance value is set to 7, but this has no particular meaning because, as mentioned earlier, the guidance_in tensors are zero).
```
accelerate launch --num_cpu_threads_per_process 4 flux_train_network.py \
  --network_module networks.lora_flux --network_dim 32 --network_alpha 32 \
  --sdpa --gradient_checkpointing --highvram \
  --cache_latents --cache_latents_to_disk \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk \
  --cache_info --vae_batch_size 8 --max_data_loader_n_workers 1 \
  --save_model_as "safetensors" --save_precision "bf16" \
  --mixed_precision "bf16" --fp8_base --full_bf16 \
  --min_bucket_reso 320 --max_bucket_reso 1536 \
  --seed 1 --max_train_epochs 1 --keep_tokens_separator "|||" \
  --unet_lr 1e-4 --text_encoder_lr 5e-5 \
  --train_batch_size 3 --gradient_accumulation_steps 2 \
  --optimizer_type adamw8bit \
  --lr_scheduler="constant_with_warmup" --lr_warmup_steps 100 \
  --guidance_scale 7 --timestep_sampling shift --discrete_flow_shift 3.2 \
  --model_prediction_type raw --loss_type l2
```
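For orientation, the batch settings above translate into an effective batch size and a step count as sketched below. This is a rough calculation that ignores aspect-ratio bucketing and dataset repeats, both of which can change the real number of batches; the helper names are illustrative.

```python
import math


def effective_batch_size(train_batch_size: int, grad_accum: int) -> int:
    """Images contributing to each optimizer update."""
    return train_batch_size * grad_accum


def optimizer_steps(num_images: int, train_batch_size: int,
                    grad_accum: int, epochs: int = 1) -> int:
    """Approximate number of optimizer updates for a single-GPU run."""
    batches_per_epoch = math.ceil(num_images / train_batch_size)
    return math.ceil(batches_per_epoch / grad_accum) * epochs


# The 60,000-image stage uses the base settings: batch size 3, accumulation 2.
steps = optimizer_steps(60_000, train_batch_size=3, grad_accum=2)
warmup_fraction = 100 / steps  # --lr_warmup_steps 100

print(effective_batch_size(3, 2), steps, warmup_fraction)
```

Under these simplifications that stage runs about 10,000 updates with 6 images each, and the 100 warmup steps cover roughly 1% of it.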
The datasets were trained in the following order.
3,893 images (res512 bs4 / res768 bs2 / res1024 bs1, acc4) 1 epoch
60,000 images (res768 bs3 acc2) 1 epoch
36,000 images (res1024 bs1 acc3) 1 epoch
3,000 images (res1024 bs1 acc1) 1 epoch
18,000 images (res1024 bs1 acc3) 1 epoch
merged into model and CLIP_L
693 images (res1024 bs1 acc3) 1 epoch
693 images (res1024 bs1 acc3 warmup50) 1 epoch
693 images (res1024 bs1 acc3 warmup50) 10 epochs
693 images (res1024 bs1 acc3 warmup50) 15 epochs
merged into model and CLIP_L
543 images (res1024 bs1 acc3 warmup50 --optimizer_args "betas=0.9,0.95" "eps=1e-06" "weight_decay=0.1" --caption_dropout_rate 0.1 --shuffle_caption --network_train_unet_only) 20 epochs
merged into model and CLIP_L
21,000 images (res1024 bs1 acc3 warmup50 timestep_sampling sigmoid sigmoid_scale2) 15 epochs
21,000 images (res1024 bs1 acc3 warmup50 sigmoid_scale2 discrete_flow_shift3.5) 15 epochs
merged into model and CLIP_L
-this training was merged into CLIP only-
3,893 images (res1024 bs2 acc1 warmup50 unet_lr5e-5 text_encoder_lr2.5e-5 sigmoid_scale2.5 discrete_flow_shift3 --network_args "loraplus_lr_ratio=8") 3 epochs
3,893 images (res1024 bs2 acc1 warmup50 unet_lr5e-5 text_encoder_lr2.5e-5 sigmoid_scale2 discrete_flow_shift3 --network_args "loraplus_lr_ratio=8") 1 epoch
merged into CLIP_L only
--
He-initialized the "guidance_in" layer
3,893 images (Full-finetuned res1024 bs2 acc1 adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" lr5e-6 warmup50 guidance_scale3.5 max_grad_norm 0.0 timestep_sampling discrete_flow_shift 3.1582) 1 epoch
3,893 images (res1024 bs2 acc1 warmup50 guidance_scale1 timestep_sampling sigmoid sigmoid_scale 0.5 --network_args "in_dims=[8,8,8,8,8]") 4 epochs
3,893 images (res512 bs2 acc1 warmup50 guidance_scale1 timestep_sampling sigmoid sigmoid_scale 0.3 --network_args "in_dims=[8,8,8,8,8]") 12 epochs
543 images (repeats10 res512 bs4 acc1 warmup50 unet_lr3e-4 guidance_scale1 timestep_sampling sigmoid sigmoid_scale 0.3 --network_args "in_dims=[8,8,8,8,8]") 4 epochs
merged into model and CLIP_L
--v004G--
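The He initialization step mentioned in the list can be sketched as follows. Which variant was used (normal or uniform, fan-in or fan-out) is not stated, so this example assumes the common fan-in normal form with standard deviation sqrt(2 / fan_in). The matrix size is arbitrary and the function name is illustrative.

```python
import math
import random
from typing import List


def he_normal(fan_out: int, fan_in: int, rng: random.Random) -> List[List[float]]:
    """He (Kaiming) normal init: zero-mean Gaussian with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
fan_in, fan_out = 256, 64
weight = he_normal(fan_out, fan_in, rng)

values = [v for row in weight for v in row]
mean = sum(values) / len(values)
empirical_std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
expected_std = math.sqrt(2.0 / fan_in)

print(f"expected std {expected_std:.4f}, empirical std {empirical_std:.4f}")
```

Unlike the zero-filled tensors, weights drawn this way give the layer a non-zero, input-dependent output from the start, which gives subsequent training a signal to work with.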

Resources (License)

License

Apache 2.0

Acknowledgements

black-forest-labs : Thanks for publishing a great open source model.
kohya-ss : Thanks for publishing the essential training scripts and for the quick updates.
Kohaku-Blueleaf : Thanks for the extensive publication of the scripts for the dataset and the various training conditions.

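This is an experimental anime model for verifying whether CFG works once the distillation is removed. Negative prompts work to some extent. Because the model uses CFG, generation takes about twice as long as with a regular FLUX model at the same step count. Possibly because the model is undertrained, blurry output and style drift are severe with some prompts.

24/09/23: Added v004G. This is a test model that introduces guidance to suppress blurry output at low step counts (around 20 steps). Blurriness in bright and dark outputs has been reduced. Because it was trained with aggressive parameters to save time, its response to prompts has worsened. The recommended parameters have changed, so please refer to Usage(v004G). Verification suggested that the following two factors cause the blurriness, so those areas were reinforced through training.

guidance parameter: In v002E this was filled with zeros. He-initializing it and training it to some extent with FineTune and the network_args "in_dims" made the guidance scale function. The reason is unknown, but outputs seem to go wrong at values other than scale 3 and 5.
timesteps_sampling: discrete_flow_shift 3.2 was used until now, but it was suspected of worsening the response at low step counts. Verification showed that blurriness is reduced when no shift is applied and sigmoid_scale is smaller. However, this has the drawback that backgrounds become noisy when training is insufficient, so further exploration of hyperparameters seems necessary.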

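Usage

See the Usage sections above. A CFG scale of 1 does not produce decent output, so always use 3.5 or higher.

Prompt format

See the Prompt Format section above. The style is basically the same as KohakuXL, but natural language also seems to work to some extent. Because of the small amount of training, character, series, and artist tags are almost non-functional.

Special tags

See the Special Tags section above.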

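Training

Dataset preparation
The dataset was created with hakubooru. See the Training section above for the excluded tags and the range of posts used.
Key addition
Zero-filled tensors with the guidance_in keys were added to the Schnell model. Their shapes match the corresponding keys in Dev, as inferred from flux/src/flux/model.py. This was done because the trainer did not work properly without these keys when the model name did not contain "schnell". Because the tensors are zero-filled, my understanding is that guidance does not function, just as in the Schnell model. My skills are limited and the keys were added rather forcefully, so I do not know whether this was the right method.
Training
Training proceeds on the assumption that the more the model is trained, the further the network is restructured, the distillation is removed, and CFG becomes usable.
Training was done on a single RTX 4090, by training with LoRA and merging.
sd-scripts was used for training. The basic settings are as shown in the Training section above. (The guidance value is set to 7, but as stated earlier this has no particular meaning because the tensor is zero.)
See the Training section above for the detailed training conditions.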

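Resources used (and their licenses)

See the Resources (License) section above.

License

Apache 2.0

Acknowledgements

black-forest-labs : Thanks for publishing a great open source model.
kohya-ss : Thanks for publishing the essential training scripts and for the quick updates.
Kohaku-Blueleaf : Thanks for the extensive publication of the dataset scripts and the various training conditions.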