Personal reminder when re-creating the sd-scripts environment.
CUI is great. Just copy and paste.
PC SPEC
CPU: Ryzen 5 5600G -> Ryzen 3 3100
MEM: 16GB -> 32GB
M/B: B450M S2H -> TUF GAMING B550M-PLUS
GPU: Radeon RX6650XT 8GB
OS: Ubuntu Desktop 22.04.3 LTS -> Ubuntu Server 22.04.4 LTS
ROCm: 5.6.1 -> 6.0.2
Python: 3.10.12
Install sd-scripts
git clone https://github.com/kohya-ss/sd-scripts.git #get the latest version, which supports the --fp8_base option
cd sd-scripts
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip wheel
pip install -r requirements.txt
pip uninstall torch torchvision
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
pip install tensorflow-rocm
# pip install protobuf==3.20.3
# pip install lycoris_lora prodigyopt dadaptation lion-pytorch wandb
export HSA_OVERRIDE_GFX_VERSION=10.3.0 #for RDNA2
accelerate config
- Answers: This machine, No distributed training, no, no, no, all, fp16
mkdir log reg train out models #if needed
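Before moving on, it can help to confirm that the override is exported in the current shell and that the installed PyTorch is a ROCm build that sees the GPU. The following is a minimal sketch, not part of the original notes; the function names `check_gfx_override` and `check_torch_rocm` are hypothetical, and the second one assumes the venv with the ROCm PyTorch build is active.

```shell
# Hypothetical helper: fail early if the RDNA2 override is missing in this shell.
check_gfx_override() {
  if [ -z "${HSA_OVERRIDE_GFX_VERSION:-}" ]; then
    echo "HSA_OVERRIDE_GFX_VERSION is not set" >&2
    return 1
  fi
  echo "HSA_OVERRIDE_GFX_VERSION=${HSA_OVERRIDE_GFX_VERSION}"
}

# Hypothetical helper: print the PyTorch version, the HIP version it was built
# against (None on non-ROCm builds), and whether a GPU is visible.
# Assumes the venv is active so that "python" is the venv interpreter.
check_torch_rocm() {
  python -c 'import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())'
}
```

Usage: run `check_gfx_override && check_torch_rocm` inside the activated venv. If the last printed value is `False`, the GPU is not visible to PyTorch and training would fall back to CPU or fail.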
Note: if it fails to run with TypeError: ClusterConfig.__init__() got an unexpected keyword argument 'debug', the fix is:
Edit cache/huggingface/accelerate/default_config.yaml
Reference: https://github.com/bmaltais/kohya_ss/issues/1554
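The notes do not say what to change in that file. A minimal sketch of one possible edit follows, assuming the offending entry is a top-level `debug:` key (inferred from the error message, not confirmed by these notes). The function name `strip_debug_key` is hypothetical, and the path in the usage line assumes accelerate's default config location under `~/.cache`.

```shell
# Hypothetical helper: drop a top-level "debug:" key from an accelerate config.
# Assumption: that key is what triggers the ClusterConfig TypeError.
strip_debug_key() {
  cfg="$1"
  # Keep a backup next to the original before touching it.
  cp "$cfg" "$cfg.bak" || return 1
  # Write back every line except the top-level "debug:" entry.
  grep -v '^debug:' "$cfg.bak" > "$cfg" || true
}
```

Usage: `strip_debug_key ~/.cache/huggingface/accelerate/default_config.yaml`, then retry the launch. If the error persists, restore the `.bak` file and check the linked issue for the current workaround.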
Install bitsandbytes-rocm-5.6 (old)
pip uninstall bitsandbytes #If bitsandbytes is already installed
git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6.git
cd bitsandbytes-rocm-5.6
# sudo apt install libstdc++-12-dev
export ROCM_HOME=/opt/rocm
make hip ROCM_TARGET=gfx1030
CUDA_VERSION=gfx1030 python setup.py install
pip install scipy #for bitsandbytes test
python -m bitsandbytes #bitsandbytes test
Install bitsandbytes
Reference: https://github.com/TimDettmers/bitsandbytes/pull/756
pip uninstall bitsandbytes
git clone https://github.com/ROCm/bitsandbytes.git
cd bitsandbytes
git checkout c037a306e97ced3c452570132f66aac4e2964056
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -DCOMPUTE_BACKEND=hip -DAMDGPU_TARGETS=gfx1030 -S .
cmake --build . --config Release
pip install .
Install Onnx Runtime
pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-6.0.2/onnxruntime_rocm-inference-1.17.0-cp310-cp310-linux_x86_64.whl
SDXL train test sample prompt
LoRA: https://civitai.com/models/241115/asuka-aged-up-or-sdxl-lora
v0.1 (old)
# The training dataset consists of 31 images from Danbooru and Google Images.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
time accelerate launch sdxl_train_network.py \
--max_data_loader_n_workers=1 \
--persistent_data_loader_workers \
--pretrained_model_name_or_path=models/hassakuXLSfwNsfwBeta_betaV01.safetensors \
--train_data_dir=train \
--output_dir=output \
--logging_dir=log \
--save_model_as=safetensors \
--network_module=networks.lora \
--output_name=new_loraXL \
--mixed_precision=fp16 \
--save_precision=fp16 \
--save_every_n_epochs=1 \
--seed=42 \
--resolution=832 \
--train_batch_size=1 \
--max_train_epochs=20 \
--optimizer_type=adamw8bit \
--unet_lr=1e-4 \
--lr_scheduler=cosine \
--network_dim=16 \
--network_args conv_dim=8 \
--network_train_unet_only \
--cache_latents \
--cache_text_encoder_outputs \
--no_half_vae \
--gradient_checkpointing \
--sdpa \
--mem_eff_attn \
--full_fp16
6200/6200 [5:16:50<00:00, 3.07s/it, avr_loss=0.116]
The resolution limit on my GPU is 896.
With the --fp8_base option, however, training at resolution 1024 works, even with text encoder training.
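The 6200 steps in the v0.1 progress bar are consistent with the usual step count of images x repeats x epochs / batch size. The sketch below reproduces that arithmetic, assuming 10 repeats per image; the repeat count is not stated in these notes and is inferred so that the numbers match. The function name `total_steps` is hypothetical, and bucketing can change the per-epoch step count in practice.

```shell
# Hypothetical helper: estimate total optimizer steps for a run.
# $1 = number of images, $2 = repeats per image (assumed),
# $3 = epochs, $4 = batch size
total_steps() {
  # Steps per epoch are rounded up, then multiplied by the epoch count.
  echo $(( ( ($1 * $2 + $4 - 1) / $4 ) * $3 ))
}

total_steps 31 10 20 1   # prints 6200
```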
v0.61
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# When also training the text encoder, use --train_batch_size=1.
time accelerate launch sdxl_train_network.py \
--max_data_loader_n_workers=1 \
--persistent_data_loader_workers \
--pretrained_model_name_or_path=models/ponyDiffusionV6XL_v6StartWithThisOne.safetensors \
--train_data_dir=train \
--output_dir=output \
--logging_dir=log \
--save_model_as=safetensors \
--network_module=networks.lora \
--output_name=new_loraXL \
--mixed_precision=fp16 \
--save_precision=fp16 \
--save_every_n_epochs=1 \
--seed=42 \
--resolution=1024 \
--max_bucket_reso=2048 \
--bucket_no_upscale \
--train_batch_size=2 \
--max_train_epochs=10 \
--optimizer_type=Prodigy \
--learning_rate=1.0 \
--lr_scheduler=cosine \
--network_dim=8 \
--network_alpha=1 \
--optimizer_args "betas=0.9,0.99" "weight_decay=0.01" \
--max_token_length=225 \
--min_snr_gamma 5 \
--multires_noise_iterations=6 \
--multires_noise_discount=0.3 \
--network_train_unet_only \
--cache_latents \
--cache_latents_to_disk \
--cache_text_encoder_outputs \
--cache_text_encoder_outputs_to_disk \
--no_half_vae \
--gradient_checkpointing \
--sdpa \
--fp8_base
steps: 100%|█| 1950/1950 [3:29:12<00:00, 6.44s/it, Average key norm=0.925, Keys Scaled=19, avr_loss=0.08
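To compare the two runs, the elapsed time in each progress bar can be converted into seconds per step and seconds per image. The sketch below is only an illustration; the function name `sec_per_image` is hypothetical, and the runs differ in model, optimizer, and resolution, so the figures are a rough comparison rather than a benchmark.

```shell
# Hypothetical helper: seconds per processed image.
# $1 = elapsed time as H:MM:SS, $2 = steps, $3 = batch size (default 1)
sec_per_image() {
  echo "$1 $2 ${3:-1}" | awk '{
    split($1, t, ":")                       # t[1]=hours, t[2]=minutes, t[3]=seconds
    total = t[1] * 3600 + t[2] * 60 + t[3]  # elapsed time in seconds
    printf "%.2f\n", total / ($2 * $3)
  }'
}

sec_per_image 5:16:50 6200 1   # v0.1, resolution 832:   prints 3.07
sec_per_image 3:29:12 1950 1   # v0.61, per step:        prints 6.44
sec_per_image 3:29:12 1950 2   # v0.61, per image (batch 2): prints 3.22
```

With a batch size of 1 the result equals the s/it value shown in the log, which is a quick way to check that a copied progress line is intact.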