| Field | Value |
| --- | --- |
| Published | Jan 20, 2025 |
| Training | Steps: 295,000 · Epochs: 3 |
| Usage Tips | Strength: 1 |
| Hash | AutoV2 EAD76FE90E |
Welcome to Terminus Relay!
This model series is the culmination of millions of training steps burnt into the Stable Diffusion 3.5 abyss. The name Relay refers to the hand-off of the baton in the race to build a usable SD 3.5 Medium ecosystem.
Currently, a 3.5 Medium version is available in LyCORIS LoKr format. The adapter is reasonably small at 350 MiB, so it should run on most consumer hardware!
The 3.5 Medium model was selected due to the truly challenging nature of training it, and the promising potential in a 2.6B-parameter, 16-channel-VAE model. If we could get it to work, well, great things could come from this foundation!
The v1 version of Terminus Relay has roughly 55,000 steps of finetuning on very high-quality photos, cinematic still extracts, and typography images to improve the model's understanding of text.
The meme potential of this model is high!
v2 has reached 295,000 steps of finetuning on mostly high-quality images containing a lot of text (signs, handwriting on paper, etc.) and ~28k stock photos pulled from a pre-AI-era dataset (none of which had watermarks).
The dataset size was actually reduced between v1 and v2 to focus the model more on composition and prompt adherence than on direct anatomical improvements.
The v1 model may be more creative, but the v2 model is more stable. The earlier version requires a higher CFG of around 5-8, while the newer one requires a lower CFG of 2-4.
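The CFG ranges above feed the standard classifier-free guidance combination, where the conditional prediction is extrapolated away from the unconditional one. A minimal sketch in plain Python (illustrative only, not the sampler's actual tensor code):

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: move the unconditional prediction
    toward (and past) the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# At guidance 1.0 the result is exactly the conditional prediction;
# larger values (e.g. 5-8 for v1, 2-4 for v2) extrapolate further,
# trading diversity for prompt adherence.
```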
Training details
SimpleTuner configuration
This goes into `config/sd3/config.json`:
```json
{
  "--resume_from_checkpoint": "latest",
  "--quantize_via": "cpu",
  "--data_backend_config": "config/sd3/multidatabackend.json",
  "--aspect_bucket_rounding": 2,
  "--seed": 42,
  "--minimum_image_size": 0,
  "--disable_benchmark": false,
  "--output_dir": "output/sd3",
  "--lora_type": "lycoris",
  "--lycoris_config": "config/sd3/lycoris_config.json",
  "--max_train_steps": 300000,
  "--num_train_epochs": 0,
  "--checkpointing_steps": 5000,
  "--checkpoints_total_limit": 5,
  "--hub_model_id": "sd35m-photo-1mp",
  "--push_to_hub": "true",
  "--push_checkpoints_to_hub": "true",
  "--tracker_project_name": "lora-training",
  "--tracker_run_name": "sd35m-1mp",
  "--report_to": "wandb",
  "--model_type": "lora",
  "--pretrained_model_name_or_path": "stabilityai/stable-diffusion-3.5-medium",
  "--model_family": "sd3",
  "--train_batch_size": 4,
  "--gradient_checkpointing": "true",
  "--gradient_accumulation_steps": 1,
  "--caption_dropout_probability": 0.1,
  "--resolution_type": "pixel_area",
  "--skip_file_discovery": false,
  "--resolution": 1024,
  "--validation_seed": 42,
  "--validation_steps": 5000,
  "--validation_resolution": "1024x1024",
  "--validation_negative_prompt": "ugly, cropped, blurry, low-quality, mediocre average",
  "--validation_guidance": 6.0,
  "--validation_guidance_rescale": "0.0",
  "--validation_num_inference_steps": "30",
  "--validation_prompt": "A photo-realistic image of a cat",
  "--mixed_precision": "bf16",
  "--optimizer": "bnb-adamw8bit",
  "--learning_rate": "5e-5",
  "--max_grad_norm": 0.1,
  "--grad_clip_method": "value",
  "--lr_scheduler": "constant_with_warmup",
  "--lr_warmup_steps": 10000,
  "--base_model_precision": "int8-quanto",
  "--vae_batch_size": 1,
  "--validation_torch_compile": "true",
  "--validation_lycoris_strength": 1.0,
  "--webhook_config": "config/sd3/webhook.json",
  "--compress_disk_cache": "false",
  "--evaluation_type": "clip",
  "use_ema": true,
  "ema_validation": "comparison",
  "ema_update_interval": 25,
  "--delete_problematic_images": "true",
  "--disable_bucket_pruning": true,
  "--lora_rank": 128,
  "--lora_alpha": 128,
  "--flux_schedule_shift": 3,
  "--validation_prompt_library": true
}
```
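The `constant_with_warmup` scheduler paired with `--lr_warmup_steps: 10000` ramps the learning rate linearly from zero and then holds it flat at `5e-5` for the rest of training. A sketch of the multiplier:

```python
def constant_with_warmup(step, warmup_steps=10_000):
    """LR multiplier: linear ramp 0 -> 1 over `warmup_steps`, then flat at 1."""
    if step < warmup_steps:
        return step / warmup_steps
    return 1.0

# e.g. halfway through warmup the model trains at half the base LR;
# from step 10,000 to 300,000 the LR stays constant.
```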
If you wish to continue finetuning this model in particular, add `--init_lora=/path/to/file.safetensors` to your configuration.
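In the JSON config form used above, that flag would presumably appear as an extra key (assuming `--init_lora` takes the adapter path as its value):

```json
{
  "--init_lora": "/path/to/file.safetensors"
}
```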
Place the following into `config/sd3/lycoris_config.json` (the path referenced by `--lycoris_config` above):
```json
{
  "bypass_mode": true,
  "algo": "lokr",
  "multiplier": 1.0,
  "full_matrix": true,
  "linear_dim": 10000,
  "linear_alpha": 1,
  "factor": 4,
  "apply_preset": {
    "target_module": [
      "Attention",
      "FeedForward"
    ],
    "module_algo_map": {
      "FeedForward": {
        "factor": 4
      },
      "Attention": {
        "factor": 2
      }
    }
  }
}
```
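The `factor` values control how aggressively the LoKr Kronecker factorization compresses each weight matrix. A rough parameter-count sketch, assuming a square small factor and evenly divisible dimensions (the 1536 width is an assumption about SD3.5 Medium's hidden size, for illustration only):

```python
def lokr_param_count(out_dim, in_dim, factor):
    """Rough LoKr size estimate: W (out x in) ~= kron(A, B) with
    A of shape (factor, factor) and B of shape (out//factor, in//factor).
    Assumes both dims divide evenly by `factor` (illustrative only)."""
    a_params = factor * factor
    b_params = (out_dim // factor) * (in_dim // factor)
    return a_params + b_params

dense = 1536 * 1536                      # one dense projection at the assumed width
attn = lokr_param_count(1536, 1536, 2)   # Attention uses factor 2 above
ff = lokr_param_count(1536, 1536, 4)     # FeedForward uses factor 4 above
# larger factors shrink B quadratically, so FeedForward blocks end up
# much smaller than Attention blocks relative to the dense weight
```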
And for the dataset:
- 154 T5 tokens
- 77 CLIP tokens
- ~1024px-area, aspect-bucketed data
- Captions generated by CogVLM and other language models
- No particular focus on NSFW or anime; only high-quality photo data
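Those two limits mean each caption is cut to a different length per text encoder; a minimal sketch with stand-in token-id lists (the actual tokenizers are omitted):

```python
CLIP_MAX = 77   # CLIP sequence budget used for this dataset
T5_MAX = 154    # T5 sequence budget used for this dataset

def fit_caption(token_ids, limit):
    """Truncate a tokenized caption to an encoder's token budget."""
    return token_ids[:limit]

long_caption = list(range(300))             # stand-in for a long tokenized caption
clip_ids = fit_caption(long_caption, CLIP_MAX)
t5_ids = fit_caption(long_caption, T5_MAX)
# the T5 branch keeps roughly twice as much of a long caption as CLIP does
```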