CADS Operation Guidance

Condition-Annealed Diffusion Sampler (CADS, arxiv) is a sampling strategy that improves the output diversity of diffusion models (the same goal as FDG). This article is a usage guide for one of its implementations: asagi4/ComfyUI-CADS.

Params

FID score = image quality (and possibly prompt adherence), lower is better; Recall score = image diversity, higher is better

noise scale (s):
- According to the paper (Figure 7 in section 4; the same figure covers t and rescale below), 0.1 achieves the best FID score, while higher scales drastically raise (worsen) FID and moderately improve Recall at the same time.
- When s≥0.25 you may get more noisy garbage, so raising rescale is likely a must.
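The noise scale enters through the paper's noise-injection step, y_hat = sqrt(γ(t))·y + s·sqrt(1−γ(t))·n with n drawn from a standard normal distribution. Below is a minimal sketch of that formula, assuming plain Python lists stand in for the conditioning tensors; the helper name cads_noise is hypothetical and is not the node's API.

```python
import math
import random
import statistics


def cads_noise(y, gamma, s, rng):
    """Return sqrt(gamma) * y + s * sqrt(1 - gamma) * n with n ~ N(0, 1).

    gamma = 1 returns y unchanged; gamma = 0 leaves only noise whose
    standard deviation is the noise scale s.
    """
    keep = math.sqrt(gamma)               # weight on the original condition
    spread = s * math.sqrt(1.0 - gamma)   # weight on the injected noise
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in y]


rng = random.Random(0)
print(cads_noise([0.5, -1.2, 2.0, 0.0], gamma=1.0, s=0.25, rng=rng))  # unchanged
# With gamma = 0 the condition is pure noise, and s alone sets its spread.
for s in (0.1, 0.25, 0.5):
    noisy = cads_noise([0.0] * 10_000, gamma=0.0, s=s, rng=rng)
    print(s, round(statistics.pstdev(noisy), 3))
```

The printed spreads track s, which is why larger noise scales push the condition further from anything the model saw during training.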
t1/t2 (τ1/τ2): where the annealing function ends (τ1) and begins (τ2)
- The de-noise (image generation) progress that CADS sees, t, runs from 1 down to 0
  (inside the node, the initial value of t depends on start step and total steps, see below).
  Before t reaches τ2, the condition is fully replaced by noise, so the prompt has effectively no influence (similar to unconditional guidance, CFG=1);
  between τ2 and τ1 the condition is blended back in linearly;
  once t reaches τ1, normal guidance is used.
  If τ1>τ2 there is no ramp: the condition stays fully noised until t reaches τ1 and then switches to normal guidance at once.
  The related formula is Eq. (2) in section 3.1.
- According to the paper, FID is best when τ1 is around 0.6 and gradually gets worse as τ1 is raised. Recall peaks when τ1 is at 0.2, but FID is noticeably worse at that point.
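The schedule described above is the piecewise-linear annealing coefficient γ(t) of Eq. (2). The sketch below assumes the checks are evaluated in this order (t≤τ1 first, then t≥τ2), which is what turns τ1>τ2 into a hard switch at τ1; the function name cads_gamma is hypothetical.

```python
def cads_gamma(t, tau1, tau2):
    """Annealing coefficient gamma(t); t runs from 1 (first step) to 0 (last step).

    gamma = 0: condition fully replaced by noise.
    gamma = 1: condition used as-is (normal guidance).
    """
    if t <= tau1:
        return 1.0  # late in sampling: normal guidance
    if t >= tau2:
        return 0.0  # early in sampling: pure noise
    # Linear ramp from 0 at t = tau2 up to 1 at t = tau1 (only reached when tau1 < tau2).
    return (tau2 - t) / (tau2 - tau1)


# tau1 = 0.6, tau2 = 0.9: noise until t = 0.9, ramp until t = 0.6, then normal guidance.
for t in (1.0, 0.9, 0.75, 0.6, 0.3):
    print(t, round(cads_gamma(t, tau1=0.6, tau2=0.9), 3))

# tau1 > tau2: no ramp, gamma jumps from 0 to 1 when t reaches tau1.
print([cads_gamma(t, tau1=0.6, tau2=0.4) for t in (0.8, 0.61, 0.6, 0.2)])
```

Lowering τ1 keeps γ below 1 for more of the run, which is consistent with the paper's higher Recall and worse FID at τ1=0.2.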
rescale (ψ): normalization factor
- According to the paper, 1.0 gives the best FID and 0.0 the best Recall.
  ψ only controls how much of the normalized condition is blended in; CADS still adds noise when it is set to 0 or 1, so don't read either end as "off".
  Values near 1 are safer and reduce noisy output.
- Its usage is given by Eq. (4) in section 3.1.
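Eq. (4) standardizes the noisy condition, restores the mean and standard deviation of the original condition, and then mixes that result with the un-rescaled version using ψ. A minimal sketch on plain lists, assuming population statistics over the whole vector; the helper name cads_rescale is hypothetical and the real node works on tensors.

```python
import statistics


def cads_rescale(y_noisy, y_orig, psi):
    """Blend psi * rescaled + (1 - psi) * noisy, where `rescaled` is y_noisy
    shifted and scaled to the mean/std of the original condition y_orig."""
    mu_in, sd_in = statistics.fmean(y_orig), statistics.pstdev(y_orig)
    mu, sd = statistics.fmean(y_noisy), statistics.pstdev(y_noisy)
    if sd == 0.0:
        return list(y_noisy)  # constant vector: nothing to standardize
    rescaled = [(v - mu) / sd * sd_in + mu_in for v in y_noisy]
    return [psi * r + (1.0 - psi) * v for r, v in zip(rescaled, y_noisy)]


y_orig = [1.0, 2.0, 3.0, 4.0]
y_noisy = [10.0, 0.0, -5.0, 7.0]  # e.g. after heavy noise injection
full = cads_rescale(y_noisy, y_orig, psi=1.0)
print(statistics.fmean(full), statistics.pstdev(full))  # ~ original mean and std
print(cads_rescale(y_noisy, y_orig, psi=0.0) == y_noisy)  # True: psi = 0 keeps the noisy condition
```

With ψ=1 the condition's overall statistics match the clean prompt again, which is why values near 1 tame noisy output; with ψ=0 the noise passes through untouched.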
start step/total steps: override the t value (de-noise progress) that CADS sees
- t starts from 1-(start_step/total_steps) and decreases toward 0.
- If start_step is not lower than total_steps, the sampler's own timestep is used and both start_step and total_steps are ignored.
- Results stay clean as long as normal guidance takes over by the end of sampling:
  make sure 1-(start_step+steps)/total_steps<<τ1, where steps is the value set in nodes like KSampler, or noisy garbage will show up in your output.
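The arithmetic above can be checked before a run. A minimal sketch, assuming t is computed per sampler step as 1-(start_step+i)/total_steps and clamped to [0, 1]; both helper names and the margin used to approximate "<<" are hypothetical choices, and the fallback to the sampler's own timestep (start_step ≥ total_steps) is not modelled.

```python
def cads_t_schedule(start_step, steps, total_steps):
    """t seen by CADS at each of the `steps` sampler steps (i = 0 .. steps-1)."""
    return [min(1.0, max(0.0, 1.0 - (start_step + i) / total_steps))
            for i in range(steps)]


def ends_with_normal_guidance(start_step, steps, total_steps, tau1, margin=0.1):
    """Rule of thumb from this guide: 1-(start_step+steps)/total_steps << tau1.

    `margin` is an arbitrary stand-in for "much smaller than".
    """
    t_end = 1.0 - (start_step + steps) / total_steps
    return t_end < tau1 - margin


print(cads_t_schedule(0, 4, 4))                        # [1.0, 0.75, 0.5, 0.25]
print(ends_with_normal_guidance(0, 20, 20, tau1=0.6))  # t ends at 0: safe
print(ends_with_normal_guidance(0, 10, 30, tau1=0.6))  # t stops near 0.67: noise risk
```

In the second case the run stops while the condition is still heavily noised, which is exactly the "noisy garbage" situation described above.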

apply to:
- both/cond/uncond
  - In ComfyUI, prompts are passed around as "CONDITIONING", most of the time produced by the "CLIP Text Encode (Prompt)" node. The CONDITIONING connected to the sampler's positive input is treated as "cond", and the one connected to negative as "uncond".
    In a regular txt2img process both cond and uncond are used. If you leave the prompt empty, only uncond will get applied.
- cond modifies higher sigmas (composition) more, while uncond focuses on lower sigmas (details).
key:
- both/y/c_crossattn
  - 'c_crossattn' is the per-token text embedding created by the text encoder and consumed by cross attention inside the diffusion model. 'y' is the pooled vector conditioning used by some models (e.g. SDXL's pooled text embedding plus size embeddings), which is added to the timestep embedding instead of going through cross attention.
- When using c_crossattn with apply_to=cond, the added diversity is more likely to destroy prompt alignment.
  When using c_crossattn with apply_to=uncond, prompt alignment is okay, but a rescale close to 1 makes the model generate darker images unless uniform noise is chosen. (With uniform/exponential noise in this setup, a high rescale is a necessity unless you want high saturation.)
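To make the apply_to and key switches concrete, here is a sketch of how they could select which conditioning tensors get noised. It assumes a hypothetical layout where each branch ("cond", "uncond") is a dict holding "c_crossattn" and, for models that have it, "y"; this reuses the names above but is not ComfyUI's actual internal structure. The noise itself follows the paper's injection formula.

```python
import math
import random

# Hypothetical mapping from the node's options to branches / conditioning keys.
BRANCHES = {"cond": ("cond",), "uncond": ("uncond",), "both": ("cond", "uncond")}
KEYS = {"y": ("y",), "c_crossattn": ("c_crossattn",), "both": ("y", "c_crossattn")}


def apply_cads(conds, apply_to, key, gamma, s, rng):
    """Return a copy of `conds` with the selected branch/key pairs noised."""
    out = {branch: dict(tensors) for branch, tensors in conds.items()}
    for branch in BRANCHES[apply_to]:
        for k in KEYS[key]:
            y = out[branch].get(k)
            if y is None:
                continue  # e.g. a model without pooled 'y' conditioning
            out[branch][k] = [math.sqrt(gamma) * v
                              + s * math.sqrt(1.0 - gamma) * rng.gauss(0.0, 1.0)
                              for v in y]
    return out


conds = {
    "cond": {"c_crossattn": [0.3, -0.7], "y": [1.0]},
    "uncond": {"c_crossattn": [0.0, 0.1]},  # no 'y' in this branch
}
noised = apply_cads(conds, apply_to="uncond", key="both", gamma=0.0, s=1.0,
                    rng=random.Random(0))
print(noised["cond"] == conds["cond"])  # True: the cond branch is untouched
print(noised["uncond"])
```

With apply_to=uncond the positive prompt reaches the model intact, which matches the note that prompt alignment holds up in that setting.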
noise type:
- normal/uniform/exponential
  - normal is the safest option
- It determines what type of noise will be used inside CADS.
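The three options differ mainly in the shape and offset of the injected noise. A sketch, assuming the node draws standard normal, uniform on [0, 1), and rate-1 exponential samples (the usual defaults of torch.randn_like, torch.rand_like and Tensor.exponential_); the exact parametrization in asagi4's node may differ, and sample_noise is a hypothetical name.

```python
import random
import statistics


def sample_noise(noise_type, n, rng):
    """Draw n noise values of the given type (assumed parametrizations)."""
    if noise_type == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]   # mean 0, std 1
    if noise_type == "uniform":
        return [rng.random() for _ in range(n)]          # in [0, 1), mean 0.5
    if noise_type == "exponential":
        return [rng.expovariate(1.0) for _ in range(n)]  # >= 0, mean 1
    raise ValueError(f"unknown noise_type: {noise_type!r}")


rng = random.Random(0)
for kind in ("normal", "uniform", "exponential"):
    xs = sample_noise(kind, 20_000, rng)
    print(kind, round(statistics.fmean(xs), 2), round(statistics.pstdev(xs), 2))
```

Under this assumption, uniform and exponential noise are non-negative and shift the condition's mean as well as its spread, which may explain why they pair with a high rescale in the key notes above.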

Suggestions

If your images get destroyed by noise (sorted by priority):
- decrease noise scale
- increase t2
- increase rescale
- use a linear timestep (set total_steps higher than start_step)
- use a safer apply_to/key/noise_type combination:
  any+c_crossattn+normal, uncond+both+normal,
  both+c_crossattn+uniform/exponential
If image quality is stable enough but lacks diversity (sorted by priority):
- decrease rescale
- decrease t2
- increase noise scale
- switch to a riskier apply_to/key/noise_type combination
- experiment with start_step/total_steps
- decrease t1
If you need more prompt alignment:
- set t1 closer to 0.6, and try to find the best t1 value for your model.
- keep noise scale close to 0.1, or raise rescale.
- set apply_to to uncond.