Analyzing and Addressing Artifacts in Web-UI with XL-models

Around the time animagine started to appear, there were sporadic reports that certain prompts would produce artifacts. When you hear "artifact," you might think of ancient weapons, but in this context an artifact is an image that is not pure noise, yet is clearly off in some way.

For those who just want the solution without the details, use this.

https://github.com/hako-mikan/sd-webui-prevent-artifact

There were reports that artifacts could appear when the order or intensity of the input prompts was changed, but the conditions for triggering them were unclear, making the whole thing feel like grasping at thin air. Recently, however, the cause has reportedly been pinpointed.

Artifacts occur in the Web-UI family but not in Comfy and the like. Since similar artifacts appear when the intensity is raised, a bug in the Web-UI's intensity calculation was suspected.

On one hand this seems plausible, but it also feels strange. If there were a bug in the Web-UI's intensity calculation, it should have been noticed long ago, yet the reports only started after animagine became popular. As for how intensity works: writing (mikan:1.3) multiplies the mikan tensor by 1.3. But for these calculations to stack up to an intensity like 10 would be odd. Even with a nested prompt like (kotatu,(mikan:1.3), cat:1.5), the combined intensity never exceeds 2, so intensity alone shouldn't break anything.
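For reference, here is the arithmetic for that example, assuming the Web-UI's multiplicative nesting rule (each token's weight is the product of all parentheses enclosing it):

weights = {
    "kotatu": 1.5,        # outer (:1.5) only
    "mikan":  1.3 * 1.5,  # inner (:1.3) times outer (:1.5) = 1.95
    "cat":    1.5,        # outer (:1.5) only
}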

There were also reports that artifacts could appear just by changing the order of prompts, suggesting that intensity alone does not explain the phenomenon.

What could be the cause? I decided to go on an adventure through the code. The Web-UI code that processes prompts is quite intricate, bouncing back and forth between several places.

First, the entered prompts are divided and processed according to their intensity.

For a prompt like

a girl in (kotatu:1.3) eating mikan

it would be divided as follows:

a girl in : 1.0
kotatu : 1.3
eating mikan : 1.0
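For illustration, here is a minimal sketch of this splitting step (a simplified reimplementation; the real parser in modules/prompt_parser.py also handles nesting, square brackets, and escape sequences):

import re

def split_weighted_prompt(prompt: str):
    # Split "(text:weight)" chunks out of the prompt; everything else
    # keeps the default weight of 1.0. Simplified: no nesting support.
    parts = []
    pos = 0
    for m in re.finditer(r"\(([^():]+):([0-9.]+)\)", prompt):
        if m.start() > pos:
            parts.append((prompt[pos:m.start()].strip(), 1.0))
        parts.append((m.group(1).strip(), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):
        parts.append((prompt[pos:].strip(), 1.0))
    return parts

print(split_weighted_prompt("a girl in (kotatu:1.3) eating mikan"))
# [('a girl in', 1.0), ('kotatu', 1.3), ('eating mikan', 1.0)]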

The intensity calculation happens here, but it seemed there was no room for bugs.

Next, the prompts are converted into tokens, passed through the text encoder (transformers), and become tensors. The suspicious part was the code surrounding the multiplication of these intensities. Below is that code, from sd_hijack_clip.py in the modules folder, with an easy-to-follow explanation by GPT-chan.

def process_tokens(self, remade_batch_tokens, batch_multipliers):
    tokens = torch.asarray(remade_batch_tokens).to(devices.device)
    # Convert remade_batch_tokens to a PyTorch array and move to the specified device
    if self.id_end != self.id_pad:
        for batch_pos in range(len(remade_batch_tokens)):
            index = remade_batch_tokens[batch_pos].index(self.id_end)
            tokens[batch_pos, index+1:tokens.shape[1]] = self.id_pad
    # Distinguish between SD1 and SD2 handling. In SD1, padding and text end tokens are the same, but they differ in SD2.
    # This block fills positions after the text end token with padding tokens in each batch.

    z = self.encode_with_transformers(tokens)
    # Encode the tokens with a Transformer model

    pooled = getattr(z, 'pooled', None)
    # Get the pooled representation from the encoded output. Return None if no pooled representation exists.

    batch_multipliers = torch.asarray(batch_multipliers).to(devices.device)
    original_mean = z.mean()
    # Calculate the average value of the encoded output and keep it as the original mean

    z = z * batch_multipliers.reshape(batch_multipliers.shape + (1,)).expand(z.shape)
    # Apply multipliers for each batch element. This modifies the encoded representation.

    new_mean = z.mean()
    # Calculate the average value of the modified encoded output

    z = z * (original_mean / new_mean)
    # Adjust with the new mean to restore the original average value. This aims to prevent artifacts.

    if pooled is not None:
        z.pooled = pooled
    # Reassign the pooled representation to the modified encoded output if it exists.

    return z
    # Return the adjusted encoded output

The suspicious part is:

z = z * (original_mean / new_mean)

In programming, division always has the potential to produce unexpected outputs. If new_mean is much smaller in magnitude than original_mean, the ratio blows up and the tensor's intensity can increase unexpectedly.

Judging from the comments in the original code, this rescaling was itself added to prevent artifacts. However, it predates the release of the XL models, so it may simply not be compatible with XL.

original_mean and new_mean are the tensor's mean before and after the intensity calculation, respectively. Intensities are typically a bit above 1, so the ratio (original_mean / new_mean) should land slightly below 1 and never stray far from it. But perhaps animagine has tokens whose tensors swing strongly negative; when such tensors are mixed and the intensity is raised, the already-small mean shrinks toward zero, the ratio (original_mean / new_mean) balloons, and the image breaks. Alternatively, the intensity calculation could even flip the signs of original_mean and new_mean.
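A toy example with made-up numbers (not real CLIP embeddings) shows both failure modes:

import torch

# A tensor whose mean is small because positive and negative values
# nearly cancel -- made-up numbers, not real embeddings.
z = torch.tensor([1.3, -1.0])
original_mean = z.mean()                 # 0.15

multipliers = torch.tensor([1.0, 1.29])  # emphasize the negative "token"
new_z = z * multipliers                  # [1.30, -1.29]
new_mean = new_z.mean()                  # 0.005

print(original_mean / new_mean)          # ~30: the whole tensor gets scaled 30x
# With a multiplier of 1.31 instead, new_mean becomes -0.005 and the
# ratio is ~-30: the sign flips on top of the blow-up.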

To test this hypothesis, I tried to reproduce the problem. Rather than hunting for suspicious tokens at random, I computed the sign of the mean of the tensor generated from every token. It turned out that a token's tensor mean can indeed be either positive or negative.

The token "eni" topped the negative ranking, and "1" topped the positive one. What even is "eni"?
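A scan along these lines can be written with transformers directly. This is a simplified sketch: SDXL actually pairs two text encoders, and the checkpoint name here is just an example.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

# Encode every vocabulary token on its own and record the mean of the
# resulting hidden states. Slow: one forward pass per token (~49k).
means = {}
with torch.no_grad():
    for token, token_id in tokenizer.get_vocab().items():
        ids = torch.tensor([[tokenizer.bos_token_id, token_id, tokenizer.eos_token_id]])
        means[token] = model(ids).last_hidden_state.mean().item()

ranked = sorted(means.items(), key=lambda kv: kv[1])
print(ranked[:5])   # most negative tensor means
print(ranked[-5:])  # most positive tensor means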

Combining these, the problem could be reproduced: the prompt "1 2 am:1.3 eni" produces a corrupted image. Looking at the computed values, original_mean is -0.0003 and new_mean is -7.8293e-05, so the multiplier becomes 3.83 and generation breaks. Raising the intensity above 1.3 even inverts positive and negative. This combination of prompts sits exactly on the boundary between positive and negative; changing the prompt even slightly prevents the corruption, which is consistent with the earlier reports.
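The arithmetic checks out:

original_mean = -3.0e-4
new_mean = -7.8293e-05
print(original_mean / new_mean)  # ~3.83: every element of z is scaled almost 4x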

Therefore, the following code should be disabled:

z = z * (original_mean / new_mean)
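A minimal sketch of what disabling it amounts to, patching process_tokens in place (the published extension below may well implement this differently; the class name is taken from sd_hijack_clip.py):

import torch
from modules import devices, sd_hijack_clip

def process_tokens_no_rescale(self, remade_batch_tokens, batch_multipliers):
    # Identical to the original, minus the (original_mean / new_mean) rescale.
    tokens = torch.asarray(remade_batch_tokens).to(devices.device)
    if self.id_end != self.id_pad:
        for batch_pos in range(len(remade_batch_tokens)):
            index = remade_batch_tokens[batch_pos].index(self.id_end)
            tokens[batch_pos, index+1:tokens.shape[1]] = self.id_pad

    z = self.encode_with_transformers(tokens)
    pooled = getattr(z, 'pooled', None)

    batch_multipliers = torch.asarray(batch_multipliers).to(devices.device)
    z = z * batch_multipliers.reshape(batch_multipliers.shape + (1,)).expand(z.shape)

    if pooled is not None:
        z.pooled = pooled
    return z

sd_hijack_clip.FrozenCLIPEmbedderWithCustomWordsBase.process_tokens = process_tokens_no_rescale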

Disabling it resolved the issue as expected, with no further problems. Rather than waiting for fixes to land in A1111, Forge, and the others, I made an extension that does this. As usual, install it and turn the option on from the Settings screen to disable the rescaling.

https://github.com/hako-mikan/sd-webui-prevent-artifact
