SD3.5: separate prompts for clip models

I'm not an expert, I'm just experimenting.
Translated in English with online translator. Sorry. Lazy.

Мы знаем, что SD3.5 использует три модели кодировщиков: clip_l, clip_g и t5, в отличие от Flux, который использует только clip_l и t5. Все эти модели воспринимают промпты по-разному.
Есть узел ClipTextEncodeSD3, в котором можно раздельно задать промпты для всех трёх кодировщиков. Я поэкспериментировала с этим узлом, и хочу поделиться результатами.
Модель, которую я выбрала - конечно из-за её скорости - SD3.5 Medium. Разделяю только позитивный промпт, негативный - общий.

Помним, что clip_g и t5 лучше понимают естественный язык, clip_l работает с тегами.
Когда мы используем простой ClipTextEncode, мы для всех трёх кодировщиков даём одинаковый промпт.
Для начала я взяла самый простой, замок в долине. (В негативы Neuschwanstein, знаменитый замок в Германии, я добавила потому, что если этого не сделать, SD рисовать будет только этот замок. В принципе, он и так рисовал его в простых промптах, несмотря на запрет. Так же добавила Гауди, потому что его стиль уникальный, и это не то, что мне сейчас нужно. )

We know that SD3.5 uses three encoder models: clip_l, clip_g and t5, unlike Flux, which uses only clip_l and t5. All these models perceive prompts in different ways. There is a ClipTextEncodeSD3 node in which you can set prompt separately for all three encoders. I've experimented with this node, and I want to share the results. The model I chose, of course because of its speed, is SD3.5 Medium. I separate only the positive prompt, the negative one is general.

Remember that clip_g and t5 understand natural language better, clip_l works with tags.
When we use a simple ClipTextEncode, we give the same prompt for all three encoders.
To begin with, I took the simplest castle in the valley. (I added Neuschwanstein, the famous castle in Germany, to the negatives because if this is not done, then only this castle will be painted. Basically, he was already painting it in simple designs, despite the ban. I also added Gaudi, because his style is unique, and that's not what I need right now.)

Prompt:
Positive: "White magic castle in dreamy valley"

Negative: "Ugly, bad quality, low quality, low resolution, draft, deformed, unprofessional, unrealistic architecture, blur, blurry, vague, Gaudi, Neuschwanstein"

Кодировщик t5 лучше всех понимает сюжет промпта, даёт композицию, объекты. На runcomfy.com можно узнать, что clip_l это локальный вектор условий, имеет меньший охват, но более высокую детализацию. clip_g Это глобальный вектор условий, может влиять на общую структуру и композицию. По ощущениям, clip_l добавляет чёткости и насыщенности, небольшой мультяшности, прорабатывает детали, clip_g - как будто работает над реалистичностью, меняет общую картинку.
В любом случае, сюжет даёт t5.

The t5 encoder understands the plot best of all, provides composition and objects. On runcomfy.com you can find out that clip_l is a local vector of conditions, has less coverage, but higher detail. clip_g is a global vector of conditions that can affect the overall structure and composition. It feels like clip_l adds clarity and saturation, a little cartoony, works out the details, clip_g seems to be working on realism, changing the overall picture. In any case, t5 gives plot.

Prompt:
"White magic castle in dreamy valley"

Добавление деталей в композицию лучше всего отрабатывает t5:

Adding details to a composition works best with t5:

Prompt:
"White magic castle in dreamy valley, knights in valley"

На третьей картинки появились бледненькие фигурки человечков. Но лучше всего их отработают все три кодировщика сразу:
Pale figures of little men appeared in the third picture. But best of all, all three encoders will work them out at once:

Более общие параметры лучше отрабатывает clip_g.
clip_g performs better on more general parameters.

Prompt:
"White magic castle in dreamy valley, pixel art"

Пиксель-арт получился только у clip_g. Но, если вписать pixel art и в поле clip_l, и в поле t5, то всё получится. Но хуже, чем у clip_g:
Only clip_g turned out to be pixel art. However, if you enter pixel art in both the clip_l field and the t5 field, then everything will work out. But worse than clip_g:

Запрос "красная крыша" не понял ни один кодировщик, но clip_g добросовестно наделал красного:
The query "red roof" was not understood by any encoder, but clip_g faithfully made red:

Все вместе они тоже до конца не понимают:

Together, they don't fully understand either:

Однако, чёрно-белая картинка лучше всего вышла у t5. Но clip_g хотя бы пытался.
However, the black-and-white picture came out best with the t5. But at least clip_g tried.

Prompt:
"White magic castle in dreamy valley, black and white"

А вот что у них получится вместе:

But here's what they'll do together:

Для чего же нам вообще нужно разбивать промпт по кодировщикам?
Это влияет на результат. Позволяет усилить некоторые параметры, изменить что-то в картинке.

Возьмём более сложный промпт. Если его скормить всем трём кодировщикам, получится так:

Why do we even need to split prompt by encoders? This affects the result. It allows you to enhance some parameters, change something in the picture. Let's take a more complex prompt. If you feed it to all three encoders, it looks like this:

Prompt: "(White magic castle in dreamy valley:1.4). White-stone tall spires rise above mountains, mountain valley and many streams and ponds. Sunlight plays in the colorful stained glass windows. God rays knock out the sunbeams on the snow-white walls. The valley blooms with a carpet of meadow flowers, quiet ponds are covered with water lilies, tiny waterfalls break the water surface and the rays of the sun play a rainbow in the foam of the waterfalls. Iridescent butterflies and birds flutter over the flowers. The quiet, cozy valley is surrounded by impassable cliffs where dragons nest.

Fantasy, charming captivating enchanting atmosphere,

Bright, clear details, architectural masterpiece, panoramic landscape, High Angle Shot"

clip_l и clip_g в одиночку я показывать не буду, это не сильно отличается от уже выложенного короткого промпта. Ниже - то, как т5 ведёт себя один, и как он отрабатывает в паре.
I will not show clip_l and clip_g alone, it is not much different from the already posted short prompt. Below is how the t5 behaves alone, and how it performs in pairs.

Если, например, t5 дать полный промпт, а clip_l скормить короткий, то получится вот так:
If, for example, t5 is given a full prompt, and clip_l is fed a short one, then it turns out like this:

Prompt clip_l: "White magic castle in dreamy valley"
Prompt clip_g empty

Prompt t5: complex prompt from the previous picture

Стоит основную часть промпта вписывать в поле t5. Если, например, дать длинный промпт clip_g, а короткий - clip_l, получится так:

It is worth entering the main part of the product in the t5 field. For example, if you give a long clip_g prompt and a short clip_l prompt, it looks like this:

Длинный промпт для t5, короткие - для clip_g и clip_l:
Long prompt for t5, short ones for clip_g and clip_l:

Prompt clip_l: "White magic castle in dreamy valley"
Prompt clip_g "tall spires, colorful stained glass windows, White-stone walls, water lilies, "

Prompt t5: (White magic castle in dreamy valley:1.4). White-stone tall spires rise above mountains, mountain valley and many streams and ponds. Sunlight plays in the colorful stained glass windows. God rays knock out the sunbeams on the snow-white walls. The valley blooms with a carpet of meadow flowers, quiet ponds are covered with water lilies, tiny waterfalls break the water surface and the rays of the sun play a rainbow in the foam of the waterfalls. Iridescent butterflies and birds flutter over the flowers. The quiet, cozy valley is surrounded by impassable cliffs where dragons nest.

Fantasy, charming captivating enchanting atmosphere,

Bright, clear details, architectural masterpiece, panoramic landscape, High Angle Shot"

Для более реалистичных изображений дополним негативы:
For more realistic images, we'll add some negatives:

Negative prompt:
"Ugly, bad quality, low quality, low resolution, draft, deformed, unprofessional, unrealistic architecture, blur, blurry, vague, Gaudi, Neuschwanstein, cartoon, illustration, anime"
Positive prompt:

clip_l: empty
clip_g "amazing digital art with high quality, 64K UHD wallpaper"

t5: the same long one

Positive prompt:

clip_l: empty
clip_g "Realistic landscape photo made by professional photojournalist. Raw photo of wildlife in high resolution."

t5: the same long one

Я приложу Workflow с узлом ClipTextEncodeSD3, но он очень просто заменяет простой ClipTextEncode.

И, пожалуйста, напишите мне, если я в чём-то ошибаюсь, я пока не очень хорошо знаю теорию.

I will attach a Workflow with a ClipTextEncodeSD3 node, but it very simply replaces a simple ClipTextEncode.

And please write to me if I'm wrong about something, I don't know the theory very well yet.

SD3.5: separate prompts for clip models

Comments