Cascade VS SDXL — Low VRAM

Low VRAM Adventures

[🔗link to the series announcement]

There is a new model architecture in town: Cascade Diffusion.

We can now play with it thanks to Stable Cascade.

It is allegedly on par with or faster than SDXL. What difference will it make for Low VRAM users?

Let's find out!

What is Cascade?

Read about it directly from Stability AI on their 🔗Hugging Face model page.

In a few words, it's a model that works in a more heavily compressed latent space, so it is more efficient and should run faster or with better accuracy. It also has over 1 billion more parameters than SDXL.

If you want to get started with Stable Cascade in ComfyUI, you can 🔗read my article and download my workflow.

Test Settings

I have run about 100 generations at 5 resolutions for both Stable Cascade and SDXL with very similar ComfyUI workflows.

A/ Test settings:

  • Intel Core i5-8400

  • 16 GB RAM, 2133 MHz

  • Nvidia GTX 1060 6 GB

  • Windows 10 64-bit

  • ComfyUI Portable + run_nvidia_gpu

  • CUDA 12

  • Steps

    • Cascade: Stage C: 20, Stage B: 10

    • SDXL: sampler: 30

B/ Stable Cascade is new and there are only a handful of community models, so I am using the stock version. To be fair to Cascade, I am also using the stock SDXL v1.0 (VAE fix) checkpoint.

C/ I am using the run time that ComfyUI reports itself as the reference, since it should also include VAE decoding. It does not include model load times or writing to disk. I found those times less repeatable because of other system processes, and I believe the self-reported run times are good enough to draw comparisons.
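If you want to collect the same kind of numbers, here is a minimal sketch that pulls the self-reported run times out of a saved console log and averages them. It assumes ComfyUI prints a line of the form "Prompt executed in 12.34 seconds" after each generation; the function name and the sample log are made up for illustration and are not measured data.

```python
import re
from statistics import mean

# Assumed log format: ComfyUI prints "Prompt executed in <seconds> seconds"
# after each queued prompt. Adjust the pattern if your version differs.
RUNTIME_PATTERN = re.compile(r"Prompt executed in ([0-9]+(?:\.[0-9]+)?) seconds")


def extract_runtimes(log_text: str) -> list[float]:
    """Return every self-reported run time (in seconds) found in the log."""
    return [float(value) for value in RUNTIME_PATTERN.findall(log_text)]


# Hypothetical console output, for illustration only (not measured data).
sample_log = """
got prompt
Prompt executed in 61.50 seconds
got prompt
Prompt executed in 58.50 seconds
"""

runtimes = extract_runtimes(sample_log)
average_runtime = mean(runtimes)
print(f"{len(runtimes)} runs, average {average_runtime:.2f} s")
```

Averaging several runs per resolution smooths out the run-to-run noise, which matters when the gap between the two models is small.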

D/ I used the same prompts and same resolutions to build a comparable dataset.

Test Results

CD is for Cascade Diffusion aka Stable Cascade.

Run times

Conversion as Pixels per second
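The conversion itself is simple arithmetic: the number of pixels in the image divided by the run time in seconds. A minimal sketch follows; the function name is hypothetical and the run times are invented for illustration, they are not the measured results.

```python
def pixels_per_second(width: int, height: int, runtime_seconds: float) -> float:
    """Throughput of one generation: total pixels divided by run time."""
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    return (width * height) / runtime_seconds


# Hypothetical run times, for illustration only (not the measured results).
example_runs = [
    ("run A", 1024, 1024, 64.0),
    ("run B", 1024, 1024, 80.0),
]

for label, width, height, seconds in example_runs:
    rate = pixels_per_second(width, height, seconds)
    print(f"{label}: {width}x{height} in {seconds:.0f} s = {rate:.0f} px/s")
```

Normalizing by pixel count is what makes runs at different resolutions comparable: a longer run time at a larger resolution can still mean a higher throughput.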

Takeaway

This is only a small sample size but we can already see trends.

Stable Cascade is indeed faster than SDXL. The difference is tiny but noticeable.

The more important trend that I see is that Stable Cascade performance peaks around a resolution of 1180×2048.

What about quality?

My test prompts were not engineered to yield good results but to test how Stable Cascade and SDXL react. I have published a portion of the generated images as posts for you to compare:

1/ Portrait

Prompt:

in a perfect black room, only the face of a woman is lit, she has imperfect skin, oily messy hair, a crooked nose, slightly imbalanced face, asymmetric mouth, she has a serious expression, wrinkles at the corner of her eyes betray her age of 42, her eyes are filled with the lights of distant galaxies, her eyes are extremely beautiful captivating the attention of the viewer,

Here I wanted to see if the weight of tokens decreases as the prompt goes on. I also put what a human would consider most important at the end.
Neither CD nor SDXL captured the description of the eyes correctly.
For some reason, SDXL gave unrealistic generations with a lot of errors.

2/ Executioner

There are two sets of images under this section: a large horizontal one and a smaller vertical one.

Vertical:

epic photograph, medieval execution, on a wooden platform raised above the crowd, the masked executioner raises an axe above his head, next to the executioner is a man kneeling resting his head on a wood trunk, a crowd fills the public square, the public square is surrounded by medieval houses and a church tower  is visible above the roofs

Horizontal:

epic photograph, moody sky, medieval execution, on a wooden platform raised above the crowd, the masked executioner raises an axe above his head, next to the executioner is a man kneeling resting his head on a wood trunk, a crowd fills the public square, the public square is surrounded by medieval houses and a church tower  is visible above the roofs, sun rays going through grey clouds, the metal of the axe is shining, the air is heavy, the crowd is silent, a tensed feeling of dread is gripping the crowd, the executioner appears giant and strong like an ancient god of death

On both prompts I wanted to test the identification of multiple subjects. Neither model separated the executioner and the convict.
On the large image I also wanted to see if we would get repetition. CD did not repeat anything, but SDXL repeated the executioner multiple times.
CD seems to handle cohesion better on large images.

3/ Hellscape

This is similar to the portrait test, but on a larger image: I described the main subject after the overall scene.
CD put more focus on the main character described at the end. It did not really attempt to populate the area, though.
SDXL put more emphasis on the description of the landscape, sometimes omitting the character completely.
Neither captured the last details about the character, probably because of a bias in the training data about Satan's appearance.

extremely vibrant and saturated colors, the burning desolate landscape of the kingdom of Hell in a wide panoramic landscape photograph, under a dark roof are thousands of small fires, lava seeps through cracks in the ground, ghouls and demons fill the space, men and woman in ragged clothes, sinners condemned to an eternal punishment are surrounded by demons, some men are pushed into flames by ghouls, some women are whipped by demons, in the center of the frame, the mysterious and seducing but extremely scary Satanic overlord himself, Satan is a fallen angel with red bat wings, Satan wears an ashy black coat, it is not possible to tell if Satan is male or female, satan's face is androgynous of a pale skin like porcelain but with two flaming red menacing eyes and a gaze that pierces the soul of anyone looking.

4/ Grim Reaper

We often struggle to get dark photographs with SD. I wanted to test low light.

CD did great work with this prompt.

Surprisingly, SDXL completely ignored the photograph part of the prompt and produced black ink drawings. The images are good, but not what was intended.

high contrast dark photograph, washed colors, mostly black, the black silhouette of the grim reaper is barely noticeable against the dark grey of the dim lit room, a small window lets some moon light inside, half the face of an old lady asleep in her bed is lit by the gentle weak warm light of the moon, a dead candle smokes on the side of the bed, the atmosphere is very heavy, the reaper in its torn weathered hood is lurking above the old woman, the pointy end of the scythe almost touching her face, these are the last hours of a long lived life

5/ Space Disco

This prompt was meant to produce an image that requires moving away from the training data.
Neither model got the dance pose, the mirrored suit, or the lighting equipment.
Only SDXL had the lunar lander appear, and only once.
CD captured more of the prompt and returned more realistic images.

an American astronaut in full EVA suit is disco dancing on the Moon, his EVA suit is covered of hundreds of mirror facets, neon disco lights and party lasers are set up on the Moon regolith, the lunar lander module is visible in the distance, there is no air on the moon, the horizon is absolutely crystal clear, the sky is pitch black with a blue earth rising and the milky way visible, the beautiful lasers, neon and lights of the milky way and stars reflect on the shiny EVA suit, the atmosphere is joyful, light and festive.

6/ Pope clowning

This short prompt was meant to test many details.
Neither captured the red nose; both turned it into juggling balls.
Neither got the unicycle, but CD did produce a bicycle while SDXL just ignored it.
CD almost ignored the kittens while SDXL gave a ton of them.
The congregation was always present with CD, but far away and tiny. SDXL had a crowd in half the generations, and it looked more like music groupies.

The location was well respected by both engines.

SDXL had a few repetition issues due to the large image size; CD did not.

the holy pope in his white dress is wearing a red clown nose and riding a unicycle while joggling kittens, in the middle of Saint Peter Cathedral in Rome, awe and wonder in the eyes of the congregation