Understanding Concept Geometry (New MIT Research!)

Understanding Concept Geometry: From Theory to Better Prompts

Paper here: https://arxiv.org/html/2410.19750v1

UPDATE 10/31:

Further testing indicates this works best for LoRA-less output at high CFG (7.5-9.5). It tends to make those CFG values actually usable, even on models that are less suited for high CFG, like Illustrious XL.
More updates to come.

Author's Note

This article was developed with assistance from Claude AI to help organize and articulate these concepts. While the MIT paper's findings are peer-reviewed research, my applications to Stable Diffusion prompt engineering are still experimental and need further testing. The effectiveness of different syntax patterns (:: vs :) and prompt structures is based on initial results, and I'll be updating this article soon with practical image examples to demonstrate these techniques in action. I encourage community members to test these methods and share their findings as we explore these concepts together.

Also, most of my testing has been conducted on the Illustrious XL model, especially because of its novelty and the specific prompt structure it currently seems to need. I found that this concept-based geometric approach actually produced superior results compared to the conventional "quality tags first" structure (perfect quality, best quality, masterpiece, etc.) that's commonly recommended. This suggests that understanding and working with the model's natural concept organization might be more effective than traditional prompt patterns.

More image examples and practical demonstrations coming soon.

Introduction

Recently, I've been diving deep into a fascinating MIT research paper titled "The Geometry of Concepts: Sparse Autoencoder Feature Structure." While studying this paper and experimenting with prompts, I've discovered some interesting connections that have significantly improved my prompt engineering. Before I share these insights, let's establish some fundamental concepts that will help everything make more sense.

Understanding the Basics

### Latent Space: The Universe of Concepts

Think of latent space as an enormous multidimensional universe where every possible image exists somewhere. When you use Stable Diffusion, you're essentially:

1. Converting your prompt into coordinates in this space

2. Finding a location that matches your description

3. Moving through this space during the diffusion process to create your image

It's like having an infinite art gallery where every possible artwork exists - your prompt is just giving directions to find the one you want.

### How Stable Diffusion Understands Prompts

When you write a prompt, Stable Diffusion:

1. Breaks it into tokens (words and concepts it understands)

2. Maps these tokens to points in latent space

3. Uses these points as guides for generating your image

Understanding this process is crucial because it explains why prompt structure and weights matter so much - you're essentially giving the AI more precise coordinates in this concept space.
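The two-step pipeline above (tokenize, then map to coordinates) can be sketched with a toy stand-in. Real Stable Diffusion uses a CLIP text encoder with subword tokens and high-dimensional embeddings; the hand-picked 3-D "coordinates" below are invented purely to make the idea visible.

```python
# Toy sketch of the tokenize -> embed pipeline described above.
# Real Stable Diffusion uses a CLIP text encoder; this stand-in maps
# each known token to a made-up 3-D point in "concept space".

TOY_EMBEDDINGS = {
    "portrait":   (0.9, 0.1, 0.0),
    "watercolor": (0.1, 0.8, 0.2),
    "castle":     (0.2, 0.3, 0.9),
}

def tokenize(prompt: str) -> list[str]:
    """Split a prompt into lowercase tokens (real tokenizers use subwords)."""
    return [w.strip(",").lower() for w in prompt.split() if w.strip(",")]

def embed(tokens: list[str]) -> list[tuple[float, float, float]]:
    """Map each known token to its point in the toy concept space."""
    return [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]

coords = embed(tokenize("watercolor portrait, castle"))
```

The diffusion process then uses those coordinates as guidance; everything downstream of this step is conditioned on where in concept space your prompt landed.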

The MIT Paper's Key Findings

The researchers discovered that AI models organize concepts at three distinct scales:

1. Crystal Scale (Small): Concepts form geometric patterns like parallelograms

2. Brain Scale (Medium): Features cluster into functional "lobes" like a brain

3. Galaxy Scale (Large): The overall structure follows specific patterns

Let me explore each and show how they apply to prompting.

Important note for all AI model usage

These concepts should apply across all model families, since they all use latent space to structure their training data, whether for text, image, video, or music generation. Based on current research, any model operating in latent space would likely follow these patterns, as this seems to be the most efficient way to structure weight/space/relationship data for a given model.

This is admittedly a heady conclusion/speculation on my part, but a general rule of thumb in science, especially where energy and efficiency are concerned, is this: when humans do not impose rigid organizational guidelines on a system, the system will organize itself in whatever way most efficiently maintains its current state, whether that's gases distributing across a container, reactions between different chemicals, or star systems arranging their planets.

I believe that this concept geometry finding is just another example of that idea.

Organizing these concepts into standardized geometric shapes lets a model, when creating an output, traverse its latent space toward the desired result with the least expenditure of energy and time.

### 1. Crystal Structure: How Concepts Combine

Imagine concepts as points in space. The researchers found that related concepts often form geometric patterns.

  • Parallelogram Structures

    • Concepts form four-point patterns where relationships between points are preserved

    • These structures maintain consistent "semantic distances"

    • Quality of these relationships improves when removing "distractor" dimensions

  • Function Vectors

    • The difference between two concepts creates a "function vector"

    • These vectors can be applied to other concepts

    • Multiple vectors can combine to create new concepts

  • Key Discovery: Distractor Features

    • Some dimensions (like word length in text) can mask these relationships

The classic example is the word analogy man : woman :: king : queen, forming a parallelogram where:

- The difference between "man" and "woman" represents the concept of "gender"

- The difference between "man" and "king" represents the concept of "royalty"
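The "function vector" idea is easy to sketch with toy vectors: the offset between two concept embeddings encodes a relation, and adding that offset to a third concept lands on the fourth corner of the parallelogram. The 2-D vectors below are invented for illustration and are not real embeddings.

```python
# Minimal sketch of function vectors: the offset between two concepts
# encodes a relation that can be applied to another concept.
# These 2-D points are made up for illustration only.

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

man, woman = (1.0, 0.0), (1.0, 1.0)
king       = (3.0, 0.0)

gender = sub(woman, man)    # the "gender" direction
queen  = add(king, gender)  # predicted fourth corner of the parallelogram
```

In real embeddings the fourth corner is only approximately recovered, and the paper's point is that removing "distractor" dimensions tightens exactly this kind of geometry.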

How This Applies to Stable Diffusion

In art generation, I've observed similar patterns:

portrait + watercolor_style = watercolor_portrait

This explains why, in my findings, certain prompt structures worked well:

masterpiece portrait::1.4, [artist] style::1.2

The double-colon syntax (::) seems to create stronger geometric relationships between concepts, which might explain why I've found it often produces better results than single-colon syntax (:) for core concepts.
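To make the two syntaxes concrete, here is a hypothetical parser that pulls `term::weight` and `(term:weight)` pairs out of a prompt string. The regex and parsing rules are my own sketch; actual UIs (A1111, ComfyUI, etc.) each have their own parsers and may treat these forms differently.

```python
import re

# Hypothetical parser for the two weight syntaxes discussed above.
# Real prompt parsers vary between UIs; this only illustrates the notation.

WEIGHT_RE = re.compile(r"\(?\s*([^():]+?)\s*(::|:)\s*([\d.]+)\s*\)?")

def parse_weights(prompt: str):
    """Return (term, syntax, weight) triples found in the prompt."""
    return [(m.group(1).strip(), m.group(2), float(m.group(3)))
            for m in WEIGHT_RE.finditer(prompt)]

parsed = parse_weights("masterpiece portrait::1.4, (artist style:1.2)")
```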

### 2. Brain-Like Organization: Functional Groups

The research found that features cluster into specialized "lobes" that handle specific types of content. This is similar to how a human brain has different regions for different functions.

At this intermediate scale, the researchers discovered:

  1. Functional Lobes

    • Features cluster into specialized regions

    • Each region handles specific types of content

    • Strong spatial modularity similar to brain regions

  2. Co-occurrence Patterns

    • Features that fire together cluster together

    • These clusters show strong statistical significance

    • Different clustering metrics reveal different functional relationships

  3. Cross-Lobe Interactions

    • Lobes communicate in structured ways

    • Information flows between functional groups

    • Patterns mirror neural network architecture

Application to Prompting

This finding suggests we should group related concepts together in prompts. Here's the structure I've found most effective:

# Artistic/Style Group

(masterpiece::1.4), (digital art::1.3), (artist style::1.2)

# Physical/Subject Group

(character details:1.1), (pose:1.0)

# Environment Group

(background elements:0.9), (lighting:0.8)

# Technical Group

(detailed:0.8), (high quality:0.8)
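The grouped structure above can be assembled programmatically so that related terms always stay adjacent. The helper and group names below are my own illustration, not part of any tool.

```python
# Hypothetical helper that assembles the functional groups above into one
# prompt string, keeping each group's terms adjacent and in order.

def build_prompt(groups: dict[str, list[str]]) -> str:
    """Join weighted terms group by group, preserving group order."""
    return ", ".join(term for terms in groups.values() for term in terms)

prompt = build_prompt({
    "style":       ["(masterpiece::1.4)", "(digital art::1.3)"],
    "subject":     ["(character details:1.1)", "(pose:1.0)"],
    "environment": ["(background elements:0.9)", "(lighting:0.8)"],
    "technical":   ["(detailed:0.8)", "(high quality:0.8)"],
})
```

Because Python dicts preserve insertion order, the style group always lands first, mirroring the "lobes" idea of keeping co-occurring features together.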

### 3. Galaxy-Scale Structure: Hierarchical Organization

The research found that feature importance follows a power-law distribution, with some features being dramatically more important than others.

To elaborate slightly, at this largest scale the research revealed:

  1. Power Law Distribution

    • Feature importance follows a mathematical power law

    • Creates a "fractal cucumber" shape of concept relationships in high-dimensional space

    • Middle layers show steepest power law slopes

  2. Layer Dependencies

    • Different layers show different organizational patterns

    • Middle layers act as information bottlenecks

    • Pattern steepness varies systematically across layers

  3. Clustering Patterns

    • Strong hierarchical organization

    • Systematic variation across model layers

    • Complex interaction between 3+ scales
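A power-law falloff in importance can be mirrored directly in a weight schedule: element i gets weight w0 * (i+1)^-alpha, so emphasis drops quickly and then flattens. The constants below are illustrative choices of mine, not values from the paper.

```python
# Sketch of a power-law weight schedule for prompt elements, mirroring
# the paper's power-law distribution of feature importance.
# w0 and alpha are illustrative, not taken from the paper.

def power_law_weights(n: int, w0: float = 1.4, alpha: float = 0.3):
    """Weight for element i: w0 * (i+1) ** -alpha, rounded to 2 places."""
    return [round(w0 * (i + 1) ** -alpha, 2) for i in range(n)]

weights = power_law_weights(5)
```

With these constants, five elements happen to span roughly the 1.4-down-to-0.8 range used throughout this article.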

Prompting Implementation

This (speculatively) suggests using decreasing weights for less important elements:

(Again, speculation. I find the reverse order of this weighting works better in some cases)

# Primary Elements (1.3-1.4)

masterpiece portrait::1.4

# Style Elements (1.1-1.2)

artist style::1.2

# Details (0.8-1.0)

(specific details:0.9)

My Discoveries About Syntax

More than anything, I believe that the parenthetical grouping of similar concepts and types of concepts, in the proper order, is the most important factor in getting what you really want out of any given prompt.

However, through extensive testing, I've found that mixing syntax types can be additionally effective:

1. Double-colon (::) for primary elements:

masterpiece portrait::1.4, artist style::1.2

2. Single-colon (:) for details:

(specific details:1.0), (background elements:0.9)

This mixed approach seems to leverage different types of concept relationships:

- :: for strong geometric relationships (crystal scale)

- : for smoother blending of details (brain scale)
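The mixed-syntax pattern can be captured in a small formatter: double colons for primary (crystal-scale) concepts, single colons in parentheses for detail (brain-scale) terms. The function and the term lists are my own sketch for illustration.

```python
# Hypothetical formatter for the mixed-syntax pattern: :: for primary
# concepts, (term:weight) for details. Terms and weights are illustrative.

def mixed_prompt(primary: list[tuple[str, float]],
                 details: list[tuple[str, float]]) -> str:
    head = ", ".join(f"{t}::{w}" for t, w in primary)
    tail = ", ".join(f"({t}:{w})" for t, w in details)
    return f"{head}, {tail}"

p = mixed_prompt([("masterpiece portrait", 1.4), ("artist style", 1.2)],
                 [("specific details", 1.0), ("background elements", 0.9)])
```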

Examples

Grouping/no grouping + colons

first with parentheses grouping of concepts:

masterpiece portrait of Princess Midna, Yoshitaka Amano style, ethereal fantasy art, (elegant feminine figure), (blue eyes, confident smile), (form-fitting blue textured bodysuit), (regal fur trim), (majestic castle background), (dramatic outdoor lighting), (highly detailed, perfect lighting)

Second with no parentheses grouping but with double colons:

masterpiece portrait of Princess Midna, Yoshitaka Amano style, ethereal fantasy art, elegant feminine figure::1.1, blue eyes, confident smile::1.1, form-fitting blue textured bodysuit::1.1, regal fur trim::1.1, majestic castle background::1.1, dramatic outdoor lighting::1.1, highly detailed, perfect lighting::1.1

third with no parentheses grouping and with single colons:

masterpiece portrait of Princess Midna, Yoshitaka Amano style, ethereal fantasy art, elegant feminine figure:1.1, blue eyes, confident smile:1.1, form-fitting blue textured bodysuit:1.1, regal fur trim:1.1, majestic castle background:1.1, dramatic outdoor lighting:1.1, highly detailed, perfect lighting:1.1

Here, the first probably adheres most closely to the prompt in terms of the castle background, but the second arguably has the best detail, styling, and composition of the three.

Double vs Single Colon

first with double colons, second with single colons.

Notice the better overall detail, adherence, styling, composition, and shading, and the fewer artifacts, in the first pictures:

masterpiece portrait of Princess Midna::1.4, Yoshitaka Amano style::1.2, ethereal fantasy art::1.1, (elegant feminine figure::1.1), (blue eyes, confident smile::1.0), (form-fitting blue textured bodysuit::1.0), (regal fur trim::0.9), (majestic castle background::0.9), (dramatic outdoor lighting::0.8), (highly detailed, perfect lighting::0.8)

masterpiece portrait of Princess Midna:1.4, Yoshitaka Amano style:1.2, ethereal fantasy art:1.1, (elegant feminine figure:1.1), (blue eyes, confident smile:1.0), (form-fitting blue textured bodysuit:1.0), (regal fur trim:0.9), (majestic castle background:0.9), (dramatic outdoor lighting:0.8), (highly detailed, perfect lighting:0.8)

Colon syntax mixing + order of mixing:

double, single only, double first then single, single first then double.

In my opinion, number three (double colons then single) looks the best in regard to adherence, style, and details.

Prompts 1-4:

masterpiece::1.4, masterpiece portrait of Princess Midna, Yoshitaka Amano style, ethereal fantasy art, (elegant feminine figure::1.1), (blue eyes, confident smile::1.1), (form-fitting blue textured bodysuit::1.1), (regal fur trim::1.1), (majestic castle background::1.1), (dramatic outdoor lighting::1.1), (highly detailed, perfect lighting::1.1)
masterpiece:1.4, masterpiece portrait of Princess Midna, Yoshitaka Amano style, ethereal fantasy art, (elegant feminine figure:1.1), (blue eyes, confident smile:1.1), (form-fitting blue textured bodysuit:1.1), (regal fur trim:1.1), (majestic castle background:1.1), (dramatic outdoor lighting:1.1), (highly detailed, perfect lighting:1.1)
masterpiece::1.4, masterpiece portrait of Princess Midna, Yoshitaka Amano style, ethereal fantasy art, (elegant feminine figure:1.1), (blue eyes, confident smile:1.1), (form-fitting blue textured bodysuit:1.1), (regal fur trim:1.1), (majestic castle background:1.1), (dramatic outdoor lighting:1.1), (highly detailed, perfect lighting:1.1)
masterpiece:1.4, masterpiece portrait of Princess Midna, Yoshitaka Amano style, ethereal fantasy art, (elegant feminine figure::1.1), (blue eyes, confident smile::1.1), (form-fitting blue textured bodysuit::1.1), (regal fur trim::1.1), (majestic castle background::1.1), (dramatic outdoor lighting::1.1), (highly detailed, perfect lighting::1.1)

Weight Guidelines I've Developed

Based on the paper's hierarchical findings and my testing:

1. Core Concepts: 1.3-1.4

- Main subject

- Quality markers

- Primary style

2. Major Attributes: 1.1-1.2

- Artist style

- Key features

- Important details

3. Supporting Elements: 0.8-1.0

- Background

- Lighting

- Technical details

(Again, speculation. I find the reverse order of this weighting works better in some cases)
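The three-tier guideline can be written down as a simple lookup that picks a weight inside each tier's range. This is just a restatement of the table above as code; the helper and its `emphasis` parameter are my own invention.

```python
# Sketch of the three-tier weight guideline as a lookup table.
# Ranges come from the guidelines above; the helper itself is hypothetical.

TIERS = {
    "core":       (1.3, 1.4),  # main subject, quality markers, primary style
    "major":      (1.1, 1.2),  # artist style, key features, important details
    "supporting": (0.8, 1.0),  # background, lighting, technical details
}

def weight_for(tier: str, emphasis: float = 1.0) -> float:
    """Pick a weight inside the tier's range; emphasis in [0, 1]."""
    lo, hi = TIERS[tier]
    return round(lo + emphasis * (hi - lo), 2)
```

Per the note above, reversing the tier ordering is worth testing too; swapping the ranges in `TIERS` is all that takes.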

Practical Templates

Here's my optimized template that's proven effective:

# Core Concept (Crystal Structure)

masterpiece portrait of [subject]::1.4,
[artist] style::1.2,
[art style] art::1.1,

# Details (Functional Groups)

(physical description:1.1),
(facial features, expression:1.0),
(clothing details:1.0),
(accessories:0.9),
(background setting:0.9),
(lighting style:0.8),
(technical details:0.8)

Example Implementation

masterpiece portrait of ethereal sorceress::1.4,
Alphonse Mucha style::1.2,
art nouveau fantasy::1.1,
(slender ethereal figure:1.1),
(silver eyes, serene expression:1.0),
(flowing iridescent robes:1.0),(crystal staff:0.9),(magical library background:0.9),
(mystical lighting:0.8),(highly detailed, magical effects:0.8)

Advanced Applications

For more complex images, I've developed these specialized templates:

Character Portrait Template

masterpiece portrait::1.4,

[artist] style::1.2,

[art style] art::1.1,

(physical description:1.1),

(facial features, expression:1.0),

(clothing details:1.0),

(pose, gesture:0.9),

(background:0.9),

(lighting:0.8),

(technical details:0.8)

Landscape Template

masterpiece landscape::1.4,

[artist] style::1.2,

[mood] atmosphere::1.1,

(terrain features:1.1),

(weather conditions:1.0),

(environmental details:1.0),

(time of day:0.9),

(lighting effects:0.8),

(technical quality:0.8)

Troubleshooting Common Issues (speculative!)

Based on my testing, here are solutions to common problems:

1. Style Not Strong Enough

- Increase style weight to 1.3-1.4

- Move style terms earlier in prompt

- Group style terms together

2. Lost Detail

- Increase detail weights to 1.0-1.1

- Add specific detail descriptors

- Group related details together

3. Inconsistent Results

- Use more specific core concepts

- Maintain consistent weight hierarchy

- Group related concepts together

Conclusion

Understanding how Stable Diffusion organizes concepts in its latent space has transformed how I write prompts. By aligning our instructions with the model's natural organization of concepts, we can achieve more consistent and higher-quality results.

The key principles I've found most valuable are:

- Group related concepts together

- Use appropriate weights for different elements

- Mix syntax types strategically

- Maintain hierarchical structure

I encourage you to experiment with these templates and principles. Every model and style might need slight adjustments, but these fundamental concepts should provide a strong foundation for developing your own effective prompts.

---

This article is based on research from "The Geometry of Concepts: Sparse Autoencoder Feature Structure" (MIT, 2024) and my personal experimentation with Stable Diffusion prompting. Feel free to share your results and experiences with these techniques.
