How the FLUX2 Klein Pipeline Uses Qwen3 and Processes Data
The key points: Qwen3 is sampled at 3 different layers (9, 18, 27) and the outputs are concatenated.
Any prompt longer than 512 tokens is truncated (cropped), not concatenated (combined).
Your 512-token prompt is used 3 times, at three different layer depths; with negatives this gets very complicated, far more than CLIP.
Note: Everything beyond this point is AI-written based on the diffusers code; ComfyUI may do some of its own shenanigans with Qwen.
The pipeline_flux2_klein.py implementation represents a shift away from traditional diffusion pipelines that rely on CLIP-style text encoders and UNet backbones. Instead, it uses a large language model (Qwen3) as a multi-layer feature extractor and converts both text and image latents into a unified sequence format processed by a transformer.
At a high level, the pipeline works by transforming both the prompt and the image latents into structured token sequences, enriching them with positional metadata, and feeding them into a model that operates over all tokens jointly.
Qwen3 as a Text Encoder (Not a Generator)
The pipeline uses Qwen3 strictly as a forward-pass encoder, not as a text generator. Instead of calling .generate(), it runs a single forward pass and explicitly requests hidden states:
output = text_encoder(
    input_ids=input_ids,
    attention_mask=attention_mask,
    output_hidden_states=True,
    use_cache=False,
)

This disables autoregressive behavior (use_cache=False) and turns the model into a pure feature extractor. The pipeline does not use logits or decoded text, only internal representations.
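For intuition, here is a minimal, self-contained sketch of the same pattern using transformers directly (the checkpoint name and prompt are illustrative, not the pipeline's actual configuration; the real pipeline receives its text_encoder and tokenizer as pipeline components):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
text_encoder = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("a red fox in the snow", return_tensors="pt")
with torch.no_grad():
    output = text_encoder(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )

# hidden_states is a tuple: the input embeddings plus one tensor per layer,
# each of shape [batch, seq_len, hidden_dim].
print(len(output.hidden_states), output.hidden_states[-1].shape)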
Prompts Are Converted into Chat Format
Before tokenization, prompts are wrapped using a chat template:
messages = [{"role": "user", "content": single_prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

This means the model does not see raw text. Instead, it encodes something structurally closer to:

User: <prompt>
Assistant:

This adds extra tokens and context, aligning the input with how Qwen was trained.
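Concretely, Qwen-family chat templates render the message roughly as follows (an approximation; the exact special tokens vary by tokenizer version, and with enable_thinking=False recent Qwen3 templates also append an empty think block after the assistant turn):

<|im_start|>user
<prompt><|im_end|>
<|im_start|>assistant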
Fixed-Length Token Sequences
All prompts are tokenized to a fixed length:
inputs = tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=512,
)

This enforces a sequence length of 512 tokens, regardless of prompt length. Longer prompts are truncated, and shorter ones are padded. As a result, all text embeddings have shape:
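A quick sanity check of this behavior (assuming a tokenizer with a defined pad token; the checkpoint name and prompts are made up for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # illustrative checkpoint

short = tokenizer("a cat", padding="max_length", truncation=True, max_length=512)
long_ = tokenizer("word " * 5000, padding="max_length", truncation=True, max_length=512)

# Both come out at exactly 512 token ids: the short prompt is padded,
# the long one is cropped.
print(len(short.input_ids), len(long_.input_ids))  # 512 512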
[B, 512, ...]

Multi-Layer Feature Extraction from Qwen
Rather than using only the final layer, the pipeline extracts hidden states from multiple intermediate layers:
out = torch.stack(
    [output.hidden_states[k] for k in (9, 18, 27)], dim=1
)

Each selected layer has shape:

[B, 512, 4096]

These are then rearranged and flattened:

out = out.permute(0, 2, 1, 3)
prompt_embeds = out.reshape(batch_size, seq_len, num_channels * hidden_dim)

Resulting in:

[B, 512, 12288]

Each token embedding is therefore a concatenation of representations from three different depths of the model, preserving both low-level and high-level linguistic features.
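The shape bookkeeping is easy to verify with dummy tensors (a standalone sketch; the layer count and dimensions are taken from the text above):

import torch

B, seq_len, hidden_dim = 2, 512, 4096
num_channels = 3  # three selected layers stacked on dim=1

# Stand-ins for output.hidden_states[9], [18], [27].
hidden_states = [torch.randn(B, seq_len, hidden_dim) for _ in range(3)]

out = torch.stack(hidden_states, dim=1)  # [B, 3, 512, 4096]
out = out.permute(0, 2, 1, 3)            # [B, 512, 3, 4096]
prompt_embeds = out.reshape(B, seq_len, num_channels * hidden_dim)

print(prompt_embeds.shape)  # torch.Size([2, 512, 12288])

# Each token's 12288-dim embedding is its layer-9, layer-18, and
# layer-27 features laid end to end.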
Latents Are Converted into Patch Tokens
The image latents are not processed as 2D feature maps. Instead, they are patchified:
latents = latents.view(B, C, H//2, 2, W//2, 2)
latents = latents.permute(0, 1, 3, 5, 2, 4)
latents = latents.reshape(B, C * 4, H//2, W//2)

This converts each 2×2 spatial region into a higher-dimensional channel representation. The latents are then flattened into a sequence:

latents = latents.reshape(batch_size, num_channels, height * width).permute(0, 2, 1)

Resulting in:

[B, H*W, C]

So instead of a grid, the image becomes a list of patch tokens, similar to Vision Transformers.
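A standalone shape check of the patchify step (the latent dimensions here are illustrative; the real values depend on image size and the VAE):

import torch

B, C, H, W = 1, 16, 64, 64  # illustrative latent shape
latents = torch.randn(B, C, H, W)

# Fold each 2x2 spatial block into channels: [B, C, H, W] -> [B, 4C, H/2, W/2]
latents = latents.view(B, C, H // 2, 2, W // 2, 2)
latents = latents.permute(0, 1, 3, 5, 2, 4)
latents = latents.reshape(B, C * 4, H // 2, W // 2)
print(latents.shape)  # torch.Size([1, 64, 32, 32])

# Flatten the grid into a token sequence: [B, 4C, H/2, W/2] -> [B, H/2*W/2, 4C]
tokens = latents.reshape(B, C * 4, (H // 2) * (W // 2)).permute(0, 2, 1)
print(tokens.shape)  # torch.Size([1, 1024, 64])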
A Unified Coordinate System for All Tokens
One of the most distinctive features of this pipeline is the use of a shared coordinate system across all modalities. Tokens, whether from text or image, are assigned coordinates using:
coords = torch.cartesian_prod(t, h, w, l)

This produces 4D indices:

(T, H, W, L)

Text tokens vary along L (sequence position)
Image tokens vary along H, W (spatial position)
Multiple images vary along T (temporal index)
This allows all tokens to coexist in a single structured space, even though they originate from different modalities.
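A sketch of how such coordinates can be built (the variable names and the convention of zeroing unused axes are mine; the real pipeline's axis ordering and offsets may differ):

import torch

# Text tokens: fixed (t, h, w), varying sequence position l.
text_coords = torch.cartesian_prod(
    torch.zeros(1, dtype=torch.long),  # T: single "frame"
    torch.zeros(1, dtype=torch.long),  # H: unused for text
    torch.zeros(1, dtype=torch.long),  # W: unused for text
    torch.arange(512),                 # L: 0..511
)
print(text_coords.shape)  # torch.Size([512, 4])

# Image tokens: fixed (t, l), varying spatial position (h, w).
img_coords = torch.cartesian_prod(
    torch.zeros(1, dtype=torch.long),  # T
    torch.arange(32),                  # H: patch rows
    torch.arange(32),                  # W: patch cols
    torch.zeros(1, dtype=torch.long),  # L: unused for images
)
print(img_coords.shape)  # torch.Size([1024, 4])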
Everything Becomes a Sequence
After processing:
Text embeddings:
[B, 512, 12288]

Image latents:
[B, H*W, C]
Both are treated as sequences of tokens. Combined with their coordinate metadata, they can be processed together by a transformer that operates over the entire set.
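Conceptually, the joint sequence is just a concatenation along the token axis, with matching coordinates carried alongside. The sketch below is a heavy simplification: d_model, text_proj, and img_proj are illustrative stand-ins, since the transformer's actual input projections and width are not shown in this section.

import torch

B = 1
prompt_embeds = torch.randn(B, 512, 12288)  # text tokens
image_tokens = torch.randn(B, 1024, 64)     # patch tokens (from the step above)

# Each stream is first projected to a shared width before joint attention.
d_model = 3072  # illustrative
text_proj = torch.nn.Linear(12288, d_model)
img_proj = torch.nn.Linear(64, d_model)

tokens = torch.cat([text_proj(prompt_embeds), img_proj(image_tokens)], dim=1)
print(tokens.shape)  # torch.Size([1, 1536, 3072])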
Key Architectural Insight
This pipeline does not follow the standard pattern of:
Text encoder → pooled embedding → cross-attention → UNet

Instead, it implements something closer to:
Text tokens (multi-layer LLM features)
+ Image tokens (patchified latents)
+ Coordinates (T, H, W, L)
→ Unified transformer processing all tokens jointly

This design removes the distinction between "conditioning" and "generation." Text is no longer compressed into a single vector or used only in cross-attention; it exists alongside image tokens as part of the same sequence.
Conclusion
The FLUX2 Klein pipeline fundamentally rethinks how diffusion models incorporate language. By using Qwen3 as a multi-layer feature extractor, preserving full token sequences, converting images into patch tokens, and introducing a shared coordinate system, it creates a unified token space where text and images are processed together.
The result is a system that behaves less like a traditional diffusion model with conditioning, and more like a multimodal transformer operating over structured tokens.


