How the FLUX2 Klein Pipeline Uses Qwen3 and Processes Data
The key points: Qwen3 is sampled at 3 different layers (9, 18, 27) and the outputs are concatenated.
Any prompt longer than 512 tokens is truncated (cropped), not concatenated (combined).
Your 512-token prompt is used 3 times, at three different layer depths; with negatives this gets very complicated, far more than CLIP.
Note: Everything beyond this point is AI-written based on the diffusers code; ComfyUI may do some of its own shenanigans with Qwen.
The pipeline_flux2_klein.py implementation represents a shift away from traditional diffusion pipelines that rely on CLIP-style text encoders and UNet backbones. Instead, it uses a large language model (Qwen3) as a multi-layer feature extractor and converts both text and image latents into a unified sequence format processed by a transformer.
At a high level, the pipeline works by transforming both the prompt and the image latents into structured token sequences, enriching them with positional metadata, and feeding them into a model that operates over all tokens jointly.
Qwen3 as a Text Encoder (Not a Generator)
The pipeline uses Qwen3 strictly as a forward-pass encoder, not as a text generator. Instead of calling .generate(), it runs a single forward pass and explicitly requests hidden states:
output = text_encoder(
    input_ids=input_ids,
    attention_mask=attention_mask,
    output_hidden_states=True,
    use_cache=False,
)

This disables autoregressive behavior (use_cache=False) and turns the model into a pure feature extractor. The pipeline does not use logits or decoded text, only internal representations.
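For intuition, here is a minimal, self-contained sketch of the same pattern using transformers directly (the checkpoint name and prompt are illustrative, not the pipeline's actual configuration; the real pipeline receives its text_encoder and tokenizer as pipeline components):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
text_encoder = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("a red fox in the snow", return_tensors="pt")
with torch.no_grad():
    output = text_encoder(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )

# hidden_states is a tuple: the input embeddings plus one tensor per layer,
# each of shape [batch, seq_len, hidden_dim].
print(len(output.hidden_states), output.hidden_states[-1].shape)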
Prompts Are Converted into Chat Format
Before tokenization, prompts are wrapped using a chat template:
messages = [{"role": "user", "content": single_prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

This means the model does not see raw text. Instead, it encodes something structurally closer to:

User: <prompt>
Assistant:

This adds extra tokens and context, aligning the input with how Qwen was trained.
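Concretely, Qwen-family chat templates render the message roughly as follows (an approximation; the exact special tokens vary by tokenizer version, and with enable_thinking=False recent Qwen3 templates also append an empty think block after the assistant turn):

<|im_start|>user
<prompt><|im_end|>
<|im_start|>assistant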
Fixed-Length Token Sequences
All prompts are tokenized to a fixed length:
inputs = tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=512,
)

This enforces a sequence length of 512 tokens, regardless of prompt length. Longer prompts are truncated, and shorter ones are padded. As a result, all text embeddings have shape:
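A quick sanity check of this behavior (assuming a tokenizer with a defined pad token; the checkpoint name and prompts are made up for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # illustrative checkpoint

short = tokenizer("a cat", padding="max_length", truncation=True, max_length=512)
long_ = tokenizer("word " * 5000, padding="max_length", truncation=True, max_length=512)

# Both come out at exactly 512 token ids: the short prompt is padded,
# the long one is cropped.
print(len(short.input_ids), len(long_.input_ids))  # 512 512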
[B, 512, ...]

Multi-Layer Feature Extraction from Qwen
Rather than using only the final layer, the pipeline extracts hidden states from multiple intermediate layers:
out = torch.stack(
    [output.hidden_states[k] for k in (9, 18, 27)], dim=1
)

Each selected layer has shape:

[B, 512, 4096]

These are then rearranged and flattened:

out = out.permute(0, 2, 1, 3)
prompt_embeds = out.reshape(batch_size, seq_len, num_channels * hidden_dim)

Resulting in:

[B, 512, 12288]

Each token embedding is therefore a concatenation of representations from three different depths of the model, preserving both low-level and high-level linguistic features.
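The shape bookkeeping is easy to verify with dummy tensors (a standalone sketch; the layer count and dimensions are taken from the text above):

import torch

B, seq_len, hidden_dim = 2, 512, 4096
num_channels = 3  # three selected layers stacked on dim=1

# Stand-ins for output.hidden_states[9], [18], [27].
hidden_states = [torch.randn(B, seq_len, hidden_dim) for _ in range(3)]

out = torch.stack(hidden_states, dim=1)  # [B, 3, 512, 4096]
out = out.permute(0, 2, 1, 3)            # [B, 512, 3, 4096]
prompt_embeds = out.reshape(B, seq_len, num_channels * hidden_dim)

print(prompt_embeds.shape)  # torch.Size([2, 512, 12288])

# Each token's 12288-dim embedding is its layer-9, layer-18, and
# layer-27 features laid end to end.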
Latents Are Converted into Patch Tokens
The image latents are not processed as 2D feature maps. Instead, they are patchified:
latents = latents.view(B, C, H//2, 2, W//2, 2)
latents = latents.permute(0, 1, 3, 5, 2, 4)
latents = latents.reshape(B, C * 4, H//2, W//2)

This converts each 2×2 spatial region into a higher-dimensional channel representation. The latents are then flattened into a sequence:

latents = latents.reshape(batch_size, num_channels, height * width).permute(0, 2, 1)

Resulting in:

[B, H*W, C]

So instead of a grid, the image becomes a list of patch tokens, similar to Vision Transformers.
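A standalone shape check of the patchify step (the latent dimensions here are illustrative; the real values depend on image size and the VAE):

import torch

B, C, H, W = 1, 16, 64, 64  # illustrative latent shape
latents = torch.randn(B, C, H, W)

# Fold each 2x2 spatial block into channels: [B, C, H, W] -> [B, 4C, H/2, W/2]
latents = latents.view(B, C, H // 2, 2, W // 2, 2)
latents = latents.permute(0, 1, 3, 5, 2, 4)
latents = latents.reshape(B, C * 4, H // 2, W // 2)
print(latents.shape)  # torch.Size([1, 64, 32, 32])

# Flatten the grid into a token sequence: [B, 4C, H/2, W/2] -> [B, H/2*W/2, 4C]
tokens = latents.reshape(B, C * 4, (H // 2) * (W // 2)).permute(0, 2, 1)
print(tokens.shape)  # torch.Size([1, 1024, 64])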
A Unified Coordinate System for All Tokens
One of the most distinctive features of this pipeline is the use of a shared coordinate system across all modalities. Tokens, whether from text or image, are assigned coordinates using:
coords = torch.cartesian_prod(t, h, w, l)

This produces 4D indices:

(T, H, W, L)

Text tokens vary along L (sequence position)
Image tokens vary along H, W (spatial position)
Multiple images vary along T (temporal index)
This allows all tokens to coexist in a single structured space, even though they originate from different modalities.
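A sketch of how such coordinates can be built (the variable names and the convention of zeroing unused axes are mine; the real pipeline's axis ordering and offsets may differ):

import torch

# Text tokens: fixed (t, h, w), varying sequence position l.
text_coords = torch.cartesian_prod(
    torch.zeros(1, dtype=torch.long),  # T: single "frame"
    torch.zeros(1, dtype=torch.long),  # H: unused for text
    torch.zeros(1, dtype=torch.long),  # W: unused for text
    torch.arange(512),                 # L: 0..511
)
print(text_coords.shape)  # torch.Size([512, 4])

# Image tokens: fixed (t, l), varying spatial position (h, w).
img_coords = torch.cartesian_prod(
    torch.zeros(1, dtype=torch.long),  # T
    torch.arange(32),                  # H: patch rows
    torch.arange(32),                  # W: patch cols
    torch.zeros(1, dtype=torch.long),  # L: unused for images
)
print(img_coords.shape)  # torch.Size([1024, 4])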
Everything Becomes a Sequence
After processing:
Text embeddings:
[B, 512, 12288]

Image latents:
[B, H*W, C]
Both are treated as sequences of tokens. Combined with their coordinate metadata, they can be processed together by a transformer that operates over the entire set.
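Conceptually, the joint sequence is just a concatenation along the token axis, with matching coordinates carried alongside. The sketch below is a heavy simplification: d_model, text_proj, and img_proj are illustrative stand-ins, since the transformer's actual input projections and width are not shown in this section.

import torch

B = 1
prompt_embeds = torch.randn(B, 512, 12288)  # text tokens
image_tokens = torch.randn(B, 1024, 64)     # patch tokens (from the step above)

# Each stream is first projected to a shared width before joint attention.
d_model = 3072  # illustrative
text_proj = torch.nn.Linear(12288, d_model)
img_proj = torch.nn.Linear(64, d_model)

tokens = torch.cat([text_proj(prompt_embeds), img_proj(image_tokens)], dim=1)
print(tokens.shape)  # torch.Size([1, 1536, 3072])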
Key Architectural Insight
This pipeline does not follow the standard pattern of:
Text encoder → pooled embedding → cross-attention → UNet

Instead, it implements something closer to:
Text tokens (multi-layer LLM features)
+ Image tokens (patchified latents)
+ Coordinates (T, H, W, L)
→ Unified transformer processing all tokens jointly

This design removes the distinction between "conditioning" and "generation." Text is no longer compressed into a single vector or used only in cross-attention; it exists alongside image tokens as part of the same sequence.
Conclusion
The FLUX2 Klein pipeline fundamentally rethinks how diffusion models incorporate language. By using Qwen3 as a multi-layer feature extractor, preserving full token sequences, converting images into patch tokens, and introducing a shared coordinate system, it creates a unified token space where text and images are processed together.
The result is a system that behaves less like a traditional diffusion model with conditioning, and more like a multimodal transformer operating over structured tokens.


