100% Vanilla Anime-to-Realistic Image Transformer using Sam Anima

This is a vanilla ComfyUI workflow designed to convert anime, manga, and illustrated images into realistic photographs. It balances pose accuracy with a realistic aesthetic, using a second pass to upscale the image and fix common translation issues (like color fading or oversized stylized eyes). Best of all, it uses ONLY standard ComfyUI nodes, so you no longer have to download a million different weird node packs.

This is a modified version of my earlier workflow that used Florence-2. While Florence-2 is faster and smaller than Qwen3-VL, the nodes needed to use it are touchy, and mine actually broke after an update, so I decided to change my workflow.

🚀 Why This Workflow?

The core motivation behind this setup is pose control. Models trained on illustrated and drawn art often offer a significantly wider variety of dynamic poses, framing options, and composition styles than models trained strictly on real-world photography. This workflow lets you inherit those unique compositions and translate them into a photographic space.

🛠️ Key Core Features

100% Vanilla Nodes: Built entirely using native ComfyUI nodes. If you encounter a missing node error, simply update your ComfyUI installation to the latest version.
Batch-Processing Ready: The Load Image(s) section uses a primitive node set to increment-wrap. It steps through your designated input folder from the lowest file to the highest, looping back to the start automatically if your batch count exceeds the number of files in the folder.
Two-Pass Architecture:
- Pass 1 (Image 1): Takes the initial anime artwork, handles the heavy style translation, and generates the baseline realistic figure.
- Pass 2 (Image 2: High-Res): Upscales the first-pass result using a model-driven upscale (RealESRGAN_x2plus), then applies a correction pass to restore lighting depth, add texture detail, and stabilize color accuracy.
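
To make the increment-wrap behaviour concrete, here is a minimal Python sketch of the idea. It is not the node's actual code: the function names are hypothetical, and sorting by file name and the list of image extensions are my assumptions.

```python
from pathlib import Path

# Assumed set of input formats; adjust to whatever your folder contains.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def list_inputs(folder):
    """Return the image files in a folder, sorted by name (lowest first)."""
    return sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )


def pick_for_run(files, run_index):
    """Increment-wrap: advance one file per run and loop back to the start."""
    if not files:
        raise ValueError("input folder contains no images")
    return files[run_index % len(files)]
```

With three files and a batch count of five, the runs would use file 1, 2, 3, then 1 and 2 again.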

⚙️ Critical Settings & Fine-Tuning

VERY IMPORTANT:

For best results, add the prompt for the character you want to generate to the "Concatenate Text" node, after the quality modifiers (masterpiece, best quality, score_9, score_8, score_7, score_6, ((realistic photo)), proportionate head, bright colours). This ensures all the details are captured correctly. So if you want to generate best girl, like in my title photo, you should use:

"masterpiece, best quality, score_9, score_8, score_7, score_6, ((realistic photograph)) of a caucasian girl, she has a proportionate face and eyes, she is cosplaying as hatsune miku holding a vanilla ice cream cone, she is smiling and looking at the viewer, she holds the ice cream cone in front of her face, she has aqua coloured eyes, and aqua nail polish"
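
If you prepare prompts outside ComfyUI (for example, for a batch), the same "quality modifiers first, character description after" order can be sketched in Python. This is only an illustration of the string layout, not what the Concatenate Text node runs internally; `build_prompt` is a hypothetical helper.

```python
# Quality modifiers taken from the example prompt above.
QUALITY_TAGS = "masterpiece, best quality, score_9, score_8, score_7, score_6"


def build_prompt(quality_tags, character_prompt):
    """Join quality modifiers and the character description with one comma."""
    parts = [quality_tags.strip().rstrip(","), character_prompt.strip()]
    # Skip empty parts so a blank character prompt does not leave a stray comma.
    return ", ".join(p for p in parts if p)


prompt = build_prompt(
    QUALITY_TAGS,
    "((realistic photograph)) of a caucasian girl, she is cosplaying as hatsune miku",
)
```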

To get the absolute best results out of your images, adjust these primary settings:

Denoise Strength (Image 1 KSampler): Default: 0.42. Depending on how highly stylized your input character is (e.g., massive eyes, exaggerated head sizes), you may need to move this value anywhere between 0.36 and 0.63. If the stylized eyes look slightly uncanny on the first pass, don't worry: the subsequent high-res pass will help blend the realism.
Latent Multiplier (Image 2 Group): Default: 1.42. Img2Img style transfers often suffer from washed-out or fading colors as they shift between domains. Tweak this value using the dedicated primitive node during batch runs to maintain or punch up your image contrast and color vibrancy.
Pro-Tip for Stubborn Images: If a character's eyes remain stubbornly large or anime-like due to a highly stylized input, simply take the final output image and route it back through the workflow as the initial input for a clean second cycle, like in the image below.
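
As a rough intuition for why the Latent Multiplier affects contrast, here is a toy Python sketch. It assumes the setting acts as a plain scalar multiply on the latent values, which is my reading of the node rather than something confirmed here, and it uses a short list of numbers in place of a real latent tensor.

```python
def multiply_latent(values, multiplier=1.42):
    """Scale every latent value by the same factor (assumed node behaviour)."""
    return [v * multiplier for v in values]


def value_range(values):
    """Spread between the largest and smallest value, a crude contrast proxy."""
    return max(values) - min(values)


# Toy stand-in for a latent; real latents are multi-channel tensors.
latent = [-0.5, 0.0, 0.25, 1.0]
boosted = multiply_latent(latent)
```

Values near zero barely move while values far from zero move a lot, so the overall spread grows by the multiplier. That is the sense in which a value above 1.0 can counter a washed-out result, and why pushing it too high can overshoot.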

🤖 Optional: VLM Prompt Enhancer

The workflow features an integrated Prompt Enhancer Subgraph powered by a vision-language model (qwen3vl_4b_fp8_scaled).

What it does: It looks at your input illustration, analyzes what it sees (focusing deeply on background details, framing, facial expressions, and gaze direction), and appends a detailed structural description to your core quality prompt. This is incredibly useful for automated batch runs where you want rich, contextual background detail.
VRAM Configuration: It is bypassed by default to accommodate lower-VRAM configurations. If you have the hardware headroom and want to enable it:
1. Un-bypass the Prompt Enhance node group.
2. Toggle the primitive switch next to the user prompt to True.
3. Ensure the Concatenate Text Node remains active, as it functions as the structural bridge for the switch network.
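
The switch logic can be pictured with a small Python sketch. This is a simplified model of the behaviour described above, not the subgraph itself: `final_prompt` and its parameters are hypothetical names, and the fallback for an empty description is my own assumption.

```python
def final_prompt(user_prompt, vlm_description, enhancer_enabled):
    """Append the VLM description only when the switch is set to True."""
    description = vlm_description.strip()
    if enhancer_enabled and description:
        return f"{user_prompt.strip()}, {description}"
    # Switch off (or nothing to append): the user prompt passes through as-is.
    return user_prompt.strip()
```

With the switch off, the prompt you typed is all the sampler sees. With it on, the model's description of the input image is tacked onto the end.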

🧩 Model Requirements

To run this out of the box, make sure you have these in your ComfyUI directories:

UNET Checkpoint: samANIMARealistic_turboV23.safetensors
LoRA: anima-turbo-lora-v0.2.safetensors
CLIP Text Encoder: qwen_3_06b_base.safetensors
Optional CLIP Text Encoder for Prompt Enhancer: qwen3vl_4b_fp8_scaled.safetensors
VAE: qwen_image_vae.safetensors
Upscale Model: RealESRGAN_x2plus.pth
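
If you want to check quickly that everything is in place, a small Python script like the one below can help. The file names come from the list above; the subfolder each file belongs in (`diffusion_models`, `loras`, `text_encoders`, `vae`, `upscale_models`) is my assumption about a typical ComfyUI `models` layout, so adjust it if your install differs. `missing_models` is a hypothetical helper.

```python
from pathlib import Path

# Assumed subfolders under ComfyUI/models; file names are from the list above.
REQUIRED = {
    "diffusion_models": ["samANIMARealistic_turboV23.safetensors"],
    "loras": ["anima-turbo-lora-v0.2.safetensors"],
    "text_encoders": ["qwen_3_06b_base.safetensors"],
    "vae": ["qwen_image_vae.safetensors"],
    "upscale_models": ["RealESRGAN_x2plus.pth"],
}
# Only needed when the Prompt Enhancer is enabled.
OPTIONAL = {"text_encoders": ["qwen3vl_4b_fp8_scaled.safetensors"]}


def missing_models(models_dir, wanted):
    """Return 'subfolder/filename' for every wanted file that is not on disk."""
    root = Path(models_dir)
    return [
        f"{sub}/{name}"
        for sub, names in wanted.items()
        for name in names
        if not (root / sub / name).is_file()
    ]
```

Calling `missing_models("ComfyUI/models", REQUIRED)` returns an empty list when all required files are found, and otherwise lists what still needs downloading.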

Feel free to leave a comment if you run into any runtime configuration issues, and don't forget to share your generations below!