Updated: Dec 8, 2025
Img2Img Inpainting Workflow for Object Removal or Replacement Using Z-Image Turbo
This workflow is for ComfyUI users who want to perform img2img inpainting: mask specific objects in an image and either remove them (by generating background fill) or replace them (by guiding the generation with a prompt). It leverages the Z-Image Turbo model for efficient, high-quality results. The process starts with loading an input image and a mask covering the area you want to modify; you must mask the exact area to be changed yourself, using ComfyUI's masking tools or an external editor. Also adjust the denoise strength in the KSampler node (typically 0.4-0.7) based on how much change you want: lower values preserve more of the original structure, while higher ones allow more creative replacement. Always test with your specific image for best results.
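If you prefer to prepare the mask outside ComfyUI, a mask is just a grayscale image in which white marks the region to regenerate and black marks the region to keep. A minimal Pillow sketch (the file name and rectangle coordinates are illustrative, not part of the workflow):

```python
# Minimal sketch: build a binary inpainting mask with Pillow.
# White (255) = area to regenerate, black (0) = area to preserve.
from PIL import Image, ImageDraw

def make_rect_mask(size, box, out_path="mask.png"):
    """Create a rectangular mask covering the object to remove/replace."""
    mask = Image.new("L", size, 0)                  # start fully black (keep all)
    ImageDraw.Draw(mask).rectangle(box, fill=255)   # white over the target object
    mask.save(out_path)
    return mask

# Hypothetical 1024x1024 image with the object roughly at (400,300)-(700,650).
mask = make_rect_mask((1024, 1024), (400, 300, 700, 650))
```

In practice you would trace the object's outline rather than a rectangle; the point is only that the mask must cover exactly the pixels you want changed.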
Two versions are provided: Mode 1 (a straightforward, basic setup) and Mode 3 (an enhanced version with optimizations for better blending and control). Both use the same core models (Z-Image Turbo UNET, CLIP text encoder, and VAE), but Mode 3 introduces specialized nodes for improved handling of masked areas, especially in complex scenes.
Mode 1: Basic Inpainting Setup
This is the simpler version, ideal for quick tests or users new to inpainting. It focuses on direct latent space manipulation without advanced cropping or blending.
Key Steps and Nodes:
Load your input image and mask via the LoadImage node (e.g., "Z-Image Turbo_00069_.png" with its associated mask).
Load the models: UNETLoader for "z_image_turbo_bf16.safetensors" (the diffusion model), CLIPLoader for "zImage_textEncoder.safetensors" (text encoding), and VAELoader for "zImage_vae.safetensors" (encoding/decoding latents).
Encode the image to latent space using VAEEncode.
Create positive conditioning with CLIPTextEncode (enter your prompt here, e.g., describing the replacement or background like "seamless forest background").
Zero out a copy of the positive conditioning with ConditioningZeroOut to create a neutral negative conditioning (this helps avoid unwanted artifacts).
Apply the mask to the latent with SetLatentNoiseMask, ensuring noise is only generated in the masked area.
Sample the latent using KSampler (configured with euler sampler, simple scheduler, 9 steps, CFG 1, and denoise around 0.66 for moderate changes—randomize seed for variations).
Decode the result with VAEDecode and preview the output image.
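The steps above can be sketched as a graph in ComfyUI's API (JSON) format. The node IDs, the example prompt, and the CLIPLoader "type" value below are assumptions; verify the input names against your ComfyUI build before submitting:

```python
# Sketch of the Mode 1 graph in ComfyUI's API (JSON) format.
# Wiring uses [source_node_id, output_index]; the CLIPLoader "type" is a
# placeholder -- use whatever value your build expects for this encoder.
import json

workflow = {
    "1": {"class_type": "LoadImage",
          "inputs": {"image": "Z-Image Turbo_00069_.png"}},   # image + its mask
    "2": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "z_image_turbo_bf16.safetensors",
                     "weight_dtype": "default"}},
    "3": {"class_type": "CLIPLoader",
          "inputs": {"clip_name": "zImage_textEncoder.safetensors",
                     "type": "stable_diffusion"}},            # placeholder type
    "4": {"class_type": "VAELoader",
          "inputs": {"vae_name": "zImage_vae.safetensors"}},
    "5": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "seamless forest background", "clip": ["3", 0]}},
    "6": {"class_type": "ConditioningZeroOut",
          "inputs": {"conditioning": ["5", 0]}},              # neutral negative
    "7": {"class_type": "VAEEncode",
          "inputs": {"pixels": ["1", 0], "vae": ["4", 0]}},
    "8": {"class_type": "SetLatentNoiseMask",
          "inputs": {"samples": ["7", 0], "mask": ["1", 1]}}, # mask = output 1
    "9": {"class_type": "KSampler",
          "inputs": {"model": ["2", 0], "positive": ["5", 0],
                     "negative": ["6", 0], "latent_image": ["8", 0],
                     "seed": 42, "steps": 9, "cfg": 1.0,
                     "sampler_name": "euler", "scheduler": "simple",
                     "denoise": 0.66}},
    "10": {"class_type": "VAEDecode",
           "inputs": {"samples": ["9", 0], "vae": ["4", 0]}},
}

payload = json.dumps({"prompt": workflow})  # request body for POST /prompt
```

A dict like this can be submitted to a running ComfyUI server by POSTing `{"prompt": workflow}` to its `/prompt` endpoint (typically `http://127.0.0.1:8188/prompt`).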
Strengths: Fast and minimalistic, great for simple removals like erasing small objects. It relies on standard ComfyUI inpainting mechanics, producing clean results with low step counts.
Limitations: May struggle with edge blending in larger masks or complex images, potentially leading to visible seams or inconsistencies around the masked boundaries.
Mode 3: Advanced Inpainting with Cropping, Stitching, and Diffusion Enhancements
This version builds on Mode 1 but incorporates significant upgrades for more professional results, especially when dealing with larger masks or outpainting-like extensions. It's optimized for better context awareness and seamless integration of the inpainted area back into the original image.
Key Steps and Nodes (Building on Mode 1's Foundation):
Similar model loading: UNETLoader, CLIPLoader, and VAELoader as in Mode 1.
LoadImage for input image and mask (e.g., "ComfyUI_00326_.png").
Apply DifferentialDiffusion to the model with strength 1—this enhances the diffusion process for more natural gradients and reduces artifacts in masked regions.
Positive conditioning via CLIPTextEncode (prompt for the desired change), with ConditioningZeroOut for negative.
Use InpaintCropImproved to intelligently crop the masked area. This node resizes/rescales the crop (e.g., bilinear downscale, bicubic upscale), fills mask holes, expands and blends the mask (e.g., 0 pixels expand, 32 pixels blend), applies a high-pass filter (0.1), and extends the context (e.g., a mask-extension factor of 1.2). It also supports pre-resizing to enforce minimum resolutions (e.g., 1024x1024) and output padding (e.g., 32 pixels), and it produces a "stitcher" output used for blending later.
Condition the model with InpaintModelConditioning: Combines positive/negative conditioning, VAE, cropped pixels, and mask (with noise_mask enabled for targeted noise).
Sample with KSampler (similar to Mode 1 but lower denoise like 0.44 for subtler changes, still 9 steps, euler/simple).
Decode with VAEDecode.
Stitch the inpainted crop back into the original using InpaintStitchImproved for seamless blending.
Preview the final image.
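The Mode 3 additions can be sketched as extra nodes wired into a Mode 1-style graph. DifferentialDiffusion and InpaintModelConditioning are stock ComfyUI nodes, while InpaintCropImproved and InpaintStitchImproved come from a custom node pack, so every input name and output index below is an assumption to verify against your install:

```python
# Fragment: the nodes Mode 3 adds on top of a Mode 1-style graph.
# Input names and output indices are assumptions -- check your node pack.
mode3_extras = {
    "20": {"class_type": "DifferentialDiffusion",
           "inputs": {"model": ["2", 0]}},                 # wraps the UNET
    "21": {"class_type": "InpaintCropImproved",
           "inputs": {"image": ["1", 0], "mask": ["1", 1],
                      "mask_expand_pixels": 0,             # values from the text
                      "mask_blend_pixels": 32,
                      "output_padding": 32}},
    "22": {"class_type": "InpaintModelConditioning",
           "inputs": {"positive": ["5", 0], "negative": ["6", 0],
                      "vae": ["4", 0],
                      "pixels": ["21", 1],                 # cropped pixels
                      "mask": ["21", 2],                   # cropped mask
                      "noise_mask": True}},
    "23": {"class_type": "InpaintStitchImproved",
           "inputs": {"stitcher": ["21", 0],               # "stitcher" output
                      "inpainted_image": ["10", 0]}},      # decoded crop
}
# The KSampler would then take its model from node "20", its conditioning and
# latent from node "22", and run with denoise lowered to 0.44.
```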
Strengths: Produces superior edge blending and context matching, making it better for replacing objects in detailed scenes (e.g., removing a person from a crowd and filling with coherent background). The cropping and stitching handle resolution mismatches and outpainting extensions effectively.
Key Differences Between Mode 1 and Mode 3
Core Approach: Mode 1 uses basic latent masking (SetLatentNoiseMask) for direct inpainting, while Mode 3 employs specialized inpainting conditioning (InpaintModelConditioning) combined with pre-processing (InpaintCropImproved) and post-processing (InpaintStitchImproved). This makes Mode 3 more robust for irregular or large masks.
Enhancements in Mode 3:
Cropping and Pre-Processing: InpaintCropImproved adds intelligent mask refinement (hole filling, expansion, blending, high-pass filtering) and contextual extension (e.g., extending boundaries by factors like 1-1.2), which prevents "cut-off" artifacts and improves generation quality—absent in Mode 1.
Diffusion Control: DifferentialDiffusion node modifies the model for smoother, more controlled diffusion in masked areas, leading to fewer hallucinations or mismatches compared to Mode 1's standard sampling.
Stitching and Blending: InpaintStitchImproved ensures the inpainted section blends perfectly back into the full image, addressing potential seams that Mode 1 might leave.
Denoise and Flexibility: Mode 3 uses a lower default denoise (0.44 vs. 0.66 in Mode 1), allowing for more preservation of original details while still enabling changes—adjust based on your needs.
Complexity and Output: Mode 3 is slightly more node-heavy but yields higher-fidelity results, especially for replacement tasks. It suits advanced users and handles more scenarios without manual tweaks.
In both modes, success depends on a precise mask: use ComfyUI's brush tools to cover only the object/area you want to modify, avoiding overlap with parts that should stay unchanged. Set your prompt in CLIPTextEncode to guide the replacement (e.g., "empty space" for removal or "red car" for a swap). Experiment with denoise: start low (0.4) for subtle removals and increase it for bolder changes. These workflows are tailored to the Z-Image Turbo model, so make sure you have its files downloaded.
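One convenient way to run that denoise experiment is to clone an API-format workflow dict once per value and queue each variant. The KSampler node ID ("9") and the base dict here are placeholders:

```python
# Sketch: sweep denoise values by deep-copying a workflow dict per variant.
# Assumes a Mode 1-style API-format graph where node "9" is the KSampler.
import copy

def denoise_variants(workflow, ksampler_id, values):
    """Return one workflow copy per denoise value, leaving the base intact."""
    variants = []
    for d in values:
        wf = copy.deepcopy(workflow)
        wf[ksampler_id]["inputs"]["denoise"] = d
        variants.append(wf)
    return variants

# Minimal placeholder graph; a real one holds the full node set.
base = {"9": {"class_type": "KSampler", "inputs": {"denoise": 0.66, "seed": 42}}}
variants = denoise_variants(base, "9", [0.4, 0.5, 0.66])
# Each variant could then be POSTed to ComfyUI's /prompt endpoint in turn.
```

Keeping the seed fixed across the sweep makes the denoise comparison fair; re-randomize it only once you have picked a strength.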
For a deeper comparison:
| Aspect | Mode 1 (Basic) | Mode 3 (Advanced) |
| --- | --- | --- |
| Inpainting Method | Direct latent masking with SetLatentNoiseMask | Specialized conditioning with InpaintModelConditioning + crop/stitch |
| Mask Handling | Basic application | Advanced refinement (expand, blend, filter, extend) via InpaintCropImproved |
| Diffusion Enhancer | None | DifferentialDiffusion for better control |
| Blending | Relies on sampler | Dedicated InpaintStitchImproved for seamless integration |
| Default Denoise | 0.66 (more transformative) | 0.44 (more preservative) |
| Best For | Quick, simple removals | Complex replacements with better edges |
| Node Count | Fewer (simpler graph) | More (for added features) |
