Track any subject with pixel-perfect accuracy for stunning VFX results.
Who it's for: creators who want this pipeline in ComfyUI without assembling nodes from scratch. Not for: one-click results with zero tuning — you still choose inputs, prompts, and settings.
Open preloaded workflow on RunComfy
Why RunComfy first
- Fewer missing-node surprises — run the graph in a managed environment before you mirror it locally.
- Quick GPU tryout — useful if your local VRAM or install time is the bottleneck.
- Matches the published JSON — the zip follows the same runnable workflow you can open on RunComfy.
When downloading for local ComfyUI makes sense — you want full control over models on disk, batch scripting, or offline runs.
How to use (local ComfyUI)
1. Load inputs (images/video/audio) in the marked loader nodes.
2. Set prompts, resolution, and seeds; start with a short test run.
3. Export from the Save / Write nodes shown in the graph.
Expectations — First run may pull large weights; cloud runs may require a free RunComfy account.
Overview
This workflow isolates and tracks objects across video frames with pixel-level accuracy, generating clean, frame-consistent masks and motion data for compositing and advanced VFX tasks. Whether you need character isolation, background cleanup, or targeted edits, you can guide the process with text prompts or visual references. It suits creators who want accurate, frame-consistent segmentation for visual effects and AI-driven editing.
Key nodes in the ComfyUI Grounding workflow
GroundingDetector (#1)
Core detector turning your text prompt into bounding boxes. Raise the score threshold for fewer false positives; lower it if the target is small or partially occluded. Keep prompts short and specific, for example “red umbrella” rather than long sentences. Use this node to drive both segmentation and visualization stages downstream.
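The effect of the score threshold can be illustrated with a minimal sketch. The arrays below are made-up stand-ins for detector output (not the node's real data format): raising the threshold keeps only high-confidence boxes, lowering it admits weaker matches such as small or occluded targets.

```python
import numpy as np

# Hypothetical detector output: boxes as (x1, y1, x2, y2) with confidence scores.
boxes = np.array([[10, 10, 50, 60], [100, 20, 140, 80], [30, 30, 40, 40]])
scores = np.array([0.92, 0.41, 0.18])

def filter_by_score(boxes, scores, threshold):
    keep = scores >= threshold
    return boxes[keep], scores[keep]

strict_boxes, _ = filter_by_score(boxes, scores, 0.5)    # fewer false positives
lenient_boxes, _ = filter_by_score(boxes, scores, 0.25)  # catches weaker matches
```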
Sam2Segment (#11)
Refines coarse boxes into crisp masks using SAM 2. Feed it boxes from GroundingDetector; add a few positive or negative points only when the boundary needs extra guidance. If the subject and background flip, pair with InvertMask for the intended cutout. Use the result wherever an alpha matte is required.
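Conceptually, the InvertMask pairing amounts to flipping mask values, so a subject cutout becomes a background cutout. A minimal numpy sketch, assuming an 8-bit mask (the actual node internals may differ):

```python
import numpy as np

# 8-bit mask: 255 = subject, 0 = background.
mask = np.array([[0, 255], [255, 0]], dtype=np.uint8)

# Inverting swaps subject and background, which is what pairing with
# InvertMask achieves when the cutout comes out backwards.
inverted = 255 - mask
```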
GroundingMaskDetector (#22)
Generates a semantic mask directly from a natural-language instruction. This is best when you want a one-click selection without assembling a detection-to-segmentation chain. Tighten the text and increase confidence if multiple regions are being picked up; broaden the wording to include variations when the subject is missed.
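When a loose prompt picks up several regions, one common fallback (an illustrative technique, not something this node is documented to do) is to keep only the largest candidate mask:

```python
import numpy as np

# Hypothetical candidate masks returned for an ambiguous prompt.
masks = [
    np.zeros((4, 4), dtype=bool),  # small stray region
    np.ones((4, 4), dtype=bool),   # the intended subject
]
masks[0][0, 0] = True

# Keep the mask covering the most pixels.
largest = max(masks, key=lambda m: m.sum())
```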
JoinImageWithAlpha (#14)
Composites the original image with the mask into an RGBA output for downstream editors. Use it when you need transparent backgrounds, selective effects, or layered comp work. Combine with InvertMask to switch between isolating the subject and cutting the subject out.
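Conceptually, joining an image with an alpha channel stacks the mask as a fourth channel; inverting that mask first flips the composite from "isolate the subject" to "remove the subject". A numpy sketch on a tiny stand-in image (the node's internals may differ):

```python
import numpy as np

rgb = np.full((2, 2, 3), 200, dtype=np.uint8)          # tiny stand-in image
mask = np.array([[255, 0], [0, 255]], dtype=np.uint8)  # alpha matte: 255 = keep

rgba = np.dstack([rgb, mask])          # subject opaque, background transparent
cutout = np.dstack([rgb, 255 - mask])  # inverted: subject removed instead
```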
VHS_LoadVideo (#32)
Splits a video into frames and extracts audio for processing. If your source has a variable frame rate, rely on the loaded frame rate it reports to keep timing consistent. This node is the entry point for any frame-by-frame detection or segmentation across a clip.
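The reason to trust the reported frame rate is that per-frame timestamps are derived from it, which keeps frames and the extracted audio aligned. A small sketch (the fps value is an example, not taken from the workflow):

```python
fps = 23.976   # frame rate as reported by the loader (example value)
n_frames = 5

# Timestamp of each frame, in seconds from the start of the clip.
timestamps = [i / fps for i in range(n_frames)]
```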
VHS_VideoCombine (#39)
Re-encodes processed frames into an MP4 while preserving audio. Match the frame rate to the value reported upstream to avoid time drift. Use the filename prefix to keep different runs organized in your output folder.
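To see why matching the upstream frame rate matters, here is a quick drift calculation with example values: rounding 23.976 fps up to 24 fps shifts a 100-second clip by about a tenth of a second relative to its audio.

```python
# Why matching frame rates matters: even a small mismatch accumulates drift.
src_fps = 23.976   # rate reported upstream by the loader (example value)
out_fps = 24.0     # accidentally rounded when encoding
n_frames = 2400    # ~100 seconds of footage

# Difference between the true duration and the re-encoded duration.
drift_seconds = n_frames / src_fps - n_frames / out_fps
```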
Notes
ComfyUI Grounding Workflow | Accurate Object Tracking & Segmentation — see RunComfy page for the latest node requirements.

