Auto-Generate Burned-In Subtitles for Any Video with Whisper in ComfyUI

You have a video. You need subtitles. Typing them manually takes forever. Syncing them in an editor takes even longer.

Upload once. Get your video back with subtitles already baked in.

Video in. Subtitled video out. No manual transcription, no syncing, no editor.

Run it now on Floyo!

How It Works

Upload your video. The workflow extracts the audio, and Whisper's large model transcribes it with word-level timing, so each word appears on screen exactly when it's spoken. The workflow then burns the text directly onto the frames and exports a clean MP4 with subtitles embedded.
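The word-level timing described above can be sketched in Python. This is a minimal illustration, not the workflow's actual node code. It assumes the open-source openai-whisper package, whose transcribe call accepts word_timestamps=True and returns segments that each carry a list of words with start and end times. The helper names word_cues and transcribe_words, and the sample data, are hypothetical.

```python
from typing import Dict, List, Tuple

Cue = Tuple[str, float, float]  # (word, start_seconds, end_seconds)


def word_cues(result: Dict) -> List[Cue]:
    """Flatten a Whisper-style result into one timed cue per word."""
    cues: List[Cue] = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            # Whisper usually prefixes words with a space, so strip it.
            cues.append((w["word"].strip(), float(w["start"]), float(w["end"])))
    return cues


def transcribe_words(video_path: str, model_name: str = "large") -> List[Cue]:
    """Assumption: openai-whisper is installed and ffmpeg is on PATH."""
    import whisper  # imported lazily so word_cues works without the package

    model = whisper.load_model(model_name)
    result = model.transcribe(video_path, word_timestamps=True)
    return word_cues(result)


# Hand-written sample in the shape returned when word_timestamps=True.
sample = {
    "segments": [
        {
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.4},
                {"word": " world", "start": 0.4, "end": 0.9},
            ]
        }
    ]
}
cues = word_cues(sample)
```

Each cue is a word plus the time window in which it should be on screen, which is the information a burn-in step needs to draw the text on the right frames.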

No SRT files. No separate subtitle editor. No syncing step.

Language detection is automatic. If auto-detection picks the wrong language, you can override it manually.

Key Settings

Whisper Model: Default is large. Best accuracy across languages, accents, and background noise. Smaller models (medium, small, base, tiny) run faster but miss more words. Stick with large unless speed is the priority.

Language: Default is auto. Whisper detects the spoken language from the audio. Supports dozens of languages. Override manually if auto-detection gets it wrong.

Font Color: Default white. For bright or busy backgrounds try yellow or black for contrast. Accepts color names or hex codes.

Font Family: Default Roboto-Bold. Bold fonts read better as subtitles, especially at smaller sizes or over moving footage.

Font Size: Default 40. Suggested ranges by resolution:

720p: 28–36
1080p: 36–48
4K: 56–72
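As a rough way to pick a starting value, the ranges above can be put in a lookup table. The sketch below is only an illustration: the function name suggested_font_size is hypothetical, 4K is assumed to mean a 2160-pixel frame height, and returning the midpoint of each range is an arbitrary choice rather than anything the workflow does.

```python
# Suggested ranges from the list above, keyed by frame height in pixels.
# Assumption: "4K" means 3840x2160, so its height is 2160.
FONT_SIZE_RANGES = {720: (28, 36), 1080: (36, 48), 2160: (56, 72)}


def suggested_font_size(frame_height: int) -> int:
    """Return the midpoint of the range for the nearest listed resolution."""
    nearest = min(FONT_SIZE_RANGES, key=lambda h: abs(h - frame_height))
    low, high = FONT_SIZE_RANGES[nearest]
    return (low + high) // 2
```

For a 1080p video this gives 42, close to the default of 40. Treat the result as a first guess and adjust by eye.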

Position (X, Y, Center)

Center X is on by default, so subtitles stay horizontally centered. The Y position controls vertical placement. Suggested starting points:

Classic bottom subtitles on 1080p: Y around 900
Classic bottom on 720p: Y around 600
Centered social captions: turn on both Center X and Center Y
Vertical 9:16 video near bottom: Y around 1600–1800
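The bottom-placement values above all sit at roughly five sixths of the frame height (900 of 1080, 600 of 720, 1600 of 1920). A small helper can reproduce that for other resolutions. This is a sketch under that observed ratio, not a rule taken from the workflow, and the function name bottom_subtitle_y is hypothetical.

```python
def bottom_subtitle_y(frame_height: int) -> int:
    """Estimate a bottom-subtitle Y value as 5/6 of the frame height.

    Assumption: the 5/6 ratio is inferred from the suggested values,
    not a documented setting.
    """
    return (frame_height * 5) // 6
```

For a vertical 1080x1920 clip this returns 1600, the low end of the suggested 1600–1800 range. Move the value higher if the text sits too far up the frame.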

What This Is Great For

Social media content: TikTok, Reels, and Shorts reward captioned video, and many viewers watch without sound. Get burned-in subtitles from one upload, no editing app required.

Talking head and interview edits: Whisper handles conversational speech well. Word-level timing means text appears naturally as each word is spoken, not in large chunks.

Multilingual content: Auto language detection handles non-English footage without changing any settings. For clips that mix several languages, set the dominant language manually for best accuracy.

AI-generated video finishing: Add subtitles to AI-generated talking head clips, voiceover videos, or lip-synced content before posting.

What to Watch Out For

Subtitles are burned into the pixels, and there is no editable SRT file output. Check the Preview Text output for transcription errors before running the full export. If you find one, adjust the settings (for example, the language override or the model size) and re-run.

Heavy background music, overlapping speakers, or strong accents reduce Whisper's accuracy. The large model handles these better than the smaller models, but it's not perfect. Review the preview text on complex audio before committing to the final render.

This workflow is not the right tool if you need an editable subtitle file for distribution or accessibility compliance. It's built for burned-in subtitles on a final video output.

Processing time scales with video length. For very long videos, test on a short clip first to confirm your font, size, and position settings before running the full file.
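One way to make that short test clip is to cut the first few seconds locally before uploading. The sketch below builds an ffmpeg command from Python. It assumes ffmpeg is installed on your machine and is separate from the workflow itself. The function names and the 15-second default are hypothetical.

```python
import subprocess
from typing import List


def build_trim_command(src: str, dst: str, seconds: float = 15) -> List[str]:
    """Build an ffmpeg command that copies the first `seconds` of src to dst."""
    return [
        "ffmpeg",
        "-y",                # overwrite dst if it already exists
        "-i", src,           # input file
        "-t", str(seconds),  # stop writing after this many seconds
        "-c", "copy",        # stream copy, so nothing is re-encoded
        dst,
    ]


def make_test_clip(src: str, dst: str, seconds: float = 15) -> None:
    """Run the trim. Assumption: ffmpeg is available on PATH."""
    subprocess.run(build_trim_command(src, dst, seconds), check=True)
```

Because the streams are copied rather than re-encoded, the cut is fast and leaves quality unchanged, so the clip is a fair preview of how the subtitles will look on the full video.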