Generate Images with Correctly Spelled Text Built In Using Ovis in ComfyUI

You need a banner. A social post. A product shot with a headline on it. You generate it, and the text comes out garbled, misspelled, or illegible.

Most image models have this problem. Ovis is built to fix it.

Describe the image and the text. Get both rendered correctly in one generation.

Run it now on Floyo!

Why Ovis

Most image models treat text as just another visual element. They approximate it. Letters get scrambled, words get merged, spelling is unreliable.

Ovis is a 7B model with a text-centric training recipe. It combines a multimodal backbone with a diffusion visual decoder specifically tuned to place words accurately inside images. Headlines, taglines, button labels, and product copy come out spelled correctly, legibly rendered, and positioned where you describe them. It works in English and Chinese.

correctly spelled text rendered inside the image
works for headlines, taglines, buttons, labels, and UI copy
English and Chinese language support
fast enough to run on a single GPU or cloud endpoint
benchmarks show word accuracy rivaling much larger models

How to Write Prompts

Ovis needs three things in the prompt to render text well: the scene, the exact text in quotes, and the layout direction.

Prompt structure that works:

Describe the scene: "minimal product shot of a skincare bottle on a soft beige background, soft shadows"
Specify exact text in quotes: "headline at top: 'Glow Serum', tagline below: 'Radiance in every drop'"
Add layout and style: "centered layout, clean sans-serif font, high contrast, 4:5 vertical"

Full examples:

"hero banner for a fitness app, dark background with energy lighting, bold headline: 'Train Harder', subheading: 'Your goals. Your pace.', centered, modern sans-serif font"
"minimal product label for a coffee bag, kraft paper texture, text: 'Dark Roast', smaller text below: 'Single Origin Ethiopia', clean typography"
"social media post, soft pastel background, motivational quote: 'Start before you're ready.', small attribution text: '@brandname', Instagram square format"
"UI mockup of a mobile app screen, clean white background, button labeled 'Get Started', navigation items: 'Home', 'Explore', 'Profile', realistic product screenshot style"

Always put the exact text you want rendered inside single quotes within your prompt. Ovis reads the quoted strings and places them accurately.
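
The three-part structure above (scene, quoted text, layout) can be assembled with a small helper. This is a minimal sketch, not part of Ovis or ComfyUI: build_prompt and its parameters are hypothetical names, and the result is just a string to paste into the workflow's prompt field.

```python
def build_prompt(scene, texts, layout):
    """Join scene, quoted text items, and layout into one prompt string.

    texts is a list of (role, text) pairs, e.g. ("headline at top", "Glow Serum").
    Each text is wrapped in single quotes so the exact wording is explicit.
    """
    text_parts = [f"{role}: '{text}'" for role, text in texts]
    return ", ".join([scene, *text_parts, layout])


prompt = build_prompt(
    scene="minimal product shot of a skincare bottle on a soft beige background, soft shadows",
    texts=[
        ("headline at top", "Glow Serum"),
        ("tagline below", "Radiance in every drop"),
    ],
    layout="centered layout, clean sans-serif font, high contrast, 4:5 vertical",
)
print(prompt)
```

Keeping the text items as separate pairs makes it easy to swap a headline without retyping the scene or layout.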

What This Is Great For

Marketing graphics and ad creatives: Hero banners, social posts, and ads with headlines and CTAs baked directly into the image. No Photoshop typography work after generation.

UI and app mockups: Screens with legible buttons, menus, and panel titles that look like real product screenshots. Useful for pitch decks, design reviews, and rapid prototyping.

Posters and presentations: Titles, subtitles, and copy blocks integrated into a designed layout in one generation.

Product packaging and labels: Simple wordmarks, product labels, and stickers where spelling accuracy is non-negotiable.

A/B test variants: Generate multiple versions of the same layout with different headlines or CTAs. Only the text changes across generations, while the design stays consistent.
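
For A/B variants, one way to keep everything but the headline fixed is to fill a single prompt template. This is a rough sketch: ab_variants is a hypothetical helper, and "Train Smarter" is an invented second headline used only for illustration.

```python
def ab_variants(template, headlines):
    """Return one prompt per headline; the rest of the template is unchanged."""
    return [template.format(headline=h) for h in headlines]


template = (
    "hero banner for a fitness app, dark background with energy lighting, "
    "bold headline: '{headline}', subheading: 'Your goals. Your pace.', "
    "centered, modern sans-serif font"
)

# "Train Smarter" is a made-up alternative for the example.
variants = ab_variants(template, ["Train Harder", "Train Smarter"])
for v in variants:
    print(v)
```

The sketch only controls the prompt text. How closely the layouts match will also depend on sampler settings such as the seed, which it does not touch.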

What to Watch Out For

Always quote the exact text you want in your prompt. Describing text vaguely ("add a headline about skincare") gives the model too much latitude. Specifying exactly what to say ("headline: 'Pure Glow Serum'") produces accurate results.

Long paragraphs of body copy are harder to render cleanly than short headlines and taglines. Keep in-image text to headlines, short phrases, button labels, and brief copy. Ovis excels at those. Dense paragraph text is better added in a design tool after generation.
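
The two checks above (quote the exact text, keep it short) can be run on a prompt before generating. This is a minimal sketch under stated assumptions: check_prompt is a hypothetical helper, and the 8-word cutoff is an arbitrary threshold, not a limit documented for Ovis.

```python
import re

# Matches 'single-quoted' spans while letting apostrophes inside words
# (as in "you're") stay part of the quoted text.
QUOTED = re.compile(r"(?<!\w)'(.+?)'(?!\w)")

MAX_WORDS = 8  # assumed cutoff for "short" text; adjust as needed


def check_prompt(prompt, max_words=MAX_WORDS):
    """Return a list of warnings; an empty list means nothing was flagged."""
    quoted = QUOTED.findall(prompt)
    if not quoted:
        return ["no quoted text found; put the exact wording in single quotes"]
    return [
        f"'{text}' has {len(text.split())} words; consider adding it in a design tool"
        for text in quoted
        if len(text.split()) > max_words
    ]


ok = check_prompt(
    "social media post, soft pastel background, motivational quote: "
    "'Start before you're ready.', small attribution text: '@brandname', "
    "Instagram square format"
)
vague = check_prompt("add a headline about skincare")
print(ok)
print(vague)
```

The pattern is a heuristic: a trailing possessive apostrophe outside the quotes could confuse it, so treat the warnings as hints.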

Complex custom typography (hand-lettered scripts, intricate decorative fonts) may not render as precisely as clean sans-serif or serif styles. Describe font style simply: "bold sans-serif," "elegant serif," "clean modern typeface."

Resolution caps at around 1024x768 depending on aspect ratio. For larger print-ready output, generate at the highest available resolution and upscale in a separate workflow.
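
To estimate how much upscaling a print job needs, compare the generated size with the pixel size of the print. This is an illustrative calculation only: upscale_factor is a hypothetical helper, 300 DPI is a common print convention assumed here, and the 10 x 7.5 inch target is an example.

```python
def upscale_factor(gen_w, gen_h, print_w_in, print_h_in, dpi=300):
    """Scale factor needed for the upscaled image to cover the print size at dpi."""
    target_w = print_w_in * dpi  # required width in pixels
    target_h = print_h_in * dpi  # required height in pixels
    # Take the larger ratio so both dimensions are covered.
    return max(target_w / gen_w, target_h / gen_h)


# 1024x768 source, 10 x 7.5 inch print at the assumed 300 DPI
factor = upscale_factor(1024, 768, 10, 7.5)
print(f"{factor:.2f}x")  # 2.93x
```

In this example a 3x upscale pass would cover the target with a small margin.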