Update 5/6 9:20pm UTC: OpenAI is now rate limiting us due to a huge traffic spike! We've had to temporarily remove the model from generation; it will be re-enabled shortly!
Originally detailed at - https://openai.com/index/introducing-4o-image-generation/
Check out our Guide to using GPT Image 1!
Useful image generation
From the first cave paintings to modern infographics, humans have used visual imagery to communicate, persuade, and analyze—not just to decorate. Today's generative models can conjure surreal, breathtaking scenes, but struggle with the workhorse imagery people use to share and create information. From logos to diagrams, images can convey precise meaning when augmented with symbols that refer to shared language and experience.
GPT‑4o image generation excels at accurately rendering text, precisely following prompts, and leveraging 4o’s inherent knowledge base and chat context—including transforming uploaded images or using them as visual inspiration. These capabilities make it easier to create exactly the image you envision, helping you communicate more effectively through visuals and advancing image generation into a practical tool with precision and power.
Improved capabilities
We trained our models on the joint distribution of online images and text, learning not just how images relate to language, but how they relate to each other. Combined with aggressive post-training, the resulting model has surprising visual fluency, capable of generating images that are useful, consistent, and context-aware.
Text rendering
A picture is worth a thousand words, but sometimes generating a few words in the right place can elevate the meaning of an image. 4o’s ability to blend precise symbols with imagery turns image generation into a tool for visual communication.
Multi-turn generation
Because image generation is now native to GPT‑4o, you can refine images through natural conversation. GPT‑4o can build upon images and text in chat context, ensuring consistency throughout. For example, if you’re designing a video game character, the character’s appearance remains coherent across multiple iterations as you refine and experiment.
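As a rough sketch of that workflow outside ChatGPT, the API flavor of this model (gpt-image-1) exposes generate and edit endpoints. The snippet below assumes the official openai Python SDK; the prompts and file names are illustrative, and feeding the previous render back into the edits endpoint stands in for chat context:

```python
import base64
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

# First turn: generate the initial character concept.
first = client.images.generate(
    model="gpt-image-1",
    prompt="A video game character: a small robot explorer with a round blue helmet",
    size="1024x1024",
)
with open("character_v1.png", "wb") as f:
    f.write(base64.b64decode(first.data[0].b64_json))

# Second turn: refine the same character, keeping its appearance
# consistent by passing the previous render back in as the image to edit.
second = client.images.edit(
    model="gpt-image-1",
    image=open("character_v1.png", "rb"),
    prompt="Same robot explorer, now holding a lantern in a dark cave",
)
with open("character_v2.png", "wb") as f:
    f.write(base64.b64decode(second.data[0].b64_json))
```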
Instruction following
GPT‑4o’s image generation follows detailed prompts closely. While other systems struggle with roughly 5-8 distinct objects, GPT‑4o can handle 10-20, and its tighter binding of objects to their traits and relations allows for better control.
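To make the object-binding claim concrete, here is one illustrative prompt in that 10-20-object regime, again sketched against the gpt-image-1 API; the scene, counts, and file name are our own, not from the original post:

```python
import base64
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

# One prompt binding distinct traits (color, count, position, rendered text)
# to roughly a dozen distinct objects -- the regime described above.
prompt = (
    "A farmers' market stall, photorealistic: a red apple on a white plate, "
    "a green pear in a wicker basket, three yellow lemons in a glass bowl, "
    "a chalkboard sign reading 'FRESH TODAY', a striped awning overhead, "
    "and a tabby cat asleep under the table"
)

result = client.images.generate(model="gpt-image-1", prompt=prompt)
with open("market.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```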
In-context learning
GPT‑4o can analyze and learn from user-uploaded images, seamlessly integrating their details into its context to inform image generation.
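A minimal sketch of that flow via the API, assuming the official openai Python SDK: gpt-image-1's edits endpoint accepts uploaded reference images alongside a prompt (the file names and the two-image combination here are illustrative):

```python
import base64
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

# Use uploaded images as visual context: pass reference images to the
# edits endpoint and describe how they should inform the new image.
result = client.images.edit(
    model="gpt-image-1",
    image=[open("logo.png", "rb"), open("product_photo.png", "rb")],
    prompt=(
        "A launch banner combining this logo with this product photo, "
        "on a clean white background"
    ),
)
with open("banner.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```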
Safety
In line with our Model Spec, we aim to maximize creative freedom by supporting valuable use cases like game development, historical exploration, and education—while maintaining strong safety standards. At the same time, it remains as important as ever to block requests that violate those standards. Below are evaluations of additional risk areas where we're working to enable safe, high-utility content and support broader creative expression for users.
Provenance via C2PA and internal reversible search
All generated images include C2PA metadata identifying them as coming from GPT‑4o, providing transparency. We’ve also built an internal search tool that uses technical attributes of generations to help verify whether content came from our model.
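Anyone can inspect that metadata. A minimal sketch, assuming the open-source c2patool CLI from the Content Authenticity Initiative is installed (the internal reversible-search tool is not public, and the file name is illustrative):

```python
import json
import subprocess

# Read the C2PA manifest embedded in a generated image. Running
# `c2patool <file>` prints the manifest store as JSON; a GPT-4o
# generation's manifest identifies it as coming from the model.
report = subprocess.run(
    ["c2patool", "generated.png"],
    capture_output=True, text=True, check=True,
)
manifest = json.loads(report.stdout)
print(json.dumps(manifest, indent=2))
```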
Blocking the bad stuff
We’re continuing to block requests for generated images that may violate our content policies, such as child sexual abuse materials and sexual deepfakes. When images of real people are in context, we have heightened restrictions regarding what kind of imagery can be created, with particularly robust safeguards around nudity and graphic violence. As with any launch, safety is never finished; it is an ongoing area of investment. As we learn more about real-world use of this model, we’ll adjust our policies accordingly.
For more on our approach, visit the image generation addendum to the GPT‑4o system card.
Using reasoning to power safety
Similar to our deliberative alignment work, we’ve trained a reasoning LLM to work directly from human-written and interpretable safety specifications. We used this reasoning LLM during development to help us identify and address ambiguities in our policies. Together with our multimodal advancements and existing safety techniques developed for ChatGPT and Sora, this allows us to moderate both input text and output images against our policies.
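The safety-reasoning model described here is internal, but as a loose public analogue, OpenAI's Moderation API can screen text and images together against policy categories. A minimal sketch, assuming the official openai Python SDK (the prompt text and image URL are placeholders):

```python
from openai import OpenAI  # assumes the official openai Python SDK

client = OpenAI()

# Screen an input prompt and an output image together against
# moderation categories; `flagged` is True if any category trips.
resp = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "user prompt text here"},
        {"type": "image_url", "image_url": {"url": "https://example.com/output.png"}},
    ],
)
print(resp.results[0].flagged)
print(resp.results[0].categories)
```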