161 AI models go beyond text-only interaction. These multimodal models can see images, generate visuals, or accept multiple input types such as audio and video alongside text, enabling richer, more capable AI applications. Average score: 65.
Models that accept images alongside text prompts for visual understanding.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 94 |
| 2 | GPT-5.4 | OpenAI | 94 |
| 3 | GPT-5.4 Mini | OpenAI | 93 |
| 4 | GPT-5.2 Pro | OpenAI | 93 |
| 5 | GPT-5.2 | OpenAI | 93 |
| 6 | Claude Opus 4.6 | Anthropic | 92 |
| 7 | GPT-5 Pro | OpenAI | 92 |
| 8 | o3 Deep Research | OpenAI | 92 |
| 9 | Claude Opus 4.5 | Anthropic | 90 |
| 10 | Gemini 3 Pro Preview | Google | 90 |
| 11 | GPT-5 | OpenAI | 90 |
| 12 | Gemini 3 Flash Preview | Google | 89 |
| 13 | Claude Sonnet 4.6 | Anthropic | 89 |
| 14 | Claude Sonnet 4.5 | Anthropic | 89 |
| 15 | o3 Pro | OpenAI | 88 |
| 16 | Grok 4.1 Fast | xAI | 87 |
| 17 | Grok 4 | xAI | 86 |
| 18 | Grok 4.20 Beta | xAI | 86 |
| 19 | o3 | OpenAI | 86 |
| 20 | Gemini 3.1 Pro Preview | Google | 86 |
| 21 | GPT-5.1 | OpenAI | 85 |
| 22 | MiMo-V2-Omni | Xiaomi | 85 |
| 23 | GPT-5.4 Nano | OpenAI | 85 |
| 24 | Seed-2.0-Lite | ByteDance | 85 |
| 25 | GPT-5.3 Chat | OpenAI | 85 |
| 26 | Seed-2.0-Mini | ByteDance | 85 |
| 27 | Gemini 3.1 Pro Preview Custom Tools | Google | 85 |
| 28 | GPT-5.3-Codex | OpenAI | 85 |
| 29 | Qwen3.5 Plus 2026-02-15 | Alibaba | 85 |
| 30 | Kimi K2.5 | Moonshot AI | 85 |
Models that can generate or edit images from text or multimodal prompts.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Nano Banana 2 (Gemini 3.1 Flash Image Preview) | Google | — |
| 2 | Nano Banana Pro (Gemini 3 Pro Image Preview) | Google | — |
| 3 | GPT-5 Image Mini | OpenAI | — |
| 4 | GPT-5 Image | OpenAI | — |
| 5 | Nano Banana (Gemini 2.5 Flash Image) | Google | — |
| 6 | Midjourney v6.1 | Midjourney | — |
| 7 | DALL-E 3 | OpenAI | — |
| 8 | Stable Diffusion 3.5 | Stability AI | — |
| 9 | FLUX.1 Pro | Black Forest Labs | — |
| 10 | Ideogram 2.0 | Ideogram | — |
| 11 | Recraft V3 | Recraft | — |
| 12 | Imagen 3 | Google | — |
| 13 | Adobe Firefly 3 | Adobe | — |
| 14 | Leonardo Phoenix | Leonardo AI | — |
Models accepting three or more input types such as text, image, audio, and video.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 94 |
| 2 | GPT-5.4 | OpenAI | 94 |
| 3 | GPT-5.4 Mini | OpenAI | 93 |
| 4 | GPT-5.2 Pro | OpenAI | 93 |
| 5 | GPT-5.2 | OpenAI | 93 |
| 6 | GPT-5 Pro | OpenAI | 92 |
| 7 | o3 Deep Research | OpenAI | 92 |
| 8 | Claude Opus 4.5 | Anthropic | 90 |
| 9 | Gemini 3 Pro Preview | Google | 90 |
| 10 | GPT-5 | OpenAI | 90 |
| 11 | Gemini 3 Flash Preview | Google | 89 |
| 12 | Claude Sonnet 4.5 | Anthropic | 89 |
| 13 | o3 Pro | OpenAI | 88 |
| 14 | o3 | OpenAI | 86 |
| 15 | Gemini 3.1 Pro Preview | Google | 86 |
| 16 | GPT-5.1 | OpenAI | 85 |
| 17 | MiMo-V2-Omni | Xiaomi | 85 |
| 18 | GPT-5.4 Nano | OpenAI | 85 |
| 19 | Seed-2.0-Lite | ByteDance | 85 |
| 20 | GPT-5.3 Chat | OpenAI | 85 |
| 21 | Seed-2.0-Mini | ByteDance | 85 |
| 22 | Gemini 3.1 Pro Preview Custom Tools | Google | 85 |
| 23 | GPT-5.3-Codex | OpenAI | 85 |
| 24 | Qwen3.5 Plus 2026-02-15 | Alibaba | 85 |
| 25 | Seed 1.6 Flash | ByteDance | 85 |
| 26 | Seed 1.6 | ByteDance | 85 |
| 27 | GPT-5.1 Chat | OpenAI | 85 |
| 28 | o4 Mini Deep Research | OpenAI | 85 |
| 29 | o4 Mini High | OpenAI | 85 |
| 30 | Gemini 2.5 Pro | Google | 85 |
Multimodal AI models can process and generate more than just text. They understand images, diagrams, screenshots, and in some cases audio or video. This lets you build applications that interact with the world the way humans do - through multiple senses.
Vision-capable models accept images alongside text prompts. They can describe photos, extract text via OCR, analyze charts, review UI designs, and answer questions about visual content. Most frontier models now include vision as a core capability.
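As a minimal sketch of how an image is paired with a text prompt, the snippet below builds a request message in the content-parts shape used by OpenAI-style chat APIs, where the image travels inline as a base64 data URL. It uses only the standard library, and the placeholder byte string stands in for a real image file read from disk.

```python
import base64
import json

def build_vision_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a single user message pairing a text prompt with an inline image.

    Follows the content-parts format of OpenAI-style chat APIs: the image is
    embedded as a base64 data URL next to the text part.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# In real use, image_bytes would come from open("chart.png", "rb").read().
msg = build_vision_message("What does this chart show?", b"\x89PNG placeholder")
print(json.dumps(msg)[:80])
```

The same message dict is what you would place in the `messages` list of a chat completion request; only the model name and API client differ between providers.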
Some models can generate new images from text descriptions or edit existing ones. These range from dedicated image generators to unified models that handle both text and image output in a single conversation, like GPT-4o with image generation.
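On the output side, image APIs commonly return the generated picture as a base64-encoded string (for example, a `b64_json` field in OpenAI's Images API). The sketch below, using only the standard library, shows the decode-and-save step; the response payload here is simulated rather than fetched from a real API.

```python
import base64

def save_generated_image(b64_json: str, path: str) -> int:
    """Decode a base64-encoded generated image and write the raw bytes to disk.

    Returns the number of bytes written, which is handy for sanity checks.
    """
    raw = base64.b64decode(b64_json)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Simulated response; a real API call returns a much larger base64 string.
fake_response = {"data": [{"b64_json": base64.b64encode(b"\x89PNG fake").decode()}]}
n = save_generated_image(fake_response["data"][0]["b64_json"], "out.png")
print(n, "bytes written")
```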
Common use cases include document analysis and OCR, screenshot-to-code generation, chart interpretation, medical imaging, accessibility descriptions, visual question answering, creative image generation, diagram-to-code conversion, and agentic workflows that require visual understanding.
Multimodal AI models can process multiple types of input - text, images, audio, and video - in a single interaction. They understand context across modalities, like analyzing a chart image while answering text questions about it.
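Mixing modalities in one interaction comes down to assembling several typed content parts into a single message. The sketch below extends the vision shape with an audio part, following the `input_audio` format used by OpenAI's audio-capable chat models; it is an illustrative payload builder, not a complete client.

```python
import base64

def multimodal_message(text: str, image_bytes: bytes = None, audio_bytes: bytes = None) -> dict:
    """Assemble one user message mixing text, image, and audio content parts.

    Image parts use a base64 data URL; audio parts use the `input_audio`
    shape (base64 data plus a format tag) from OpenAI-style chat APIs.
    """
    parts = [{"type": "text", "text": text}]
    if image_bytes is not None:
        b64 = base64.b64encode(image_bytes).decode("ascii")
        parts.append({"type": "image_url",
                      "image_url": {"url": f"data:image/png;base64,{b64}"}})
    if audio_bytes is not None:
        b64 = base64.b64encode(audio_bytes).decode("ascii")
        parts.append({"type": "input_audio",
                      "input_audio": {"data": b64, "format": "wav"}})
    return {"role": "user", "content": parts}

# Placeholder bytes; real use would read an image file and a WAV recording.
msg = multimodal_message("Summarize the chart and the narration.",
                         image_bytes=b"img", audio_bytes=b"wav")
print([p["type"] for p in msg["content"]])
```

Because each part carries its own type tag, the same structure scales from text-plus-image requests up to the three-or-more-input models ranked above.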
GPT-4o accepts text, images, and audio, and can generate text, images, and speech. Gemini 2.0 handles text, image, video, and audio input. Claude 3.5 processes text and images, with particularly strong document understanding.
Multimodal models enable applications that were previously impossible - analyzing medical images, understanding video content, processing documents with charts and tables, and building AI assistants that see and hear.