134 AI models that go beyond text-only interaction. These multimodal models can see images, generate visuals, or accept multiple input types like audio and video alongside text — enabling richer, more capable AI applications.
134 models with vision, image output, or multi-input capabilities. Average score: 54.
Models that accept images alongside text prompts for visual understanding.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.2 ProOpenAI | 90 |
| 2 | GPT-5 ProOpenAI | 90 |
| 3 | o3 ProOpenAI | 82 |
| 4 | Claude Opus 4.1Anthropic | 81 |
| 5 | o1-proOpenAI | 77 |
| 6 | Claude Opus 4Anthropic | 76 |
| 7 | o3 Deep ResearchOpenAI | 74 |
| 8 | Claude Opus 4.6Anthropic | 71 |
| 9 | Claude Opus 4.5Anthropic | 70 |
| 10 | Claude Sonnet 4.5Anthropic | 69 |
| 11 | Qwen3 VL 30B A3B ThinkingAlibaba | 69 |
| 12 | Qwen3 VL 235B A22B ThinkingAlibaba | 69 |
| 13 | GPT-5.2OpenAI | 68 |
| 14 | Gemini 3.1 Pro Preview Custom ToolsGoogle | 68 |
| 15 | Gemini 3.1 Pro PreviewGoogle | 68 |
| 16 | Gemini 3 Pro PreviewGoogle | 68 |
| 17 | Claude Sonnet 4.6Anthropic | 68 |
| 18 | GPT-5.1OpenAI | 67 |
| 19 | GPT-5.3-CodexOpenAI | 67 |
| 20 | GPT-5.2-CodexOpenAI | 67 |
| 21 | GPT-5OpenAI | 67 |
| 22 | Gemini 3 Flash PreviewGoogle | 66 |
| 23 | o4 Mini Deep ResearchOpenAI | 66 |
| 24 | GPT-5.1-Codex-MaxOpenAI | 66 |
| 25 | Gemini 3.1 Flash Lite PreviewGoogle | 66 |
| 26 | Gemini 2.5 ProGoogle | 66 |
| 27 | Gemini 2.5 Flash Lite Preview 09-2025Google | 65 |
| 28 | o1OpenAI | 65 |
| 29 | GPT-5 MiniOpenAI | 65 |
| 30 | Gemini 2.5 Pro Preview 05-06Google | 64 |
Models that can generate or edit images from text or multimodal prompts.
| # | Model | Score |
|---|---|---|
| 1 | Nano Banana 2 (Gemini 3.1 Flash Image Preview)Google | — |
| 2 | Nano Banana Pro (Gemini 3 Pro Image Preview)Google | — |
| 3 | GPT-5 Image MiniOpenAI | — |
| 4 | GPT-5 ImageOpenAI | — |
| 5 | Nano Banana (Gemini 2.5 Flash Image)Google | — |
Models accepting three or more input types such as text, image, audio, and video.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5.2 ProOpenAI | 90 |
| 2 | GPT-5 ProOpenAI | 90 |
| 3 | o3 ProOpenAI | 82 |
| 4 | Claude Opus 4.1Anthropic | 81 |
| 5 | o1-proOpenAI | 77 |
| 6 | Claude Opus 4Anthropic | 76 |
| 7 | o3 Deep ResearchOpenAI | 74 |
| 8 | Claude Opus 4.5Anthropic | 70 |
| 9 | Claude Sonnet 4.5Anthropic | 69 |
| 10 | GPT-5.2OpenAI | 68 |
| 11 | Gemini 3.1 Pro Preview Custom ToolsGoogle | 68 |
| 12 | Gemini 3.1 Pro PreviewGoogle | 68 |
| 13 | Gemini 3 Pro PreviewGoogle | 68 |
| 14 | GPT-5.1OpenAI | 67 |
| 15 | GPT-5OpenAI | 67 |
| 16 | Gemini 3 Flash PreviewGoogle | 66 |
| 17 | o4 Mini Deep ResearchOpenAI | 66 |
| 18 | Gemini 3.1 Flash Lite PreviewGoogle | 66 |
| 19 | Gemini 2.5 ProGoogle | 66 |
| 20 | Gemini 2.5 Flash Lite Preview 09-2025Google | 65 |
| 21 | o1OpenAI | 65 |
| 22 | GPT-5 MiniOpenAI | 65 |
| 23 | Gemini 2.5 Pro Preview 05-06Google | 64 |
| 24 | GPT-5 NanoOpenAI | 64 |
| 25 | Nemotron Nano 12B 2 VL (free)NVIDIA | 64 |
| 26 | Gemini 2.5 Flash LiteGoogle | 64 |
| 27 | Gemini 2.5 FlashGoogle | 64 |
| 28 | Gemini 2.5 Pro Preview 06-05Google | 64 |
| 29 | Claude Sonnet 4Anthropic | 63 |
| 30 | GPT-5.3 ChatOpenAI | 62 |
Multimodal AI models can process and generate more than just text. They understand images, diagrams, screenshots, and in some cases audio or video. This lets you build applications that interact with the world the way humans do — through multiple senses.
Vision-capable models accept images alongside text prompts. They can describe photos, extract text via OCR, analyze charts, review UI designs, and answer questions about visual content. Most frontier models now include vision as a core capability.
Some models can generate new images from text descriptions or edit existing ones. These range from dedicated image generators to unified models that handle both text and image output in a single conversation, like GPT-4o with image generation.
Document analysis and OCR, screenshot-to-code, chart interpretation, medical imaging, accessibility descriptions, visual QA, creative image generation, diagram-to-code conversion, and agentic workflows that require visual understanding.
Explore more model capabilities, rankings, and head-to-head comparisons.