Multimodal AI Models

134 AI models that go beyond text-only interaction. These multimodal models can see images, generate visuals, or accept multiple input types like audio and video alongside text — enabling richer, more capable AI applications.

134

Multimodal Models

Providers

128

Vision Models

Image Output

Free

All Multimodal Models — Ranked by Score

134 models with vision, image output, or multi-input capabilities. Average score: 54.

#	Model	Provider	Score	Modality	$/1M In	$/1M Out
1	GPT-5.2 ProOpenAI	OpenAI	90	text+image+file->text	$21.00	$168.00
2	GPT-5 ProOpenAI	OpenAI	90	text+image+file->text	$15.00	$120.00
3	o3 ProOpenAI	OpenAI	82	text+image+file->text	$20.00	$80.00
4	Claude Opus 4.1Anthropic	Anthropic	81	text+image+file->text	$15.00	$75.00
5	o1-proOpenAI	OpenAI	77	text+image+file->text	$150.00	$600.00
6	Claude Opus 4Anthropic	Anthropic	76	text+image+file->text	$15.00	$75.00
7	o3 Deep ResearchOpenAI	OpenAI	74	text+image+file->text	$10.00	$40.00
8	Claude Opus 4.6Anthropic	Anthropic	71	text+image->text	$5.00	$25.00
9	Claude Opus 4.5Anthropic	Anthropic	70	text+image+file->text	$5.00	$25.00
10	Claude Sonnet 4.5Anthropic	Anthropic	69	text+image+file->text	$3.00	$15.00
11	Qwen3 VL 30B A3B ThinkingAlibaba	Alibaba	69	text+image->text	Free	Free
12	Qwen3 VL 235B A22B ThinkingAlibaba	Alibaba	69	text+image->text	Free	Free
13	GPT-5.2OpenAI	OpenAI	68	text+image+file->text	$1.75	$14.00
14	Gemini 3.1 Pro Preview Custom ToolsGoogle	Google	68	text+image+file+audio+video->text	$2.00	$12.00
15	Gemini 3.1 Pro PreviewGoogle	Google	68	text+image+file+audio+video->text	$2.00	$12.00
16	Gemini 3 Pro PreviewGoogle	Google	68	text+image+file+audio+video->text	$2.00	$12.00
17	Claude Sonnet 4.6Anthropic	Anthropic	68	text+image->text	$3.00	$15.00
18	GPT-5.1OpenAI	OpenAI	67	text+image+file->text	$1.25	$10.00
19	GPT-5.3-CodexOpenAI	OpenAI	67	text+image->text	$1.75	$14.00
20	GPT-5.2-CodexOpenAI	OpenAI	67	text+image->text	$1.75	$14.00
21	GPT-5OpenAI	OpenAI	67	text+image+file->text	$1.25	$10.00
22	Gemini 3 Flash PreviewGoogle	Google	66	text+image+file+audio+video->text	$0.50	$3.00
23	o4 Mini Deep ResearchOpenAI	OpenAI	66	text+image+file->text	$2.00	$8.00
24	GPT-5.1-Codex-MaxOpenAI	OpenAI	66	text+image->text	$1.25	$10.00
25	Gemini 3.1 Flash Lite PreviewGoogle	Google	66	text+image+file+audio+video->text	$0.25	$1.50
26	Gemini 2.5 ProGoogle	Google	66	text+image+file+audio+video->text	$1.25	$10.00
27	Gemini 2.5 Flash Lite Preview 09-2025Google	Google	65	text+image+file+audio+video->text	$0.10	$0.40
28	o1OpenAI	OpenAI	65	text+image+file->text	$15.00	$60.00
29	GPT-5 MiniOpenAI	OpenAI	65	text+image+file->text	$0.25	$2.00
30	Gemini 2.5 Pro Preview 05-06Google	Google	64	text+image+file+audio+video->text	$1.25	$10.00
31	GPT-5 NanoOpenAI	OpenAI	64	text+image+file->text	$0.05	$0.40
32	Nemotron Nano 12B 2 VL (free)NVIDIA	NVIDIA	64	text+image+video->text	Free	Free
33	Gemini 2.5 Flash LiteGoogle	Google	64	text+image+file+audio+video->text	$0.10	$0.40
34	Grok 4.1 FastxAI	xAI	64	text+image->text	$0.20	$0.50
35	Grok 4 FastxAI	xAI	64	text+image->text	$0.20	$0.50
36	Gemini 2.5 FlashGoogle	Google	64	text+image+file+audio+video->text	$0.30	$2.50
37	Gemini 2.5 Pro Preview 06-05Google	Google	64	text+image+file+audio->text	$1.25	$10.00
38	Claude Haiku 4.5Anthropic	Anthropic	63	text+image->text	$1.00	$5.00
39	Claude Sonnet 4Anthropic	Anthropic	63	text+image+file->text	$3.00	$15.00
40	GPT-5.3 ChatOpenAI	OpenAI	62	text+image+file->text	$1.75	$14.00
41	Qwen3.5 Plus 2026-02-15Alibaba	Alibaba	62	text+image+video->text	$0.26	$1.56
42	GPT-5.2 ChatOpenAI	OpenAI	62	text+image+file->text	$1.75	$14.00
43	GPT-5.1-CodexOpenAI	OpenAI	62	text+image->text	$1.25	$10.00
44	GPT-5 CodexOpenAI	OpenAI	62	text+image->text	$1.25	$10.00
45	o3OpenAI	OpenAI	62	text+image+file->text	$2.00	$8.00
46	Qwen3.5-FlashAlibaba	Alibaba	62	text+image+video->text	$0.10	$0.40
47	o4 Mini HighOpenAI	OpenAI	61	text+image+file->text	$1.10	$4.40
48	o4 MiniOpenAI	OpenAI	61	text+image+file->text	$1.10	$4.40
49	GPT-5.1 ChatOpenAI	OpenAI	61	text+image+file->text	$1.25	$10.00
50	Seed-2.0-MiniByteDance	ByteDance	61	text+image+video->text	$0.10	$0.40
51	Qwen3.5-122B-A10BAlibaba	Alibaba	61	text+image+video->text	$0.26	$2.08
52	Qwen3.5 397B A17BAlibaba	Alibaba	61	text+image+video->text	$0.39	$2.34
53	Qwen3.5-35B-A3BAlibaba	Alibaba	61	text+image+video->text	$0.16	$1.30
54	Qwen3.5-27BAlibaba	Alibaba	61	text+image+video->text	$0.20	$1.56
55	Sonar Pro SearchPerplexity	Perplexity	61	text+image->text	$3.00	$15.00
56	Nova 2 LiteAmazon	Amazon	61	text+image+file+video->text	$0.30	$2.50
57	Seed 1.6ByteDance	ByteDance	60	text+image+video->text	$0.25	$2.00
58	Seed 1.6 FlashByteDance	ByteDance	60	text+image+video->text	$0.07	$0.30
59	GPT-5.1-Codex-MiniOpenAI	OpenAI	60	text+image->text	$0.25	$2.00
60	GPT-4.1OpenAI	OpenAI	59	text+image+file->text	$2.00	$8.00

Text + Image Input

(134)

Models that accept images alongside text prompts for visual understanding.

#	Model	Provider	Score	Context	$/1M Out
1	GPT-5.2 ProOpenAI	OpenAI	90	400K	$168.00
2	GPT-5 ProOpenAI	OpenAI	90	400K	$120.00
3	o3 ProOpenAI	OpenAI	82	200K	$80.00
4	Claude Opus 4.1Anthropic	Anthropic	81	200K	$75.00
5	o1-proOpenAI	OpenAI	77	200K	$600.00
6	Claude Opus 4Anthropic	Anthropic	76	200K	$75.00
7	o3 Deep ResearchOpenAI	OpenAI	74	200K	$40.00
8	Claude Opus 4.6Anthropic	Anthropic	71	1M	$25.00
9	Claude Opus 4.5Anthropic	Anthropic	70	200K	$25.00
10	Claude Sonnet 4.5Anthropic	Anthropic	69	1M	$15.00
11	Qwen3 VL 30B A3B ThinkingAlibaba	Alibaba	69	131K	Free
12	Qwen3 VL 235B A22B ThinkingAlibaba	Alibaba	69	131K	Free
13	GPT-5.2OpenAI	OpenAI	68	400K	$14.00
14	Gemini 3.1 Pro Preview Custom ToolsGoogle	Google	68	1.0M	$12.00
15	Gemini 3.1 Pro PreviewGoogle	Google	68	1.0M	$12.00
16	Gemini 3 Pro PreviewGoogle	Google	68	1.0M	$12.00
17	Claude Sonnet 4.6Anthropic	Anthropic	68	1M	$15.00
18	GPT-5.1OpenAI	OpenAI	67	400K	$10.00
19	GPT-5.3-CodexOpenAI	OpenAI	67	400K	$14.00
20	GPT-5.2-CodexOpenAI	OpenAI	67	400K	$14.00
21	GPT-5OpenAI	OpenAI	67	400K	$10.00
22	Gemini 3 Flash PreviewGoogle	Google	66	1.0M	$3.00
23	o4 Mini Deep ResearchOpenAI	OpenAI	66	200K	$8.00
24	GPT-5.1-Codex-MaxOpenAI	OpenAI	66	400K	$10.00
25	Gemini 3.1 Flash Lite PreviewGoogle	Google	66	1.0M	$1.50
26	Gemini 2.5 ProGoogle	Google	66	1.0M	$10.00
27	Gemini 2.5 Flash Lite Preview 09-2025Google	Google	65	1.0M	$0.40
28	o1OpenAI	OpenAI	65	200K	$60.00
29	GPT-5 MiniOpenAI	OpenAI	65	400K	$2.00
30	Gemini 2.5 Pro Preview 05-06Google	Google	64	1.0M	$10.00

Image Output

(5)

Models that can generate or edit images from text or multimodal prompts.

#	Model	Provider	Score	Context	$/1M Out
1	Nano Banana 2 (Gemini 3.1 Flash Image Preview)Google	Google	—	66K	$3.00
2	Nano Banana Pro (Gemini 3 Pro Image Preview)Google	Google	—	66K	$12.00
3	GPT-5 Image MiniOpenAI	OpenAI	—	400K	$2.00
4	GPT-5 ImageOpenAI	OpenAI	—	400K	$10.00
5	Nano Banana (Gemini 2.5 Flash Image)Google	Google	—	33K	$2.50

Multi-Input (3+ Modalities)

(65)

Models accepting three or more input types such as text, image, audio, and video.

#	Model	Provider	Score	Context	$/1M Out
1	GPT-5.2 ProOpenAI	OpenAI	90	400K	$168.00
2	GPT-5 ProOpenAI	OpenAI	90	400K	$120.00
3	o3 ProOpenAI	OpenAI	82	200K	$80.00
4	Claude Opus 4.1Anthropic	Anthropic	81	200K	$75.00
5	o1-proOpenAI	OpenAI	77	200K	$600.00
6	Claude Opus 4Anthropic	Anthropic	76	200K	$75.00
7	o3 Deep ResearchOpenAI	OpenAI	74	200K	$40.00
8	Claude Opus 4.5Anthropic	Anthropic	70	200K	$25.00
9	Claude Sonnet 4.5Anthropic	Anthropic	69	1M	$15.00
10	GPT-5.2OpenAI	OpenAI	68	400K	$14.00
11	Gemini 3.1 Pro Preview Custom ToolsGoogle	Google	68	1.0M	$12.00
12	Gemini 3.1 Pro PreviewGoogle	Google	68	1.0M	$12.00
13	Gemini 3 Pro PreviewGoogle	Google	68	1.0M	$12.00
14	GPT-5.1OpenAI	OpenAI	67	400K	$10.00
15	GPT-5OpenAI	OpenAI	67	400K	$10.00
16	Gemini 3 Flash PreviewGoogle	Google	66	1.0M	$3.00
17	o4 Mini Deep ResearchOpenAI	OpenAI	66	200K	$8.00
18	Gemini 3.1 Flash Lite PreviewGoogle	Google	66	1.0M	$1.50
19	Gemini 2.5 ProGoogle	Google	66	1.0M	$10.00
20	Gemini 2.5 Flash Lite Preview 09-2025Google	Google	65	1.0M	$0.40
21	o1OpenAI	OpenAI	65	200K	$60.00
22	GPT-5 MiniOpenAI	OpenAI	65	400K	$2.00
23	Gemini 2.5 Pro Preview 05-06Google	Google	64	1.0M	$10.00
24	GPT-5 NanoOpenAI	OpenAI	64	400K	$0.40
25	Nemotron Nano 12B 2 VL (free)NVIDIA	NVIDIA	64	128K	Free
26	Gemini 2.5 Flash LiteGoogle	Google	64	1.0M	$0.40
27	Gemini 2.5 FlashGoogle	Google	64	1.0M	$2.50
28	Gemini 2.5 Pro Preview 06-05Google	Google	64	1.0M	$10.00
29	Claude Sonnet 4Anthropic	Anthropic	63	1M	$15.00
30	GPT-5.3 ChatOpenAI	OpenAI	62	128K	$14.00

What Are Multimodal AI Models?

Beyond Text-Only

Multimodal AI models can process and generate more than just text. They understand images, diagrams, screenshots, and in some cases audio or video. This lets you build applications that interact with the world the way humans do — through multiple senses.

Vision (Image Input)

Vision-capable models accept images alongside text prompts. They can describe photos, extract text via OCR, analyze charts, review UI designs, and answer questions about visual content. Most frontier models now include vision as a core capability.

Image Generation (Image Output)

Some models can generate new images from text descriptions or edit existing ones. These range from dedicated image generators to unified models that handle both text and image output in a single conversation, like GPT-4o with image generation.

Use Cases

Document analysis and OCR, screenshot-to-code, chart interpretation, medical imaging, accessibility descriptions, visual QA, creative image generation, diagram-to-code conversion, and agentic workflows that require visual understanding.

Explore more model capabilities, rankings, and head-to-head comparisons.

Vision Models Reasoning Models Compare Models Full Leaderboard