161 AI models go beyond text-only interaction. These multimodal models can see images, generate visuals, or accept multiple input types such as audio and video alongside text, enabling richer, more capable AI applications. Average score: 65.
Models that accept images alongside text prompts for visual understanding.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 94 |
| 2 | GPT-5.4 | OpenAI | 94 |
| 3 | GPT-5.4 Mini | OpenAI | 93 |
| 4 | GPT-5.2 Pro | OpenAI | 93 |
| 5 | GPT-5.2 | OpenAI | 93 |
| 6 | Claude Opus 4.6 | Anthropic | 92 |
| 7 | GPT-5 Pro | OpenAI | 92 |
| 8 | o3 Deep Research | OpenAI | 92 |
| 9 | Claude Opus 4.5 | Anthropic | 90 |
| 10 | Gemini 3 Pro Preview | Google | 90 |
| 11 | GPT-5 | OpenAI | 90 |
| 12 | Gemini 3 Flash Preview | Google | 89 |
| 13 | Claude Sonnet 4.6 | Anthropic | 89 |
| 14 | Claude Sonnet 4.5 | Anthropic | 89 |
| 15 | o3 Pro | OpenAI | 88 |
| 16 | Grok 4.1 Fast | xAI | 87 |
| 17 | Grok 4 | xAI | 86 |
| 18 | Grok 4.20 Beta | xAI | 86 |
| 19 | o3 | OpenAI | 86 |
| 20 | Gemini 3.1 Pro Preview | Google | 86 |
| 21 | GPT-5.1 | OpenAI | 85 |
| 22 | MiMo-V2-Omni | Xiaomi | 85 |
| 23 | GPT-5.4 Nano | OpenAI | 85 |
| 24 | Seed-2.0-Lite | ByteDance | 85 |
| 25 | GPT-5.3 Chat | OpenAI | 85 |
| 26 | Seed-2.0-Mini | ByteDance | 85 |
| 27 | Gemini 3.1 Pro Preview Custom Tools | Google | 85 |
| 28 | GPT-5.3-Codex | OpenAI | 85 |
| 29 | Qwen3.5 Plus 2026-02-15 | Alibaba | 85 |
| 30 | Kimi K2.5 | Moonshot AI | 85 |
Models that can generate or edit images from text or multimodal prompts.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Nano Banana 2 (Gemini 3.1 Flash Image Preview) | Google | — |
| 2 | Nano Banana Pro (Gemini 3 Pro Image Preview) | Google | — |
| 3 | GPT-5 Image Mini | OpenAI | — |
| 4 | GPT-5 Image | OpenAI | — |
| 5 | Nano Banana (Gemini 2.5 Flash Image) | Google | — |
| 6 | Midjourney v6.1 | Midjourney | — |
| 7 | DALL-E 3 | OpenAI | — |
| 8 | Stable Diffusion 3.5 | Stability AI | — |
| 9 | FLUX.1 Pro | Black Forest Labs | — |
| 10 | Ideogram 2.0 | Ideogram | — |
| 11 | Recraft V3 | Recraft | — |
| 12 | Imagen 3 | Google | — |
| 13 | Adobe Firefly 3 | Adobe | — |
| 14 | Leonardo Phoenix | Leonardo AI | — |
Models accepting three or more input types such as text, image, audio, and video.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 94 |
| 2 | GPT-5.4 | OpenAI | 94 |
| 3 | GPT-5.4 Mini | OpenAI | 93 |
| 4 | GPT-5.2 Pro | OpenAI | 93 |
| 5 | GPT-5.2 | OpenAI | 93 |
| 6 | GPT-5 Pro | OpenAI | 92 |
| 7 | o3 Deep Research | OpenAI | 92 |
| 8 | Claude Opus 4.5 | Anthropic | 90 |
| 9 | Gemini 3 Pro Preview | Google | 90 |
| 10 | GPT-5 | OpenAI | 90 |
| 11 | Gemini 3 Flash Preview | Google | 89 |
| 12 | Claude Sonnet 4.5 | Anthropic | 89 |
| 13 | o3 Pro | OpenAI | 88 |
| 14 | o3 | OpenAI | 86 |
| 15 | Gemini 3.1 Pro Preview | Google | 86 |
| 16 | GPT-5.1 | OpenAI | 85 |
| 17 | MiMo-V2-Omni | Xiaomi | 85 |
| 18 | GPT-5.4 Nano | OpenAI | 85 |
| 19 | Seed-2.0-Lite | ByteDance | 85 |
| 20 | GPT-5.3 Chat | OpenAI | 85 |
| 21 | Seed-2.0-Mini | ByteDance | 85 |
| 22 | Gemini 3.1 Pro Preview Custom Tools | Google | 85 |
| 23 | GPT-5.3-Codex | OpenAI | 85 |
| 24 | Qwen3.5 Plus 2026-02-15 | Alibaba | 85 |
| 25 | Seed 1.6 Flash | ByteDance | 85 |
| 26 | Seed 1.6 | ByteDance | 85 |
| 27 | GPT-5.1 Chat | OpenAI | 85 |
| 28 | o4 Mini Deep Research | OpenAI | 85 |
| 29 | o4 Mini High | OpenAI | 85 |
| 30 | Gemini 2.5 Pro | Google | 85 |
Multimodal AI models can process and generate more than just text. They understand images, diagrams, screenshots, and in some cases audio or video. This lets you build applications that interact with the world the way humans do - through multiple senses.
Vision-capable models accept images alongside text prompts. They can describe photos, extract text via OCR, analyze charts, review UI designs, and answer questions about visual content. Most frontier models now include vision as a core capability.
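As a minimal sketch of how an image is paired with a text prompt, the snippet below builds a request message in the content-parts shape used by OpenAI-style chat APIs, where the image travels inline as a base64 data URL. It uses only the standard library, and the placeholder byte string stands in for a real image file read from disk.

```python
import base64
import json

def build_vision_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a single user message pairing a text prompt with an inline image.

    Follows the content-parts format of OpenAI-style chat APIs: the image is
    embedded as a base64 data URL next to the text part.
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# In real use, image_bytes would come from open("chart.png", "rb").read().
msg = build_vision_message("What does this chart show?", b"\x89PNG placeholder")
print(json.dumps(msg)[:80])
```

The same message dict is what you would place in the `messages` list of a chat completion request; only the model name and API client differ between providers.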
Some models can generate new images from text descriptions or edit existing ones. These range from dedicated image generators to unified models that handle both text and image output in a single conversation, like GPT-4o with image generation.
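On the output side, image APIs commonly return the generated picture as a base64-encoded string (for example, a `b64_json` field in OpenAI's Images API). The sketch below, using only the standard library, shows the decode-and-save step; the response payload here is simulated rather than fetched from a real API.

```python
import base64

def save_generated_image(b64_json: str, path: str) -> int:
    """Decode a base64-encoded generated image and write the raw bytes to disk.

    Returns the number of bytes written, which is handy for sanity checks.
    """
    raw = base64.b64decode(b64_json)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Simulated response; a real API call returns a much larger base64 string.
fake_response = {"data": [{"b64_json": base64.b64encode(b"\x89PNG fake").decode()}]}
n = save_generated_image(fake_response["data"][0]["b64_json"], "out.png")
print(n, "bytes written")
```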
Common use cases include document analysis and OCR, screenshot-to-code generation, chart interpretation, medical imaging, accessibility descriptions, visual question answering, creative image generation, diagram-to-code conversion, and agentic workflows that require visual understanding.
Multimodal AI models can process multiple types of input - text, images, audio, and video - in a single interaction. They understand context across modalities, like analyzing a chart image while answering text questions about it.
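Mixing modalities in one interaction comes down to assembling several typed content parts into a single message. The sketch below extends the vision shape with an audio part, following the `input_audio` format used by OpenAI's audio-capable chat models; it is an illustrative payload builder, not a complete client.

```python
import base64

def multimodal_message(text: str, image_bytes: bytes = None, audio_bytes: bytes = None) -> dict:
    """Assemble one user message mixing text, image, and audio content parts.

    Image parts use a base64 data URL; audio parts use the `input_audio`
    shape (base64 data plus a format tag) from OpenAI-style chat APIs.
    """
    parts = [{"type": "text", "text": text}]
    if image_bytes is not None:
        b64 = base64.b64encode(image_bytes).decode("ascii")
        parts.append({"type": "image_url",
                      "image_url": {"url": f"data:image/png;base64,{b64}"}})
    if audio_bytes is not None:
        b64 = base64.b64encode(audio_bytes).decode("ascii")
        parts.append({"type": "input_audio",
                      "input_audio": {"data": b64, "format": "wav"}})
    return {"role": "user", "content": parts}

# Placeholder bytes; real use would read an image file and a WAV recording.
msg = multimodal_message("Summarize the chart and the narration.",
                         image_bytes=b"img", audio_bytes=b"wav")
print([p["type"] for p in msg["content"]])
```

Because each part carries its own type tag, the same structure scales from text-plus-image requests up to the three-or-more-input models ranked above.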
GPT-4o accepts text, images, and audio, and can generate text, images, and speech. Gemini 2.0 handles text, image, video, and audio input. Claude 3.5 processes text and images, with particularly strong document understanding.
Multimodal models enable applications that were previously impossible - analyzing medical images, understanding video content, processing documents with charts and tables, and building AI assistants that see and hear.