How do reasoning models stack up against standard LLMs? This benchmark compares 149 reasoning models against 158 standard models on composite score, pricing, and capabilities, helping you decide when chain-of-thought reasoning is worth the trade-off.
The comparison covers reasoning models from 26 providers; on average, reasoning models score 19 points higher than standard models.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.3 Chat | OpenAI | 85 |
| 2 | GPT-5.1 Chat | OpenAI | 85 |
| 3 | GPT-5.2 Chat | OpenAI | 83 |
| 4 | Qwen3 VL 32B Instruct | Alibaba | 81 |
| 5 | Qwen3 VL 8B Instruct | Alibaba | 81 |
| 6 | Qwen3 VL 30B A3B Instruct | Alibaba | 81 |
| 7 | GPT-4.1 Nano | OpenAI | 81 |
| 8 | Qwen3 Coder Plus | Alibaba | 79 |
| 9 | Qwen3 Coder Flash | Alibaba | 78 |
| 10 | Nova Premier 1.0 | Amazon | 78 |
Chain-of-thought (CoT) prompting enables AI models to break down complex problems into intermediate steps before producing a final answer. Models like OpenAI o1 and DeepSeek R1 internalize this process, generating hidden reasoning traces that dramatically improve accuracy on math, logic, and multi-step tasks compared to direct answering.
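A minimal sketch of the difference between direct and chain-of-thought prompting. The helper functions and the sample question are illustrative, not any provider's API; a reasoning model effectively builds the second style of request internally before answering.

```python
# Illustrative only: contrast a direct prompt with a chain-of-thought prompt.
# These are plain prompt-string builders, not a specific provider's SDK.

def direct_prompt(question: str) -> str:
    """Ask for the final answer with no intermediate reasoning."""
    return f"{question}\nAnswer with the final result only."

def cot_prompt(question: str) -> str:
    """Ask the model to write out intermediate steps before answering."""
    return (
        f"{question}\n"
        "Think through the problem step by step, showing each "
        "intermediate calculation, then state the final answer "
        "on a line beginning with 'Answer:'."
    )

question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"
print(direct_prompt(question))
print(cot_prompt(question))
```

Models like o1 and R1 make the second pattern implicit: the "step by step" trace is generated as hidden reasoning tokens rather than requested in the prompt.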
Reasoning models shine on tasks that require multiple logical steps: mathematical proofs, complex coding challenges, scientific analysis, strategic planning, and any problem where standard models tend to hallucinate or skip steps. For simple Q&A or creative writing, standard models are often faster and equally effective.
Reasoning models consume more tokens and take longer to respond because they generate internal thinking traces. This trade-off is worthwhile when correctness matters more than latency - for example in code generation, financial analysis, or exam-style problems. For real-time chat, standard models remain the better choice.
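A back-of-envelope calculation makes the cost side of this trade-off concrete. The prices and token counts below are illustrative assumptions, not any provider's actual rates; the key point is that hidden reasoning tokens are typically billed as output tokens even though the user never sees them.

```python
# Assumed numbers for illustration: $10 per million output tokens,
# ~2,000 hidden reasoning tokens for a reasoning model's response.

def completion_cost(output_tokens: int, reasoning_tokens: int,
                    price_per_million: float) -> float:
    """Hidden reasoning tokens are billed like visible output tokens."""
    return (output_tokens + reasoning_tokens) * price_per_million / 1_000_000

# Standard model: 300 visible output tokens, no hidden reasoning.
standard = completion_cost(300, 0, price_per_million=10.0)
# Reasoning model: same visible answer plus ~2,000 reasoning tokens.
reasoning = completion_cost(300, 2000, price_per_million=10.0)

print(f"standard:  ${standard:.4f}")   # $0.0030
print(f"reasoning: ${reasoning:.4f}")  # $0.0230
```

Under these assumptions the reasoning model's response costs roughly 7-8x more for the same visible answer, which is why it pays off only where correctness matters more than cost or latency.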
The reasoning model landscape is evolving rapidly. OpenAI's o1 and o3 series led the way, followed by DeepSeek R1 bringing open-source reasoning. Google, Anthropic, and other providers have since introduced their own reasoning-capable models, driving down costs and expanding access to chain-of-thought capabilities.
AI reasoning benchmarks test a model's ability to solve complex problems requiring logical thinking, mathematical reasoning, scientific analysis, and multi-step problem solving - tasks that go beyond simple pattern matching.
DeepSeek R1, OpenAI o3, and Claude with extended thinking lead on reasoning benchmarks. These models use chain-of-thought processing to break down complex problems into steps, achieving significantly higher accuracy.
Key reasoning benchmarks include GPQA Diamond (graduate-level science), MATH-500 (mathematical reasoning), AIME (competition math), ARC Challenge (science questions), and GSM8K (grade-school math). Each tests different aspects of reasoning ability.