How do reasoning models stack up against standard LLMs? This benchmark compares 132 reasoning models against 165 standard models on composite score, pricing, and capabilities — helping you decide when chain-of-thought reasoning is worth the trade-off.
The comparison covers reasoning models from 23 providers; on average, reasoning models score 14 points higher than standard models.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | o1 | OpenAI | 65 |
| 2 | GPT-5.3 Chat | OpenAI | 62 |
| 3 | GPT-5.2 Chat | OpenAI | 62 |
| 4 | GPT-5.1 Chat | OpenAI | 61 |
| 5 | GPT-4.1 | OpenAI | 59 |
| 6 | GPT-4.1 Mini | OpenAI | 58 |
| 7 | GPT-4.1 Nano | OpenAI | 58 |
| 8 | GPT-5 Chat | OpenAI | 57 |
| 9 | Nova Premier 1.0 | Amazon | 56 |
| 10 | Gemma 3 27B (free) | Google | 56 |
Chain-of-thought (CoT) prompting enables AI models to break down complex problems into intermediate steps before producing a final answer. Models like OpenAI o1 and DeepSeek R1 internalize this process, generating hidden reasoning traces that dramatically improve accuracy on math, logic, and multi-step tasks compared to direct answering.
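To make the contrast concrete, here is a minimal sketch of the difference between direct prompting and explicit chain-of-thought prompting. The helper function and its wording are illustrative assumptions, not any provider's API; models like o1 apply this kind of reasoning internally rather than via prompt text.

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Return a prompt string, optionally requesting step-by-step reasoning.

    Illustrative only: the exact instruction wording is a common CoT
    pattern, not a specific provider's recommended phrasing.
    """
    if chain_of_thought:
        return (
            f"{question}\n"
            "Work through the problem step by step, showing each "
            "intermediate deduction, then state the final answer on "
            "the last line."
        )
    # Direct answering: the model responds without explicit intermediate steps.
    return question


direct = build_prompt("A train travels 120 km in 1.5 hours. What is its speed?")
cot = build_prompt(
    "A train travels 120 km in 1.5 hours. What is its speed?",
    chain_of_thought=True,
)
```

With dedicated reasoning models, the equivalent of the `chain_of_thought=True` branch happens inside the model as a hidden trace, so no special prompt wording is needed.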
Reasoning models shine on tasks that require multiple logical steps: mathematical proofs, complex coding challenges, scientific analysis, strategic planning, and any problem where standard models tend to hallucinate or skip steps. For simple Q&A or creative writing, standard models are often faster and equally effective.
Reasoning models consume more tokens and take longer to respond because they generate internal thinking traces. This trade-off is worthwhile when correctness matters more than latency — for example in code generation, financial analysis, or exam-style problems. For real-time chat, standard models remain the better choice.
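The token overhead can be estimated directly: hidden reasoning traces are typically billed as output tokens even though they never appear in the visible answer. The sketch below uses hypothetical token counts and a hypothetical price; real prices and trace lengths vary widely by model and task.

```python
def estimate_cost(prompt_tokens: int, answer_tokens: int,
                  reasoning_tokens: int, price_per_1k: float) -> float:
    """Estimated cost in dollars for one request.

    Assumption for this sketch: reasoning (thinking-trace) tokens are
    billed at the same rate as visible output tokens.
    """
    billable = prompt_tokens + answer_tokens + reasoning_tokens
    return billable * price_per_1k / 1000


# Hypothetical comparison: same question with and without a reasoning trace.
standard = estimate_cost(200, 300, reasoning_tokens=0, price_per_1k=0.01)
reasoning = estimate_cost(200, 300, reasoning_tokens=4000, price_per_1k=0.01)
overhead = reasoning / standard  # how much more the reasoning run costs
```

With these illustrative numbers the reasoning run costs roughly 9x as much, which is why the trade-off only pays off when correctness matters more than latency or price.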
The reasoning model landscape is evolving rapidly. OpenAI's o1 and o3 series led the way, followed by DeepSeek R1 bringing open-source reasoning. Google, Anthropic, and other providers have since introduced their own reasoning-capable models, driving down costs and expanding access to chain-of-thought capabilities.