AI models ranked by reasoning ability using GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam scores.
- Top-ranked model: GPT-4o (score: 96.4)
- Average score across all ranked models: 33.3
- Models with benchmark data: 17
| # | Model | Vendor | Score |
|---|---|---|---|
| 1 | GPT-4o | OpenAI | 96.4 |
| 2 | GPT-4o-mini | OpenAI | 96.4 |
| 3 | Llama 3.1 70B Instruct | Meta | 94.8 |
| 4 | Gemini 2.5 Pro | Google | 35.2 |
| 5 | Claude Opus 4.5 | Anthropic | 32.1 |
| 6 | Qwen2.5 72B Instruct | Alibaba | 31.7 |
| 7 | Phi 4 | Microsoft | 31.6 |
| 8 | Qwen2.5 Coder 32B Instruct | Alibaba | 26.2 |
| 9 | o3 Mini | OpenAI | 20.3 |
| 10 | Gemma 2 9B | Google | 18.4 |
| 11 | Command R7B (12-2024) | Cohere | 17.2 |
| 12 | Qwen2.5 7B Instruct | Alibaba | 15.3 |
| 13 | Llama 3.1 8B Instruct | Meta | 15.2 |
| 14 | Qwen2.5 Coder 7B Instruct | Alibaba | 13.5 |
| 15 | Llama 3.2 3B Instruct | Meta | 12.2 |
| 16 | Llama 3 8B Instruct | Meta | 7.5 |
| 17 | QwQ 32B | Alibaba | 1.9 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
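The re-normalization step is the part most easily gotten wrong, so here is a minimal Python sketch of the scoring rule described above. The equal benchmark weights are an assumption for illustration (the actual weights are not published here), and the function name `weighted_score` is hypothetical.

```python
# Hypothetical equal weights -- the actual benchmark weighting is not published.
WEIGHTS = {
    "GPQA": 0.25,
    "ARC-Challenge": 0.25,
    "BIG-Bench Hard": 0.25,
    "Humanity's Last Exam": 0.25,
}

def weighted_score(results: dict[str, float]) -> float:
    """Weighted average over a model's available benchmarks (0-100 scale).

    Missing benchmarks are skipped, and the weights of the remaining
    benchmarks are re-normalized so they still sum to 1.
    """
    available = {b: s for b, s in results.items() if b in WEIGHTS}
    if not available:
        raise ValueError("no benchmark data for this model")
    total_weight = sum(WEIGHTS[b] for b in available)
    return sum(WEIGHTS[b] * s for b, s in available.items()) / total_weight

# Example: a model with only two of the four benchmarks reported.
# (0.25 * 50.0 + 0.25 * 70.0) / (0.25 + 0.25) = 60.0
print(weighted_score({"GPQA": 50.0, "BIG-Bench Hard": 70.0}))  # -> 60.0
```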
Based on our benchmark analysis, GPT-4o by OpenAI is currently the #1 ranked model for reasoning, with a weighted score of 96.4/100.
Models are ranked using a weighted average of their GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam benchmark scores. All scores are normalized to a 0-100 scale.
We currently rank 17 models that have relevant benchmark data for reasoning tasks.