AI models ranked by mathematical reasoning performance, using MATH-500, GSM8K, and AIME 2024 benchmark scores.
Top model: o3 (score: 98)
Average score: 84.6 across all ranked models
Models ranked: 38 with benchmark data
| # | Model | Organization | Score |
|---|---|---|---|
| 1 | o3 | OpenAI | 98 |
| 2 | o4 Mini | OpenAI | 95.6 |
| 3 | GPT-5.4 | OpenAI | 95.5 |
| 4 | Gemini 3 Pro Preview | Google | 94.7 |
| 5 | GPT-5.2 | OpenAI | 94 |
| 6 | Gemini 2.5 Pro | Google | 93.8 |
| 7 | o3 Mini | OpenAI | 93.4 |
| 8 | DeepSeek V3 | DeepSeek | 93 |
| 9 | GPT-5 | OpenAI | 92.5 |
| 10 | Grok 4 | xAI | 92 |
| 11 | DeepSeek V3 0324 | DeepSeek | 92 |
| 12 | R1 0528 | DeepSeek | 91.9 |
| 13 | Gemini 2.0 Flash | Google | 91.5 |
| 14 | o1 | OpenAI | 90.8 |
| 15 | R1 | DeepSeek | 89.8 |
| 16 | Gemini 3 Flash Preview | Google | 88 |
| 17 | Claude 3.5 Sonnet | Anthropic | 86.1 |
| 18 | Gemini 2.5 Flash | Google | 85.8 |
| 19 | Gemma 2 27B | Google | 85.3 |
| 20 | Claude Opus 4.6 | Anthropic | 85.1 |
| 21 | GPT-4o | OpenAI | 84.8 |
| 22 | GPT-4 Turbo | OpenAI | 81.9 |
| 23 | Claude Opus 4.5 | Anthropic | 81.2 |
| 24 | Llama 4 Maverick | Meta | 81 |
| 25 | Phi 4 | Microsoft | 80.4 |
| 26 | GPT-4o-mini | OpenAI | 80.1 |
| 27 | Llama 3.1 70B Instruct | Meta | 79.6 |
| 28 | GPT-4.1 | OpenAI | 78.5 |
| 29 | Claude Opus 4 | Anthropic | 78.3 |
| 30 | Llama 3.3 70B Instruct | Meta | 77 |
| 31 | Claude Sonnet 4.6 | Anthropic | 76.6 |
| 32 | Mistral Large | Mistral AI | 76 |
| 33 | Claude Sonnet 4.5 | Anthropic | 73.1 |
| 34 | Claude Haiku 4.5 | Anthropic | 72.5 |
| 35 | Grok 3 | xAI | 72.1 |
| 36 | Claude Sonnet 4 | Anthropic | 71.4 |
| 37 | Claude 3.7 Sonnet | Anthropic | 70.5 |
| 38 | Claude 3.5 Haiku | Anthropic | 69.2 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
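A minimal sketch of that scoring rule, for readers who want to reproduce it: the per-benchmark weights below are hypothetical (the page does not publish them), so equal weighting is assumed purely for illustration, and all inputs are assumed to already be on the 0-100 scale.

```python
# Hypothetical equal weights; the actual weights used for this
# ranking are not published.
WEIGHTS = {"MATH-500": 1.0, "GSM8K": 1.0, "AIME 2024": 1.0}

def weighted_score(results: dict[str, float | None]) -> float:
    """Weighted average over a model's available benchmark scores.

    Benchmarks the model is missing (value None) are dropped, and the
    weights are re-normalized across the benchmarks that remain.
    """
    available = {k: v for k, v in results.items() if v is not None}
    total_weight = sum(WEIGHTS[k] for k in available)
    return sum(WEIGHTS[k] * v for k, v in available.items()) / total_weight

# A model with all three benchmarks vs. one missing AIME 2024:
print(weighted_score({"MATH-500": 96.0, "GSM8K": 95.0, "AIME 2024": 88.0}))  # 93.0
print(weighted_score({"MATH-500": 96.0, "GSM8K": 95.0, "AIME 2024": None}))  # 95.5
```

Note how re-normalization keeps a model with a missing benchmark comparable to fully evaluated models, rather than penalizing the gap as a zero.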
Based on our benchmark analysis, o3 by OpenAI is currently the #1 ranked model for math, with a weighted score of 98/100.
Models are ranked using a weighted average of MATH-500, GSM8K, and AIME 2024 benchmark scores. All scores are normalized to a 0-100 scale.
We currently rank 38 models that have relevant benchmark data for math tasks.