Competition-level mathematics across algebra, geometry, number theory, counting/probability, intermediate algebra, and precalculus. MATH-500 is a 500-problem subset of the MATH benchmark, curated by OpenAI for consistent evaluation.
Why it matters: It tests genuine mathematical reasoning, not just pattern matching. Reasoning models (OpenAI o1, DeepSeek-R1) dramatically outperform standard models here.
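To make the dataset's shape concrete, the sketch below loads a public copy and tallies problems by subject. The dataset id ("HuggingFaceH4/MATH-500") and the field names ("problem", "answer", "subject") are assumptions about a common Hugging Face Hub mirror, not something this page's scoring pipeline prescribes.

```python
# Sketch: inspect the MATH-500 problem set. Assumes the public
# "HuggingFaceH4/MATH-500" mirror on the Hugging Face Hub; the field
# names ("problem", "answer", "subject") are assumptions about that mirror.
from collections import Counter

from datasets import load_dataset

data = load_dataset("HuggingFaceH4/MATH-500", split="test")
print(len(data))                              # 500 problems
print(Counter(ex["subject"] for ex in data))  # problem count per subject area
print(data[0]["problem"][:80], "->", data[0]["answer"])
```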
Top Model: o3-mini (97.9%)
Average Score: 80.0% (across 24 models)
Models Tested: 24 (metric: accuracy; see the scoring sketch below)
Human Baseline: — (score range: 0%–100%)
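Since the metric is plain accuracy, a score like 97.9% is simply the fraction of the 500 problems where the model's final answer matches the reference answer. Below is a minimal scoring sketch, assuming the same dataset mirror as above and a caller-supplied model wrapper; real harnesses additionally normalize LaTeX answers (e.g., \frac{1}{2} vs. 1/2) before comparing.

```python
from datasets import load_dataset

def math500_accuracy(solve) -> float:
    """Fraction of MATH-500 problems answered correctly.

    `solve` is your model wrapper: it maps a problem statement (str) to
    the model's final answer (str). Exact string match is a simplification;
    production harnesses normalize LaTeX answers before comparing.
    """
    data = load_dataset("HuggingFaceH4/MATH-500", split="test")
    correct = sum(
        solve(ex["problem"]).strip() == ex["answer"].strip() for ex in data
    )
    return correct / len(data)

# Example: a model answering 490 of the 500 problems correctly scores
# 490 / 500 = 0.98, reported as 98.0% accuracy.
```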
All models with a reported MATH-500 score, ranked by accuracy, highest first.
MATH-500 is a standardized evaluation that measures how accurately AI models solve competition-level math problems. Because every model is scored on the same 500 problems, results are directly comparable across models, helping developers choose the right model for their needs.
o3-mini currently holds the top score on the MATH-500 benchmark at 97.9%. See our full rankings table above for the complete leaderboard of all 24 models.
We update benchmark data from multiple sources, including the HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While MATH-500 is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and task-specific requirements. We recommend using our composite score, which weights multiple benchmarks along with practical factors.
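As a sketch of what "weighting multiple benchmarks" means in practice: normalize each benchmark score to [0, 1], then take a weighted average over the benchmarks a model was actually tested on. The benchmark names, weights, and the non-MATH-500 score below are hypothetical placeholders, not our published formula.

```python
# Hypothetical composite-score sketch; the weights and the "knowledge"
# score below are illustrative placeholders, not our published formula.
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks this model has scores for."""
    common = scores.keys() & weights.keys()
    total = sum(weights[b] for b in common)
    return sum(scores[b] * weights[b] for b in common) / total

weights = {"MATH-500": 0.4, "coding": 0.3, "knowledge": 0.3}  # hypothetical
model = {"MATH-500": 0.979, "knowledge": 0.85}                # 0.85 is made up
print(round(composite_score(model, weights), 3))  # 0.924, over the two available
```

Restricting the average to the benchmarks a model actually has scores for keeps models with missing evaluations comparable, at the cost of composites that rest on fewer data points.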