Competition-level mathematics across algebra, geometry, number theory, counting/probability, intermediate algebra, and precalculus. MATH-500 is a 500-problem subset of the MATH benchmark, curated by OpenAI for consistent evaluation.
Why it matters: It tests genuine mathematical reasoning, not just pattern matching. Reasoning models (OpenAI o1, DeepSeek-R1) dramatically outperform standard models here.
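To make the dataset's shape concrete, the sketch below loads a public copy and tallies problems by subject. The dataset id ("HuggingFaceH4/MATH-500") and the field names ("problem", "answer", "subject") are assumptions about a common Hugging Face Hub mirror, not something this page's scoring pipeline prescribes.

```python
# Sketch: inspect the MATH-500 problem set. Assumes the public
# "HuggingFaceH4/MATH-500" mirror on the Hugging Face Hub; the field
# names ("problem", "answer", "subject") are assumptions about that mirror.
from collections import Counter

from datasets import load_dataset

data = load_dataset("HuggingFaceH4/MATH-500", split="test")
print(len(data))                              # 500 problems
print(Counter(ex["subject"] for ex in data))  # problem count per subject area
print(data[0]["problem"][:80], "->", data[0]["answer"])
```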
Top Model: o3-mini (97.9%)
Average Score: 80.0% (across 24 models)
Models Tested: 24 (metric: accuracy; see the scoring sketch below)
Human Baseline: — (score range: 0%–100%)
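Since the metric is plain accuracy, a score like 97.9% is simply the fraction of the 500 problems where the model's final answer matches the reference answer. Below is a minimal scoring sketch, assuming the same dataset mirror as above and a caller-supplied model wrapper; real harnesses additionally normalize LaTeX answers (e.g., \frac{1}{2} vs. 1/2) before comparing.

```python
from datasets import load_dataset

def math500_accuracy(solve) -> float:
    """Fraction of MATH-500 problems answered correctly.

    `solve` is your model wrapper: it maps a problem statement (str) to
    the model's final answer (str). Exact string match is a simplification;
    production harnesses normalize LaTeX answers before comparing.
    """
    data = load_dataset("HuggingFaceH4/MATH-500", split="test")
    correct = sum(
        solve(ex["problem"]).strip() == ex["answer"].strip() for ex in data
    )
    return correct / len(data)

# Example: a model answering 490 of the 500 problems correctly scores
# 490 / 500 = 0.98, reported as 98.0% accuracy.
```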
All models with a reported MATH-500 score, ranked by accuracy, highest first.
MATH-500 is a standardized evaluation that measures how accurately AI models solve competition-level math problems. Because every model is scored on the same 500 problems, results are directly comparable across models, helping developers choose the right model for their needs.
o3-mini currently holds the top score on the MATH-500 benchmark at 97.9%. See our full rankings table above for the complete leaderboard of all 24 models.
We update benchmark data from multiple sources, including the HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While MATH-500 is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and task-specific requirements. We recommend using our composite score, which weights multiple benchmarks along with practical factors.
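As a sketch of what "weighting multiple benchmarks" means in practice: normalize each benchmark score to [0, 1], then take a weighted average over the benchmarks a model was actually tested on. The benchmark names, weights, and the non-MATH-500 score below are hypothetical placeholders, not our published formula.

```python
# Hypothetical composite-score sketch; the weights and the "knowledge"
# score below are illustrative placeholders, not our published formula.
def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the benchmarks this model has scores for."""
    common = scores.keys() & weights.keys()
    total = sum(weights[b] for b in common)
    return sum(scores[b] * weights[b] for b in common) / total

weights = {"MATH-500": 0.4, "coding": 0.3, "knowledge": 0.3}  # hypothetical
model = {"MATH-500": 0.979, "knowledge": 0.85}                # 0.85 is made up
print(round(composite_score(model, weights), 3))  # 0.924, over the two available
```

Restricting the average to the benchmarks a model actually has scores for keeps models with missing evaluations comparable, at the cost of composites that rest on fewer data points.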