Compare top models across the benchmark suite that best represents math performance. Use this page as the fastest way to inspect the relevant tests, then jump into the full matrix when you want broader context.
- Benchmarks in category: 3
- Models with coverage: 25
- Benchmarks with human baseline: 0
- Saturated benchmarks: 1
The current benchmark set in this category, with context on what each test captures.
- Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
- Tests mathematical reasoning at competition level. Reasoning models achieve 70-90% while standard models struggle below 30%. Best differentiator for math ability.
- Useful baseline for math ability, but now saturated: top models exceed 95%. More useful for evaluating smaller or open-source models.
Competition-level mathematics across algebra, geometry, number theory, counting/probability, intermediate algebra, and precalculus. 500-problem subset used by OpenAI for consistent evaluation.
Why it matters: Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
| # | Model | Score |
|---|---|---|
| 1 | 🥇 o3-mini | 97.9% |
| 2 | 🥈 DeepSeek R1 | 97.3% |
| 3 | 🥉 o1 | 96.4% |
| 4 | Gemini 2.5 Pro | 95.2% |
| 5 | DeepSeek V3 | 90.2% |
| 6 | o1-mini | 90.0% |
| 7 | Gemini 2.0 Flash | 89.7% |
| 8 | Claude Opus 4.5 | 88.1% |
| 9 | Qwen 2.5 Coder 32B | 83.5% |
| 10 | Qwen 2.5 72B | 83.1% |
| 11 | Claude 3.7 Sonnet | 82.2% |
| 12 | Claude 3.5 Sonnet | 78.3% |
| 13 | Llama 3.3 70B | 77.0% |
| 14 | GPT-4o | 76.6% |
| 15 | Grok 2 | 76.1% |
| 16 | Mistral Large 2 | 76.0% |
| 17 | Llama 3.1 405B | 73.8% |
| 18 | GPT-4 Turbo | 72.6% |
| 19 | GPT-4o mini | 70.2% |
| 20 | Claude 3.5 Haiku | 69.2% |
| 21 | Llama 3.1 70B | 68.0% |
| 22 | Gemini 1.5 Pro | 67.7% |
| 23 | Claude 3 Opus | 60.1% |
| 24 | Mixtral 8x22B | 60.0% |
Saturated benchmarks have top models clustered above 95%, making them less useful for comparison.
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
AI benchmarks are grouped into categories like coding, math, reasoning, knowledge, and safety. Each category contains multiple standardized tests that measure specific aspects of model performance. This page focuses on one category so you can compare models within a specific skill area.
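As a rough sketch of how this grouping can be represented, the snippet below models a category as a list of benchmarks, each carrying per-model scores; the class names, the `competition-math` identifier, and the field layout are illustrative assumptions, not the site's actual schema.

```python
# Illustrative sketch of the category -> benchmark -> per-model score structure
# this page reflects. Names are placeholders, not a real schema.
from dataclasses import dataclass, field

@dataclass
class Benchmark:
    name: str
    why_it_matters: str
    scores: dict[str, float] = field(default_factory=dict)  # model name -> score in %

@dataclass
class Category:
    name: str
    benchmarks: list[Benchmark] = field(default_factory=list)

math = Category(
    name="Math",
    benchmarks=[
        Benchmark(
            name="competition-math",  # placeholder identifier
            why_it_matters="Tests genuine mathematical reasoning, not just pattern matching.",
            scores={"o3-mini": 97.9, "DeepSeek R1": 97.3, "GPT-4o": 76.6},  # values from the table above
        ),
    ],
)

# "Models with coverage" = models with a score on at least one benchmark in the category.
covered = {model for b in math.benchmarks for model in b.scores}
print(f"{len(math.benchmarks)} benchmark(s), {len(covered)} model(s) with coverage")
```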
Each benchmark has its own scoring method — accuracy percentage, pass rate, Elo rating, or normalized score. We display raw scores from official evaluations and community-run tests. Scores are updated hourly as new evaluation results become available.
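For illustration, one way to keep raw scores displayable without mixing scoring methods is to store each value alongside a unit tag, as in the hypothetical sketch below; the `Score` type, the unit labels, and the Elo value shown are assumptions made for the example, not data from this page.

```python
# Hypothetical example: keep each raw score with its unit so accuracy percentages,
# pass rates, and Elo ratings are displayed as-is and never averaged together.
from typing import NamedTuple

class Score(NamedTuple):
    model: str
    value: float
    unit: str  # e.g. "accuracy_pct", "pass_rate_pct", "elo", "normalized"

def format_score(s: Score) -> str:
    # Percent-style units get a % sign; everything else is shown as a raw number.
    return f"{s.value:.1f}%" if s.unit.endswith("_pct") else f"{s.value:g} ({s.unit})"

scores = [
    Score("o3-mini", 97.9, "accuracy_pct"),      # from the table above
    Score("DeepSeek R1", 97.3, "accuracy_pct"),  # from the table above
    Score("example-model", 1250, "elo"),         # made-up value, for illustration only
]

for s in scores:
    print(f"{s.model}: {format_score(s)}")
```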
A saturated benchmark is one where top models score near the maximum (typically above 95%). This means the benchmark no longer effectively differentiates between the best models, and newer, harder benchmarks are needed to measure progress.
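A minimal sketch of that rule, assuming percentage scores and the 95% threshold used on this page; the `is_saturated` function and the `top_n` cutoff are our own illustrative choices, not the site's actual logic.

```python
def is_saturated(scores: list[float], top_n: int = 3, threshold: float = 95.0) -> bool:
    """Treat a benchmark as saturated when its top_n scores all sit above threshold (%)."""
    top = sorted(scores, reverse=True)[:top_n]
    return len(top) == top_n and all(s >= threshold for s in top)

print(is_saturated([96.5, 96.1, 95.8, 91.0]))  # True: top scores cluster near the ceiling
print(is_saturated([82.0, 74.5, 69.9]))        # False: still differentiates top models
```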