Compare top models across the benchmark suite that best represents math performance. Use this page as the fastest way to inspect the relevant tests, then jump into the full matrix when you want broader context.
- Benchmarks in category: 3
- Models with coverage: 25
- Benchmarks with human baseline: 0
- Saturated benchmarks: 1
The current benchmark set in this category, with context on what each test captures.
- Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
- Tests mathematical reasoning at competition level. Reasoning models achieve 70-90% while standard models struggle below 30%. Best differentiator for math ability.
- Useful baseline for math ability, but now saturated: top models exceed 95%. More useful for evaluating smaller or open-source models.
Competition-level mathematics across algebra, geometry, number theory, counting/probability, intermediate algebra, and precalculus. 500-problem subset used by OpenAI for consistent evaluation.
Why it matters: Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
| # | Model | Score |
|---|---|---|
| 1 | 🥇 o3-mini | 97.9% |
| 2 | 🥈 DeepSeek R1 | 97.3% |
| 3 | 🥉 o1 | 96.4% |
| 4 | Gemini 2.5 Pro | 95.2% |
| 5 | DeepSeek V3 | 90.2% |
| 6 | o1-mini | 90.0% |
| 7 | Gemini 2.0 Flash | 89.7% |
| 8 | Claude Opus 4.5 | 88.1% |
| 9 | Qwen 2.5 Coder 32B | 83.5% |
| 10 | Qwen 2.5 72B | 83.1% |
| 11 | Claude 3.7 Sonnet | 82.2% |
| 12 | Claude 3.5 Sonnet | 78.3% |
| 13 | Llama 3.3 70B | 77.0% |
| 14 | GPT-4o | 76.6% |
| 15 | Grok 2 | 76.1% |
| 16 | Mistral Large 2 | 76.0% |
| 17 | Llama 3.1 405B | 73.8% |
| 18 | GPT-4 Turbo | 72.6% |
| 19 | GPT-4o mini | 70.2% |
| 20 | Claude 3.5 Haiku | 69.2% |
| 21 | Llama 3.1 70B | 68.0% |
| 22 | Gemini 1.5 Pro | 67.7% |
| 23 | Claude 3 Opus | 60.1% |
| 24 | Mixtral 8x22B | 60.0% |
Saturated benchmarks have top models clustered above 95%, making them less useful for comparison.
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
AI benchmarks are grouped into categories like coding, math, reasoning, knowledge, and safety. Each category contains multiple standardized tests that measure specific aspects of model performance. This page focuses on one category so you can compare models within a specific skill area.
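As a rough sketch of how this grouping can be represented, the snippet below models a category as a list of benchmarks, each carrying per-model scores; the class names, the `competition-math` identifier, and the field layout are illustrative assumptions, not the site's actual schema.

```python
# Illustrative sketch of the category -> benchmark -> per-model score structure
# this page reflects. Names are placeholders, not a real schema.
from dataclasses import dataclass, field

@dataclass
class Benchmark:
    name: str
    why_it_matters: str
    scores: dict[str, float] = field(default_factory=dict)  # model name -> score in %

@dataclass
class Category:
    name: str
    benchmarks: list[Benchmark] = field(default_factory=list)

math = Category(
    name="Math",
    benchmarks=[
        Benchmark(
            name="competition-math",  # placeholder identifier
            why_it_matters="Tests genuine mathematical reasoning, not just pattern matching.",
            scores={"o3-mini": 97.9, "DeepSeek R1": 97.3, "GPT-4o": 76.6},  # values from the table above
        ),
    ],
)

# "Models with coverage" = models with a score on at least one benchmark in the category.
covered = {model for b in math.benchmarks for model in b.scores}
print(f"{len(math.benchmarks)} benchmark(s), {len(covered)} model(s) with coverage")
```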
Each benchmark has its own scoring method — accuracy percentage, pass rate, Elo rating, or normalized score. We display raw scores from official evaluations and community-run tests. Scores are updated hourly as new evaluation results become available.
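For illustration, one way to keep raw scores displayable without mixing scoring methods is to store each value alongside a unit tag, as in the hypothetical sketch below; the `Score` type, the unit labels, and the Elo value shown are assumptions made for the example, not data from this page.

```python
# Hypothetical example: keep each raw score with its unit so accuracy percentages,
# pass rates, and Elo ratings are displayed as-is and never averaged together.
from typing import NamedTuple

class Score(NamedTuple):
    model: str
    value: float
    unit: str  # e.g. "accuracy_pct", "pass_rate_pct", "elo", "normalized"

def format_score(s: Score) -> str:
    # Percent-style units get a % sign; everything else is shown as a raw number.
    return f"{s.value:.1f}%" if s.unit.endswith("_pct") else f"{s.value:g} ({s.unit})"

scores = [
    Score("o3-mini", 97.9, "accuracy_pct"),      # from the table above
    Score("DeepSeek R1", 97.3, "accuracy_pct"),  # from the table above
    Score("example-model", 1250, "elo"),         # made-up value, for illustration only
]

for s in scores:
    print(f"{s.model}: {format_score(s)}")
```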
A saturated benchmark is one where top models score near the maximum (typically above 95%). This means the benchmark no longer effectively differentiates between the best models, and newer, harder benchmarks are needed to measure progress.
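A minimal sketch of that rule, assuming percentage scores and the 95% threshold used on this page; the `is_saturated` function and the `top_n` cutoff are our own illustrative choices, not the site's actual logic.

```python
def is_saturated(scores: list[float], top_n: int = 3, threshold: float = 95.0) -> bool:
    """Treat a benchmark as saturated when its top_n scores all sit above threshold (%)."""
    top = sorted(scores, reverse=True)[:top_n]
    return len(top) == top_n and all(s >= threshold for s in top)

print(is_saturated([96.5, 96.1, 95.8, 91.0]))  # True: top scores cluster near the ceiling
print(is_saturated([82.0, 74.5, 69.9]))        # False: still differentiates top models
```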