Benchmark category

Reasoning Benchmarks

Compare top models across the benchmark suite that best represents reasoning performance. Use this page as the fastest way to inspect the relevant tests, then jump into the full matrix when you want broader context.

Benchmarks in category

Models with coverage

Benchmarks with human baseline

Saturated benchmarks

What Is Included

The current benchmark set in this category, with context on what each test captures.

All benchmarks

Graduate-Level Google-Proof Q&A (Diamond)

One of the best discriminators between models. Scores range widely (40-85%), making it highly informative for comparing reasoning ability.

accuracy %Human 65%

AI2 Reasoning Challenge (Challenge Set)

Saturated

Tests commonsense scientific reasoning. Largely saturated for frontier models but still useful for comparing mid-tier and open-source models.

accuracy %

HellaSwag Commonsense NLI

Saturated

Fundamental commonsense reasoning test. Saturated for frontier models (>95%) but useful for evaluating smaller models.

accuracy %Human 95.6%

Humanity's Last Exam

The hardest academic benchmark — top models still fail 60-65% of questions. Shows how far we are from genuine expert-level reasoning.

accuracy %

Model type:

Graduate-Level Google-Proof Q&A (Diamond)

Expert-level science reasoning across biology, chemistry, and physics at PhD level. Questions are designed to be 'Google-proof' — even domain experts with web access struggle.

Metric: accuracy %Human baseline: 65%

Why it matters

One of the best discriminators between models. Scores range widely (40-85%), making it highly informative for comparing reasoning ability.

GPQA Diamond Scores (20 models)

Standard Reasoning Hybrid

#	Model	Provider	Type	Score	Tier
1	🥇Claude Opus 4.5	Anthropic	hybrid	86.2%	Strong
2	🥈Claude 3.7 Sonnet	Anthropic	hybrid	84.8%	Strong
3	🥉Gemini 2.5 Pro	Google	hybrid	84.0%	Strong
4	o3-mini	OpenAI	reasoning	79.7%	Strong
5	o1	OpenAI	reasoning	78.0%	Strong
6	DeepSeek R1	DeepSeek	reasoning	71.5%	Good
7	Claude 3.5 Sonnet	Anthropic	standard	65.0%	Good
8	Gemini 2.0 Flash	Google	standard	62.1%	Good
9	o1-mini	OpenAI	reasoning	60.0%	Good
10	Gemini 1.5 Pro	Google	standard	59.1%	Good
11	DeepSeek V3	DeepSeek	standard	59.1%	Good
12	GPT-4o	OpenAI	standard	53.6%	Good
13	Llama 3.1 405B	Meta	standard	51.1%	Good
14	Llama 3.3 70B	Meta	standard	50.5%	Good
15	Claude 3 Opus	Anthropic	standard	50.4%	Good
16	Qwen 2.5 72B	Alibaba	standard	49.0%	Below Avg
17	GPT-4 Turbo	OpenAI	standard	48.0%	Below Avg
18	Llama 3.1 70B	Meta	standard	46.7%	Below Avg
19	Claude 3.5 Haiku	Anthropic	standard	41.6%	Below Avg
20	GPT-4o mini	OpenAI	standard	40.2%	Below Avg

How to Read This Page

Performance Tiers

Elite — Top 10% of the score range

Strong — Top 25% of the score range

Good — Above the midpoint

Below Average — Below the midpoint

Model Types

Standard — Direct inference, no chain-of-thought

Reasoning — Extended thinking (o1, R1) — slower but excels on math/reasoning

Hybrid — Optional thinking mode (Claude 3.7, Gemini 2.5) — can switch between fast and deep

Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.

Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.

Frequently Asked Questions

AI benchmarks are grouped into categories like coding, math, reasoning, knowledge, and safety. Each category contains multiple standardized tests that measure specific aspects of model performance. This page focuses on one category so you can compare models within a specific skill area.

Each benchmark has its own scoring method — accuracy percentage, pass rate, Elo rating, or normalized score. We display raw scores from official evaluations and community-run tests. Scores are updated hourly as new evaluation results become available.

A saturated benchmark is one where top models score near the maximum (typically above 95%). This means the benchmark no longer effectively differentiates between the best models, and newer, harder benchmarks are needed to measure progress.