Real benchmark scores from official model cards and third-party evaluations. Compare 49 models across 20 benchmarks, from MMLU and GPQA Diamond to SWE-bench and Arena Elo. Filter by category or model type, and switch between chart and matrix views.
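Under the hood, each cell of the comparison is one (model, benchmark) score, and the category and model-type filters simply select rows from that table. A minimal Python sketch of that filtering, with hypothetical field names and labels rather than the site's actual data model:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkScore:
    model: str          # e.g. "Claude Opus 4.5"
    model_type: str     # e.g. "reasoning" or "standard" (hypothetical labels)
    benchmark: str      # e.g. "MMLU"
    category: str       # e.g. "knowledge", "coding"
    score: float        # reported score, in percent

def filter_scores(scores, category=None, model_type=None):
    """Return only the rows matching the selected category / model type."""
    return [
        s for s in scores
        if (category is None or s.category == category)
        and (model_type is None or s.model_type == model_type)
    ]

# Example: all knowledge-category scores for reasoning models
# knowledge_reasoning = filter_scores(all_scores,
#                                     category="knowledge",
#                                     model_type="reasoning")
```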
Jump directly into the strongest benchmark clusters instead of starting from the full matrix.
- Knowledge: 3 benchmarks. Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
- Reasoning: 4 benchmarks. Among the best discriminators between models; scores range widely (40-85%), making them highly informative for comparing reasoning ability.
- Math: 3 benchmarks. Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
- Coding: 6 benchmarks. Includes the most recognized coding benchmark, though it is becoming saturated above 90%, with evidence of training data contamination in some models.
- Instruction following: 2 benchmarks. Measures instruction-following precision, critical for production applications. Models that score well here are more reliable in structured tasks.
- Human preference: 2 benchmarks. The most trusted 'vibes-based' benchmark; it reflects real human preferences, not just academic metrics, and is widely considered the most meaningful overall ranking.
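To make the human-preference rating concrete: an Arena-style Elo score is accumulated from many pairwise votes, each one nudging the winner's rating up and the loser's down. A simplified sketch of the classic Elo update (the live Arena leaderboard uses a more elaborate statistical fit, so treat this as illustrative only):

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One pairwise vote: return updated ratings for models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: two models start at 1000; A wins one head-to-head vote
# a, b = elo_update(1000.0, 1000.0, a_wins=True)   # -> (1016.0, 984.0)
```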
Tests broad knowledge across 57 academic subjects (STEM, humanities, social sciences) with about 16,000 multiple-choice questions. The most widely cited LLM benchmark.
Why it matters
Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
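The reported MMLU figure is, at its core, multiple-choice accuracy. The sketch below assumes per-subject accuracies are macro-averaged (some reports use plain overall accuracy instead), and the data format is purely illustrative:

```python
from collections import defaultdict

def mmlu_score(results):
    """results: iterable of (subject, predicted_choice, correct_choice).

    Returns the macro average of per-subject accuracies, in percent.
    (The averaging convention is an assumption; some evaluations
    report plain overall accuracy instead.)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, answer in results:
        total[subject] += 1
        correct[subject] += int(predicted == answer)
    per_subject = [correct[s] / total[s] for s in total]
    return 100.0 * sum(per_subject) / len(per_subject)

# Example: two subjects at 3/4 and 1/2 correct -> (75% + 50%) / 2 = 62.5
# mmlu_score([("physics", "A", "A"), ("physics", "B", "B"),
#             ("physics", "C", "C"), ("physics", "D", "A"),
#             ("law", "A", "A"), ("law", "B", "C")])  # -> 62.5
```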
| # | Model | Score |
|---|---|---|
| 1 | 🥇 GPT-5.4 | 94.0% |
| 2 | 🥈 GPT-5.2 | 93.5% |
| 3 | 🥉 GPT-5 | 93.0% |
| 4 | Gemini 3 Pro | 92.5% |
| 5 | o3 | 92.3% |
| 6 | Claude Opus 4.6 | 92.1% |
| 7 | o1 | 91.8% |
| 8 | DeepSeek R1-0528 | 91.5% |
| 9 | Grok 4 | 91.5% |
| 10 | Claude Opus 4.5 | 91.4% |
| 11 | Claude Sonnet 4.6 | 91.2% |
| 12 | Claude Opus 4 | 91.0% |
| 13 | Gemini 2.5 Pro | 90.8% |
| 14 | DeepSeek R1 | 90.8% |
| 15 | Claude Sonnet 4.5 | 90.8% |
| 16 | Claude 3.7 Sonnet | 90.2% |
| 17 | Claude Sonnet 4 | 89.5% |
| 18 | GPT-4.1 | 89.2% |
| 19 | DeepSeek V3 (March 2025) | 89.2% |
| 20 | GPT-4o | 88.7% |
| 21 | Claude 3.5 Sonnet | 88.7% |
| 22 | Llama 3.1 405B | 88.6% |
| 23 | DeepSeek V3 | 88.5% |
| 24 | Grok 3 | 88.5% |
| 25 | Llama 4 Maverick | 88.0% |
| 26 | Gemini 3 Flash | 88.0% |
| 27 | Grok 2 | 87.5% |
| 28 | o3-mini | 86.9% |
| 29 | Claude 3 Opus | 86.8% |
| 30 | GPT-4 Turbo | 86.5% |
| 31 | Llama 3.3 70B | 86.3% |
| 32 | Qwen 2.5 72B | 86.1% |
| 33 | Llama 3.1 70B | 86.0% |
| 34 | Gemini 1.5 Pro | 85.9% |
| 35 | Gemini 2.5 Flash | 85.8% |
| 36 | o1-mini | 85.2% |
| 37 | Phi-4 | 84.8% |
| 38 | Mistral Large 2 | 84.7% |
| 39 | Claude Haiku 4.5 | 84.5% |
| 40 | Mistral Large 2 | 84.0% |
| 41 | GPT-4o mini | 82.0% |
| 42 | Claude 3.5 Haiku | 80.9% |
| 43 | Mixtral 8x22B | 77.3% |
| 44 | Gemini 2.0 Flash | 76.4% |
| 45 | Command R+ | 75.7% |
| 46 | Gemma 2 27B | 75.2% |
Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.
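One way to operationalize "saturated" from the note above is to check whether the top handful of models all sit above the 90% mark; the cutoff of five models here is an arbitrary choice for illustration:

```python
def is_saturated(scores, threshold=90.0, top_n=5):
    """True if the top_n highest scores all sit above the threshold."""
    top = sorted(scores, reverse=True)[:top_n]
    return len(top) == top_n and all(s > threshold for s in top)

# MMLU leaderboard above: top five are 94.0, 93.5, 93.0, 92.5, 92.3 -> saturated
# is_saturated([94.0, 93.5, 93.0, 92.5, 92.3, 92.1])  # -> True
```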
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
Benchmarks tell only part of the story. Our composite score combines real-world capabilities, pricing, context window, and more. Compare models head-to-head or explore the full leaderboard for a complete picture.
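The exact weights behind the composite score are not spelled out on this page, so the sketch below only shows the general weighted-sum shape, with placeholder metric names and weights:

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over normalized metrics (each assumed to be in 0..1).

    The metric names and weights are placeholders, not the leaderboard's
    real formula.
    """
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight

# Example with hypothetical, already-normalized inputs:
# composite_score(
#     {"benchmarks": 0.92, "price": 0.60, "context": 0.75},
#     {"benchmarks": 0.5, "price": 0.3, "context": 0.2},
# )  # -> 0.92*0.5 + 0.60*0.3 + 0.75*0.2 = 0.79
```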
AI benchmarks are standardized tests that measure how well AI models perform at specific tasks. Common benchmarks include MMLU (general knowledge), SWE-bench (coding), GPQA (science reasoning), MATH-500 (math), Arena Elo (human preference), and HumanEval (code generation).
No single benchmark captures the full picture. MMLU tests breadth of knowledge, SWE-bench tests real-world coding ability, and Arena Elo reflects human preferences. We recommend looking at multiple benchmarks together, which is why our composite score weights several dimensions.
Our benchmark data is refreshed hourly from provider APIs and community evaluations. New benchmarks are added as they become industry-standard. Arena Elo ratings update continuously based on user votes.
Benchmarks are useful indicators but not perfect predictors. A model scoring well on MMLU may not be the best for creative writing, and high SWE-bench scores do not guarantee faster coding assistance. Real-world performance depends on your specific use case, prompt engineering, and integration approach.