Benchmark category

Knowledge Benchmarks

Compare top models across the benchmark suite that best represents knowledge performance. Use this page as the fastest way to inspect the relevant tests, then jump into the full matrix when you want broader context.

Benchmarks in category

Models with coverage

Benchmarks with human baseline

Saturated benchmarks

What Is Included

The current benchmark set in this category, with context on what each test captures.

All benchmarks

Massive Multitask Language Understanding

Saturated

Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.

accuracy %Human 89.8%

MMLU Professional

Better at differentiating top models since scores are 16-33% lower than standard MMLU. Tests reasoning in addition to knowledge.

accuracy %

SimpleQA Factual Accuracy

GPT-4o scores below 40%, making it surprisingly challenging. Tests honesty and factual reliability, not just knowledge breadth.

accuracy %

Model type:

Massive Multitask Language Understanding

Tests broad knowledge across 57 academic subjects (STEM, humanities, social sciences) with 16,000 multiple-choice questions. The most widely-cited LLM benchmark.

Metric: accuracy %Human baseline: 89.8%Saturated

Why it matters

Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.

MMLU Scores (24 models)

Standard Reasoning Hybrid

#	Model	Provider	Type	Score	Tier
1	🥇o1	OpenAI	reasoning	91.8%	Elite
2	🥈Claude Opus 4.5	Anthropic	hybrid	91.4%	Elite
3	🥉Gemini 2.5 Pro	Google	hybrid	90.8%	Elite
4	DeepSeek R1	DeepSeek	reasoning	90.8%	Elite
5	Claude 3.7 Sonnet	Anthropic	hybrid	90.2%	Elite
6	GPT-4o	OpenAI	standard	88.7%	Strong
7	Claude 3.5 Sonnet	Anthropic	standard	88.7%	Strong
8	Llama 3.1 405B	Meta	standard	88.6%	Strong
9	DeepSeek V3	DeepSeek	standard	88.5%	Strong
10	Grok 2	xAI	standard	87.5%	Strong
11	o3-mini	OpenAI	reasoning	86.9%	Strong
12	Claude 3 Opus	Anthropic	standard	86.8%	Strong
13	GPT-4 Turbo	OpenAI	standard	86.5%	Strong
14	Llama 3.3 70B	Meta	standard	86.3%	Strong
15	Qwen 2.5 72B	Alibaba	standard	86.1%	Strong
16	Llama 3.1 70B	Meta	standard	86.0%	Strong
17	Gemini 1.5 Pro	Google	standard	85.9%	Strong
18	o1-mini	OpenAI	reasoning	85.2%	Strong
19	Mistral Large 2	Mistral AI	standard	84.0%	Strong
20	GPT-4o mini	OpenAI	standard	82.0%	Strong
21	Claude 3.5 Haiku	Anthropic	standard	80.9%	Strong
22	Mixtral 8x22B	Mistral AI	standard	77.3%	Strong
23	Gemini 2.0 Flash	Google	standard	76.4%	Strong
24	Command R+	Cohere	standard	75.7%	Strong

How to Read This Page

Performance Tiers

Elite — Top 10% of the score range

Strong — Top 25% of the score range

Good — Above the midpoint

Below Average — Below the midpoint

Model Types

Standard — Direct inference, no chain-of-thought

Reasoning — Extended thinking (o1, R1) — slower but excels on math/reasoning

Hybrid — Optional thinking mode (Claude 3.7, Gemini 2.5) — can switch between fast and deep

Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.

Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.

Frequently Asked Questions

AI benchmarks are grouped into categories like coding, math, reasoning, knowledge, and safety. Each category contains multiple standardized tests that measure specific aspects of model performance. This page focuses on one category so you can compare models within a specific skill area.

Each benchmark has its own scoring method — accuracy percentage, pass rate, Elo rating, or normalized score. We display raw scores from official evaluations and community-run tests. Scores are updated hourly as new evaluation results become available.

A saturated benchmark is one where top models score near the maximum (typically above 95%). This means the benchmark no longer effectively differentiates between the best models, and newer, harder benchmarks are needed to measure progress.