Arena Benchmarks

Compare top models on the benchmarks that best represent arena-style performance. Use this page to inspect the relevant tests quickly, then open the full matrix when you want broader context.

What Is Included

The current benchmark set in this category, with context on what each test captures.

LMSYS Chatbot Arena Elo Rating

A widely cited 'vibes-based' benchmark: it reflects real human preferences rather than academic metrics alone, and is often treated as the most meaningful overall ranking.

Metric: Elo rating

LiveBench (Dynamic)

Designed to limit contamination by regularly releasing new questions. Top models still score below 70%, so it continues to separate them clearly.

Metric: average score (%)

LMSYS Chatbot Arena Elo Rating

Human preference rating from 6M+ crowdsourced blind head-to-head comparisons. Users chat with two anonymous models and pick the better response.
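To make the idea of a rating built from head-to-head votes concrete, here is a minimal sketch of the classic online Elo update. It is an illustration only: the starting rating of 1000, the K-factor of 32, and the model names are assumptions, and the leaderboard's actual fitting procedure is not described on this page and may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    outcome_a is 1.0 if A's response was picked, 0.0 if B's was, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes, all starting from an assumed rating of 1000.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # user preferred model_x
    ("model_x", "model_y", 1.0),  # user preferred model_x again
    ("model_x", "model_y", 0.0),  # user preferred model_y
]

for a, b, outcome_a in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], outcome_a)

print(ratings)
```

Each vote moves the two ratings by equal and opposite amounts, so an upset (the lower-rated model winning) shifts the ratings more than an expected result does.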

Metric: Elo rating

Arena Elo Scores (23 models)

#	Model	Provider	Type	Score	Tier
1	Gemini 2.5 Pro	Google	hybrid	1444	Strong
2	Claude Opus 4.5	Anthropic	hybrid	1430	Strong
3	o3-mini	OpenAI	reasoning	1371	Good
4	DeepSeek R1	DeepSeek	reasoning	1369	Good
5	Claude 3.7 Sonnet	Anthropic	hybrid	1354	Good
6	Gemini 2.0 Flash	Google	standard	1352	Good
7	o1	OpenAI	reasoning	1350	Good
8	DeepSeek V3	DeepSeek	standard	1318	Good
9	o1-mini	OpenAI	reasoning	1304	Good
10	GPT-4o	OpenAI	standard	1286	Good
11	Gemini 1.5 Pro	Google	standard	1281	Good
12	Claude 3.5 Sonnet	Anthropic	standard	1271	Good
13	Grok 2	xAI	standard	1262	Good
14	Qwen 2.5 72B	Alibaba	standard	1261	Good
15	GPT-4 Turbo	OpenAI	standard	1255	Good
16	Mistral Large 2	Mistral AI	standard	1250	Good
17	Llama 3.3 70B	Meta	standard	1243	Below Avg
18	Claude 3 Opus	Anthropic	standard	1232	Below Avg
19	Llama 3.1 405B	Meta	standard	1229	Below Avg
20	GPT-4o mini	OpenAI	standard	1222	Below Avg
21	Llama 3.1 70B	Meta	standard	1198	Below Avg
22	Claude 3.5 Haiku	Anthropic	standard	1178	Below Avg
23	Mixtral 8x22B	Mistral AI	standard	1146	Below Avg
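One way to read the score gaps in the table is to convert them into an expected head-to-head preference rate. The sketch below assumes the scores sit on the conventional Elo scale, where a 400-point gap corresponds to 10:1 odds. The page does not confirm that scale, so treat the results as approximate.

```python
def win_probability(score_a: float, score_b: float) -> float:
    """Expected share of comparisons in which A is preferred over B.

    Assumes the conventional Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400.0))


# Scores taken from the table above.
gemini_2_5_pro = 1444
claude_opus_4_5 = 1430
mixtral_8x22b = 1146

print(round(win_probability(gemini_2_5_pro, claude_opus_4_5), 3))
print(round(win_probability(gemini_2_5_pro, mixtral_8x22b), 3))
```

Under that assumption, the 14-point gap between the top two models corresponds to a preference rate of roughly 52%, close to a coin flip, while the 298-point gap between first and last place corresponds to roughly 85%.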

How to Read This Page

Performance Tiers

Elite — Top 10% of the score range

Strong — Top 25% of the score range

Good — Above the midpoint

Below Average — Below the midpoint
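The tier rules above can be sketched as a small function. The page does not state which minimum and maximum define "the score range", so `lo` and `hi` are explicit parameters here, and the function name and signature are hypothetical. If you pass the lowest and highest Arena scores shown on this page, the labels it produces need not match the Tier column, which is presumably computed against a different reference range.

```python
def tier(score: float, lo: float, hi: float) -> str:
    """Map a score to a tier by its position within the range [lo, hi].

    lo and hi are assumptions supplied by the caller; the page does not
    say which reference range it uses.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    position = (score - lo) / (hi - lo)  # 0.0 at lo, 1.0 at hi
    if position >= 0.90:
        return "Elite"  # top 10% of the range
    if position >= 0.75:
        return "Strong"  # top 25% of the range
    if position >= 0.50:
        return "Good"  # above the midpoint
    return "Below Average"


# Illustration using the observed Arena minimum and maximum as the range.
for s in (1444, 1371, 1318, 1286):
    print(s, tier(s, lo=1146, hi=1444))
```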

Model Types

Standard — Direct inference, no chain-of-thought

Reasoning — Extended thinking (o1, R1) — slower but excels on math/reasoning

Hybrid — Optional thinking mode (Claude 3.7, Gemini 2.5) — can switch between fast and deep

Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.

Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.

Frequently Asked Questions

How are AI benchmarks categorized?

AI benchmarks are grouped into categories such as coding, math, reasoning, knowledge, and safety. Each category contains multiple standardized tests that measure specific aspects of model performance. This page focuses on one category so you can compare models within a single skill area.

How are benchmark scores calculated?

Each benchmark has its own scoring method: accuracy percentage, pass rate, Elo rating, or normalized score. We display raw scores from official evaluations and community-run tests. Scores are updated hourly as new evaluation results become available.

What is a saturated benchmark?

A saturated benchmark is one where top models score near the maximum (typically above 95%). Such a benchmark no longer differentiates effectively between the best models, so newer, harder benchmarks are needed to measure progress.
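As a rough illustration of the saturation idea, the sketch below flags a benchmark when its top few scores all sit at or above a threshold. The function name, the choice of the top three models, and the example scores are assumptions for illustration. The threshold is a parameter because the page cites 95% here and 90% in the reading guide.

```python
def is_saturated(scores_pct, threshold: float = 95.0, top_k: int = 3) -> bool:
    """Return True when the top_k scores are all at or above threshold.

    scores_pct: percentage scores (0-100) for the models on one benchmark.
    threshold and top_k are assumed defaults, not values defined by the page.
    """
    top = sorted(scores_pct, reverse=True)[:top_k]
    return len(top) == top_k and min(top) >= threshold


# Hypothetical score lists, not real results.
clustered_near_max = [97.1, 96.4, 95.8, 88.0]
still_discriminating = [69.0, 64.5, 58.2, 41.0]

print(is_saturated(clustered_near_max))
print(is_saturated(still_discriminating))
print(is_saturated([93.0, 92.0, 91.0], threshold=90.0))
```

The check only applies to percentage-style metrics. An Elo-rated benchmark has no fixed maximum, so it cannot saturate in this sense.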