Compare top models across the benchmark suite that best represents coding performance. Use this page as the fastest way to inspect the relevant tests, then jump into the full matrix when you want broader context.
Benchmarks in category: 6
Models with coverage: 49
Benchmarks with human baseline: 0
Saturated benchmarks: 1
The current benchmark set in this category, with context on what each test captures:

- The most recognized coding benchmark, though becoming saturated above 90%. Evidence of training data contamination in some models.
- The gold standard for real-world coding ability. Unlike HumanEval, tests understanding of large codebases, debugging, and complex changes. Scores range 20-80%.
- More realistic than HumanEval — tests practical programming skills including library usage, API calls, and multi-file reasoning.
- The first benchmark designed specifically for agentic coding assistants that edit multiple files. More realistic than single-function benchmarks like HumanEval.
- Measures agentic capability in terminal environments — critical for AI coding assistants that execute commands and manage development workflows.
- Most real codebases are polyglot. This benchmark tests whether coding models can handle the diversity of languages seen in production software engineering.
HumanEval: 164 Python function-generation problems where models must write correct code from docstrings, tested against unit tests. The original code generation benchmark.

Why it matters: it is the most recognized coding benchmark, though it is becoming saturated above 90%, and there is evidence of training data contamination in some models.
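To make the task format concrete, the sketch below shows what a HumanEval-style problem looks like: the prompt is a function signature plus docstring, the model supplies the body, and hidden unit tests decide pass or fail. The problem, completion, and tests here are illustrative stand-ins, not items from the actual dataset.

```python
# Illustrative HumanEval-style evaluation: the prompt is a signature + docstring,
# the model completes the body, and hidden unit tests decide pass/fail.
# This problem is a made-up stand-in, not an actual dataset item.

PROMPT = '''
def running_max(numbers: list[float]) -> list[float]:
    """Return a list where element i is the maximum of numbers[:i+1]."""
'''

# A candidate completion produced by a model under evaluation.
CANDIDATE_BODY = """
    result, best = [], float("-inf")
    for x in numbers:
        best = max(best, x)
        result.append(best)
    return result
"""

def check(candidate):
    """Hidden unit tests: the completion passes only if every assertion holds."""
    assert candidate([1, 2, 1, 5]) == [1, 2, 2, 5]
    assert candidate([3]) == [3]
    assert candidate([-2, -5, -1]) == [-2, -2, -1]

# Assemble and execute the completion, then run the tests (pass/fail per problem).
namespace = {}
exec(PROMPT + CANDIDATE_BODY, namespace)
check(namespace["running_max"])
print("passed")
```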
| # | Model | Score |
|---|---|---|
| 1 | 🥇 GPT-5.4 | 97.5% |
| 2 | 🥈 o3 | 97.0% |
| 3 | 🥉 GPT-5.2 | 97.0% |
| 4 | GPT-5 | 96.5% |
| 5 | Claude Opus 4.6 | 96.0% |
| 6 | Gemini 3 Pro | 96.0% |
| 7 | Grok 4 | 95.5% |
| 8 | Claude Opus 4.5 | 95.2% |
| 9 | Claude Sonnet 4.6 | 95.2% |
| 10 | Claude Opus 4 | 95.0% |
| 11 | o4-mini | 95.0% |
| 12 | Claude Sonnet 4.5 | 94.5% |
| 13 | Claude 3.7 Sonnet | 94.0% |
| 14 | Claude Sonnet 4 | 93.8% |
| 15 | Claude 3.5 Sonnet | 93.7% |
| 16 | Qwen 2.5 Coder 32B | 92.7% |
| 17 | o1 | 92.4% |
| 18 | o1-mini | 92.4% |
| 19 | Mistral Large 2 | 92.0% |
| 20 | Gemini 3 Flash | 92.0% |
| 21 | GPT-4.1 | 91.5% |
| 22 | Grok 3 | 90.5% |
| 23 | GPT-4o | 90.2% |
| 24 | Gemini 2.5 Flash | 90.0% |
| 25 | Claude Haiku 4.5 | 89.8% |
| 26 | Llama 4 Maverick | 89.5% |
| 27 | Gemini 2.0 Flash | 89.4% |
| 28 | Llama 3.1 405B | 89.0% |
| 29 | Llama 3.3 70B | 88.4% |
| 30 | Claude 3.5 Haiku | 88.1% |
| 31 | GPT-4o mini | 87.2% |
| 32 | GPT-4 Turbo | 87.1% |
| 33 | Qwen 2.5 72B | 86.6% |
| 34 | Claude 3 Opus | 84.9% |
| 35 | DeepSeek V3 (March 2025) | 84.5% |
| 36 | Gemini 1.5 Pro | 84.1% |
| 37 | Grok 2 | 84.0% |
| 38 | DeepSeek V3 | 82.6% |
| 39 | Phi-4 | 82.6% |
| 40 | Llama 3.1 70B | 80.5% |
| 41 | Mixtral 8x22B | 78.6% |
| 42 | Command R+ | 74.3% |
| 43 | Gemma 2 27B | 69.5% |
Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
AI benchmarks are grouped into categories like coding, math, reasoning, knowledge, and safety. Each category contains multiple standardized tests that measure specific aspects of model performance. This page focuses on one category so you can compare models within a specific skill area.
Each benchmark has its own scoring method: accuracy percentage, pass rate, Elo rating, or normalized score. We display raw scores from official evaluations and community-run tests. Scores are updated hourly as new evaluation results become available.
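For pass-rate benchmarks such as HumanEval, the usual headline number is pass@k, estimated from n sampled completions per problem of which c pass the unit tests. Below is a minimal sketch of the standard unbiased estimator; the sample counts in the usage lines are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    completions, drawn without replacement from n samples of which c are
    correct, passes the tests: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical run: 200 samples per problem, 137 of them passed the tests.
print(f"pass@1  = {pass_at_k(200, 137, 1):.3f}")   # 0.685
print(f"pass@10 = {pass_at_k(200, 137, 10):.3f}")
```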
A saturated benchmark is one where top models score near the maximum (typically above 95%). This means the benchmark no longer effectively differentiates between the best models, and newer, harder benchmarks are needed to measure progress.
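Applying that definition is a one-line check over a leaderboard: take the top few scores and test whether they all clear the threshold. A small sketch follows, using made-up score lists and the 95% cutoff mentioned above; the function name and the top-5 window are arbitrary illustrative choices.

```python
def is_saturated(scores: list[float], threshold: float = 95.0, top_n: int = 5) -> bool:
    """Flag a benchmark as saturated when its best top_n scores all clear the threshold."""
    top = sorted(scores, reverse=True)[:top_n]
    return len(top) == top_n and all(s >= threshold for s in top)

# Hypothetical leaderboards (percent scores).
print(is_saturated([97.5, 97.0, 97.0, 96.5, 96.0, 92.0]))  # True  -> saturated
print(is_saturated([74.0, 71.5, 68.0, 62.0, 55.0]))        # False -> still discriminative
```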