Compare top models across the benchmark suite that best represents coding performance. Use this page as the fastest way to inspect the relevant tests, then jump into the full matrix when you want broader context.
Benchmarks in category: 6
Models with coverage: 49
Benchmarks with human baseline: 0
Saturated benchmarks: 1
The current benchmark set in this category, with context on what each test captures:

- The most recognized coding benchmark, though becoming saturated above 90%. Evidence of training data contamination in some models.
- The gold standard for real-world coding ability. Unlike HumanEval, tests understanding of large codebases, debugging, and complex changes. Scores range 20-80%.
- More realistic than HumanEval — tests practical programming skills including library usage, API calls, and multi-file reasoning.
- The first benchmark designed specifically for agentic coding assistants that edit multiple files. More realistic than single-function benchmarks like HumanEval.
- Measures agentic capability in terminal environments — critical for AI coding assistants that execute commands and manage development workflows.
- Most real codebases are polyglot. This benchmark tests whether coding models can handle the diversity of languages seen in production software engineering.
HumanEval: 164 Python function-generation problems where models must write correct code from docstrings, tested against unit tests. The original code generation benchmark.

Why it matters: it is the most recognized coding benchmark, though it is becoming saturated above 90%, and there is evidence of training data contamination in some models.
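To make the task format concrete, the sketch below shows what a HumanEval-style problem looks like: the prompt is a function signature plus docstring, the model supplies the body, and hidden unit tests decide pass or fail. The problem, completion, and tests here are illustrative stand-ins, not items from the actual dataset.

```python
# Illustrative HumanEval-style evaluation: the prompt is a signature + docstring,
# the model completes the body, and hidden unit tests decide pass/fail.
# This problem is a made-up stand-in, not an actual dataset item.

PROMPT = '''
def running_max(numbers: list[float]) -> list[float]:
    """Return a list where element i is the maximum of numbers[:i+1]."""
'''

# A candidate completion produced by a model under evaluation.
CANDIDATE_BODY = """
    result, best = [], float("-inf")
    for x in numbers:
        best = max(best, x)
        result.append(best)
    return result
"""

def check(candidate):
    """Hidden unit tests: the completion passes only if every assertion holds."""
    assert candidate([1, 2, 1, 5]) == [1, 2, 2, 5]
    assert candidate([3]) == [3]
    assert candidate([-2, -5, -1]) == [-2, -2, -1]

# Assemble and execute the completion, then run the tests (pass/fail per problem).
namespace = {}
exec(PROMPT + CANDIDATE_BODY, namespace)
check(namespace["running_max"])
print("passed")
```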
| # | Model | Score |
|---|---|---|
| 1 | 🥇 GPT-5.4 | 97.5% |
| 2 | 🥈 o3 | 97.0% |
| 3 | 🥉 GPT-5.2 | 97.0% |
| 4 | GPT-5 | 96.5% |
| 5 | Claude Opus 4.6 | 96.0% |
| 6 | Gemini 3 Pro | 96.0% |
| 7 | Grok 4 | 95.5% |
| 8 | Claude Opus 4.5 | 95.2% |
| 9 | Claude Sonnet 4.6 | 95.2% |
| 10 | Claude Opus 4 | 95.0% |
| 11 | o4-mini | 95.0% |
| 12 | Claude Sonnet 4.5 | 94.5% |
| 13 | Claude 3.7 Sonnet | 94.0% |
| 14 | Claude Sonnet 4 | 93.8% |
| 15 | Claude 3.5 Sonnet | 93.7% |
| 16 | Qwen 2.5 Coder 32B | 92.7% |
| 17 | o1 | 92.4% |
| 18 | o1-mini | 92.4% |
| 19 | Mistral Large 2 | 92.0% |
| 20 | Gemini 3 Flash | 92.0% |
| 21 | GPT-4.1 | 91.5% |
| 22 | Grok 3 | 90.5% |
| 23 | GPT-4o | 90.2% |
| 24 | Gemini 2.5 Flash | 90.0% |
| 25 | Claude Haiku 4.5 | 89.8% |
| 26 | Llama 4 Maverick | 89.5% |
| 27 | Gemini 2.0 Flash | 89.4% |
| 28 | Llama 3.1 405B | 89.0% |
| 29 | Llama 3.3 70B | 88.4% |
| 30 | Claude 3.5 Haiku | 88.1% |
| 31 | GPT-4o mini | 87.2% |
| 32 | GPT-4 Turbo | 87.1% |
| 33 | Qwen 2.5 72B | 86.6% |
| 34 | Claude 3 Opus | 84.9% |
| 35 | DeepSeek V3 (March 2025) | 84.5% |
| 36 | Gemini 1.5 Pro | 84.1% |
| 37 | Grok 2 | 84.0% |
| 38 | DeepSeek V3 | 82.6% |
| 39 | Phi-4 | 82.6% |
| 40 | Llama 3.1 70B | 80.5% |
| 41 | Mixtral 8x22B | 78.6% |
| 42 | Command R+ | 74.3% |
| 43 | Gemma 2 27B | 69.5% |
Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
AI benchmarks are grouped into categories like coding, math, reasoning, knowledge, and safety. Each category contains multiple standardized tests that measure specific aspects of model performance. This page focuses on one category so you can compare models within a specific skill area.
Each benchmark has its own scoring method: accuracy percentage, pass rate, Elo rating, or normalized score. We display raw scores from official evaluations and community-run tests. Scores are updated hourly as new evaluation results become available.
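For pass-rate benchmarks such as HumanEval, the usual headline number is pass@k, estimated from n sampled completions per problem of which c pass the unit tests. Below is a minimal sketch of the standard unbiased estimator; the sample counts in the usage lines are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    completions, drawn without replacement from n samples of which c are
    correct, passes the tests: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical run: 200 samples per problem, 137 of them passed the tests.
print(f"pass@1  = {pass_at_k(200, 137, 1):.3f}")   # 0.685
print(f"pass@10 = {pass_at_k(200, 137, 10):.3f}")
```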
A saturated benchmark is one where top models score near the maximum (typically above 95%). This means the benchmark no longer effectively differentiates between the best models, and newer, harder benchmarks are needed to measure progress.
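Applying that definition is a one-line check over a leaderboard: take the top few scores and test whether they all clear the threshold. A small sketch follows, using made-up score lists and the 95% cutoff mentioned above; the function name and the top-5 window are arbitrary illustrative choices.

```python
def is_saturated(scores: list[float], threshold: float = 95.0, top_n: int = 5) -> bool:
    """Flag a benchmark as saturated when its best top_n scores all clear the threshold."""
    top = sorted(scores, reverse=True)[:top_n]
    return len(top) == top_n and all(s >= threshold for s in top)

# Hypothetical leaderboards (percent scores).
print(is_saturated([97.5, 97.0, 97.0, 96.5, 96.0, 92.0]))  # True  -> saturated
print(is_saturated([74.0, 71.5, 68.0, 62.0, 55.0]))        # False -> still discriminative
```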