Real benchmark scores from official model cards and third-party evaluations. Compare 49 models across 20 benchmarks, from MMLU and GPQA Diamond to SWE-bench and Arena Elo. Filter by category or model type, and switch between chart and matrix views.
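Under the hood, each cell of the comparison is one (model, benchmark) score, and the category and model-type filters simply select rows from that table. A minimal Python sketch of that filtering, with hypothetical field names and labels rather than the site's actual data model:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkScore:
    model: str          # e.g. "Claude Opus 4.5"
    model_type: str     # e.g. "reasoning" or "standard" (hypothetical labels)
    benchmark: str      # e.g. "MMLU"
    category: str       # e.g. "knowledge", "coding"
    score: float        # reported score, in percent

def filter_scores(scores, category=None, model_type=None):
    """Return only the rows matching the selected category / model type."""
    return [
        s for s in scores
        if (category is None or s.category == category)
        and (model_type is None or s.model_type == model_type)
    ]

# Example: all knowledge-category scores for reasoning models
# knowledge_reasoning = filter_scores(all_scores,
#                                     category="knowledge",
#                                     model_type="reasoning")
```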
Jump directly into the strongest benchmark clusters instead of starting from the full matrix.
- Knowledge: 3 benchmarks. Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
- Reasoning: 4 benchmarks. Among the best discriminators between models; scores range widely (40-85%), making them highly informative for comparing reasoning ability.
- Math: 3 benchmarks. Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
- Coding: 6 benchmarks. Includes the most recognized coding benchmark, though it is becoming saturated above 90%, with evidence of training data contamination in some models.
- Instruction following: 2 benchmarks. Measures instruction-following precision, critical for production applications. Models that score well here are more reliable in structured tasks.
- Human preference: 2 benchmarks. The most trusted 'vibes-based' benchmark; it reflects real human preferences, not just academic metrics, and is widely considered the most meaningful overall ranking.
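To make the human-preference rating concrete: an Arena-style Elo score is accumulated from many pairwise votes, each one nudging the winner's rating up and the loser's down. A simplified sketch of the classic Elo update (the live Arena leaderboard uses a more elaborate statistical fit, so treat this as illustrative only):

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0) -> tuple[float, float]:
    """One pairwise vote: return updated ratings for models A and B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Example: two models start at 1000; A wins one head-to-head vote
# a, b = elo_update(1000.0, 1000.0, a_wins=True)   # -> (1016.0, 984.0)
```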
Tests broad knowledge across 57 academic subjects (STEM, humanities, social sciences) with about 16,000 multiple-choice questions. The most widely cited LLM benchmark.
Why it matters
Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
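The reported MMLU figure is, at its core, multiple-choice accuracy. The sketch below assumes per-subject accuracies are macro-averaged (some reports use plain overall accuracy instead), and the data format is purely illustrative:

```python
from collections import defaultdict

def mmlu_score(results):
    """results: iterable of (subject, predicted_choice, correct_choice).

    Returns the macro average of per-subject accuracies, in percent.
    (The averaging convention is an assumption; some evaluations
    report plain overall accuracy instead.)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, answer in results:
        total[subject] += 1
        correct[subject] += int(predicted == answer)
    per_subject = [correct[s] / total[s] for s in total]
    return 100.0 * sum(per_subject) / len(per_subject)

# Example: two subjects at 3/4 and 1/2 correct -> (75% + 50%) / 2 = 62.5
# mmlu_score([("physics", "A", "A"), ("physics", "B", "B"),
#             ("physics", "C", "C"), ("physics", "D", "A"),
#             ("law", "A", "A"), ("law", "B", "C")])  # -> 62.5
```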
| # | Model | Score |
|---|---|---|
| 1 | 🥇 GPT-5.4 | 94.0% |
| 2 | 🥈 GPT-5.2 | 93.5% |
| 3 | 🥉 GPT-5 | 93.0% |
| 4 | Gemini 3 Pro | 92.5% |
| 5 | o3 | 92.3% |
| 6 | Claude Opus 4.6 | 92.1% |
| 7 | o1 | 91.8% |
| 8 | DeepSeek R1-0528 | 91.5% |
| 9 | Grok 4 | 91.5% |
| 10 | Claude Opus 4.5 | 91.4% |
| 11 | Claude Sonnet 4.6 | 91.2% |
| 12 | Claude Opus 4 | 91.0% |
| 13 | Gemini 2.5 Pro | 90.8% |
| 14 | DeepSeek R1 | 90.8% |
| 15 | Claude Sonnet 4.5 | 90.8% |
| 16 | Claude 3.7 Sonnet | 90.2% |
| 17 | Claude Sonnet 4 | 89.5% |
| 18 | GPT-4.1 | 89.2% |
| 19 | DeepSeek V3 (March 2025) | 89.2% |
| 20 | GPT-4o | 88.7% |
| 21 | Claude 3.5 Sonnet | 88.7% |
| 22 | Llama 3.1 405B | 88.6% |
| 23 | DeepSeek V3 | 88.5% |
| 24 | Grok 3 | 88.5% |
| 25 | Llama 4 Maverick | 88.0% |
| 26 | Gemini 3 Flash | 88.0% |
| 27 | Grok 2 | 87.5% |
| 28 | o3-mini | 86.9% |
| 29 | Claude 3 Opus | 86.8% |
| 30 | GPT-4 Turbo | 86.5% |
| 31 | Llama 3.3 70B | 86.3% |
| 32 | Qwen 2.5 72B | 86.1% |
| 33 | Llama 3.1 70B | 86.0% |
| 34 | Gemini 1.5 Pro | 85.9% |
| 35 | Gemini 2.5 Flash | 85.8% |
| 36 | o1-mini | 85.2% |
| 37 | Phi-4 | 84.8% |
| 38 | Mistral Large 2 | 84.7% |
| 39 | Claude Haiku 4.5 | 84.5% |
| 40 | Mistral Large 2 | 84.0% |
| 41 | GPT-4o mini | 82.0% |
| 42 | Claude 3.5 Haiku | 80.9% |
| 43 | Mixtral 8x22B | 77.3% |
| 44 | Gemini 2.0 Flash | 76.4% |
| 45 | Command R+ | 75.7% |
| 46 | Gemma 2 27B | 75.2% |
Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.
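One way to operationalize "saturated" from the note above is to check whether the top handful of models all sit above the 90% mark; the cutoff of five models here is an arbitrary choice for illustration:

```python
def is_saturated(scores, threshold=90.0, top_n=5):
    """True if the top_n highest scores all sit above the threshold."""
    top = sorted(scores, reverse=True)[:top_n]
    return len(top) == top_n and all(s > threshold for s in top)

# MMLU leaderboard above: top five are 94.0, 93.5, 93.0, 92.5, 92.3 -> saturated
# is_saturated([94.0, 93.5, 93.0, 92.5, 92.3, 92.1])  # -> True
```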
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
Benchmarks tell only part of the story. Our composite score combines real-world capabilities, pricing, context window, and more. Compare models head-to-head or explore the full leaderboard for a complete picture.
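The exact weights behind the composite score are not spelled out on this page, so the sketch below only shows the general weighted-sum shape, with placeholder metric names and weights:

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over normalized metrics (each assumed to be in 0..1).

    The metric names and weights are placeholders, not the leaderboard's
    real formula.
    """
    total_weight = sum(weights.values())
    return sum(metrics[name] * w for name, w in weights.items()) / total_weight

# Example with hypothetical, already-normalized inputs:
# composite_score(
#     {"benchmarks": 0.92, "price": 0.60, "context": 0.75},
#     {"benchmarks": 0.5, "price": 0.3, "context": 0.2},
# )  # -> 0.92*0.5 + 0.60*0.3 + 0.75*0.2 = 0.79
```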
AI benchmarks are standardized tests that measure how well AI models perform at specific tasks. Common benchmarks include MMLU (general knowledge), SWE-bench (coding), GPQA (science reasoning), MATH-500 (math), Arena Elo (human preference), and HumanEval (code generation).
No single benchmark captures the full picture. MMLU tests breadth of knowledge, SWE-bench tests real-world coding ability, and Arena Elo reflects human preferences. We recommend looking at multiple benchmarks together, which is why our composite score weights several dimensions.
Our benchmark data is refreshed hourly from provider APIs and community evaluations. New benchmarks are added as they become industry-standard. Arena Elo ratings update continuously based on user votes.
Benchmarks are useful indicators but not perfect predictors. A model scoring well on MMLU may not be the best for creative writing, and high SWE-bench scores do not guarantee faster coding assistance. Real-world performance depends on your specific use case, prompt engineering, and integration approach.