Detailed benchmark results for the top AI models across coding, reasoning, math, and general conversation quality. Each benchmark tests a different capability, and no single model dominates them all.
Evaluates code generation, bug fixing, and software engineering capabilities.
Real-world GitHub issues that require understanding codebases, diagnosing bugs, and writing correct patches across multiple files (SWE-bench).
| # | Model | Resolved % |
|---|---|---|
| 1 | 🥇Claude Opus 4.6 | 72.0% |
| 2 | 🥈GPT-5.2 | 69.8% |
| 3 | 🥉o3 | 69.1% |
| 4 | Claude Sonnet 4.6 | 65.3% |
| 5 | Gemini 3 Pro | 63.8% |
| 6 | Grok 4.1 | 61.2% |
| 7 | DeepSeek V3.1 | 58.5% |
| 8 | Qwen 3.5 397B | 55.7% |
| 9 | Llama 4 Maverick | 52.3% |
| 10 | Mistral Large 3 | 49.8% |
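Resolved % is the share of issues for which the model's generated patch applies cleanly and the repository's tests then pass. A minimal sketch of that per-issue check, assuming a local checkout and a known test command (both hypothetical; real harnesses run each repository in its own containerized environment):

```python
import subprocess

def issue_resolved(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply the model-generated patch, then rerun the issue's tests.
    The issue counts as resolved only if both steps succeed."""
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # patch did not apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0

def resolved_rate(outcomes: list[bool]) -> float:
    """Resolved % = resolved issues / total issues, as reported above."""
    return 100.0 * sum(outcomes) / len(outcomes)
```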
164 hand-written Python programming problems testing function-level code generation from docstrings (HumanEval).
| # | Model | Pass@1 |
|---|---|---|
| 1 | 🥇Claude Opus 4.6 | 96.3% |
| 2 | 🥈GPT-5.2 | 95.7% |
| 3 | 🥉o3 | 95.2% |
| 4 | Gemini 3 Pro | 94.5% |
| 5 | Claude Sonnet 4.6 | 93.9% |
| 6 | Grok 4.1 | 93.1% |
| 7 | DeepSeek V3.1 | 91.5% |
| 8 | Qwen 3.5 397B | 90.8% |
| 9 | Llama 4 Maverick | 89.6% |
| 10 | Mistral Large 3 | 88.4% |
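Pass@1 is usually reported with the unbiased pass@k estimator from the original HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that a random draw of k samples contains at least one pass. Whether this leaderboard uses that estimator or a single greedy sample is not stated, so the sketch below is illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k sampled completions passes),
    given n generated samples of which c passed the unit tests.
    With k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples for one problem, 186 of them pass -> pass@1 = 0.93
print(pass_at_k(200, 186, 1))
```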
974 crowd-sourced Python programming problems designed to test basic programming ability (MBPP).
| # | Model | Pass@1 |
|---|---|---|
| 1 | 🥇GPT-5.2 | 91.2% |
| 2 | 🥈Claude Opus 4.6 | 90.8% |
| 3 | 🥉o3 | 90.1% |
| 4 | Claude Sonnet 4.6 | 89.3% |
| 5 | Gemini 3 Pro | 88.7% |
| 6 | Grok 4.1 | 87.4% |
| 7 | DeepSeek V3.1 | 86.9% |
| 8 | Qwen 3.5 397B | 85.2% |
| 9 | Llama 4 Maverick | 84.1% |
| 10 | DeepSeek R1 | 83.6% |
Tests knowledge breadth, logical reasoning, and scientific understanding.
15,000+ multiple-choice questions across 57 academic subjects spanning STEM, humanities, and social sciences (MMLU).
| # | Model | Accuracy |
|---|---|---|
| 1 | 🥇GPT-5.2 | 92.4% |
| 2 | 🥈Claude Opus 4.6 | 92.1% |
| 3 | 🥉Gemini 3 Pro | 91.3% |
| 4 | o3 | 90.8% |
| 5 | Claude Sonnet 4.6 | 89.5% |
| 6 | Grok 4.1 | 89.1% |
| 7 | Qwen 3.5 397B | 88.2% |
| 8 | Llama 4 Maverick | 87.5% |
| 9 | DeepSeek V3.1 | 86.8% |
| 10 | Mistral Large 3 | 85.9% |
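Accuracy on these multiple-choice suites is plain exact match on the answer letter. A minimal sketch, assuming each model response has already been reduced to a single letter choice:

```python
def multiple_choice_accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Percentage of questions where the predicted letter matches the key."""
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answer_key, strict=True))
    return 100.0 * correct / len(answer_key)

# Example: 3 of 4 questions answered correctly -> 75.0
print(multiple_choice_accuracy(["A", "c", "B", "D"], ["A", "C", "B", "A"]))
```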
Science exam questions requiring reasoning beyond simple retrieval. Tests compositional and multi-step reasoning.
| # | Model | Accuracy |
|---|---|---|
| 1 | 🥇Claude Opus 4.6 | 96.8% |
| 2 | 🥈GPT-5.2 | 96.5% |
| 3 | 🥉o3 | 96.1% |
| 4 | Gemini 3 Pro | 95.4% |
| 5 | Grok 4.1 | 94.7% |
| 6 | Claude Sonnet 4.6 | 94.2% |
| 7 | Qwen 3.5 397B | 93.1% |
| 8 | DeepSeek R1 | 92.5% |
| 9 | Llama 4 Maverick | 91.8% |
| 10 | DeepSeek V3.1 | 91.3% |
Tests commonsense reasoning about physical situations and everyday activities via sentence completion.
| # | Model | Accuracy |
|---|---|---|
| 1 | 🥇GPT-5.2 | 97.2% |
| 2 | 🥈Claude Opus 4.6 | 97.0% |
| 3 | 🥉Gemini 3 Pro | 96.7% |
| 4 | o3 | 96.3% |
| 5 | Grok 4.1 | 95.9% |
| 6 | Claude Sonnet 4.6 | 95.5% |
| 7 | Llama 4 Maverick | 95.0% |
| 8 | Qwen 3.5 397B | 94.6% |
| 9 | DeepSeek V3.1 | 94.1% |
| 10 | Mistral Large 3 | 93.5% |
Graduate-level science questions written by domain experts and designed to be extremely difficult even for specialists (GPQA).
| # | Model | Accuracy |
|---|---|---|
| 1 | 🥇o3 | 79.3% |
| 2 | 🥈Claude Opus 4.6 | 77.8% |
| 3 | 🥉GPT-5.2 | 76.5% |
| 4 | Gemini 3 Pro | 74.2% |
| 5 | DeepSeek R1 | 72.1% |
| 6 | Grok 4.1 | 70.8% |
| 7 | Claude Sonnet 4.6 | 69.5% |
| 8 | Qwen 3.5 397B | 67.3% |
| 9 | Llama 4 Maverick | 64.8% |
| 10 | Mistral Large 3 | 62.1% |
Evaluates mathematical reasoning, from grade-school word problems to competition-level problems.
12,500 competition-level mathematics problems covering algebra, geometry, number theory, combinatorics, and precalculus (the MATH dataset).
| # | Model | Accuracy |
|---|---|---|
| 1 | 🥇o3 | 96.7% |
| 2 | 🥈Claude Opus 4.6 | 94.2% |
| 3 | 🥉DeepSeek R1 | 93.8% |
| 4 | GPT-5.2 | 93.1% |
| 5 | Gemini 3 Pro | 91.5% |
| 6 | Qwen 3.5 397B | 89.4% |
| 7 | Grok 4.1 | 88.7% |
| 8 | Claude Sonnet 4.6 | 87.9% |
| 9 | Llama 4 Maverick | 85.2% |
| 10 | Mistral Large 3 | 82.6% |
8,500 grade-school-level math word problems requiring multi-step arithmetic reasoning (GSM8K).
| # | Model | Accuracy |
|---|---|---|
| 1 | 🥇o3 | 98.5% |
| 2 | 🥈GPT-5.2 | 97.8% |
| 3 | 🥉Claude Opus 4.6 | 97.5% |
| 4 | Gemini 3 Pro | 97.1% |
| 5 | DeepSeek R1 | 96.9% |
| 6 | Claude Sonnet 4.6 | 96.4% |
| 7 | Grok 4.1 | 95.8% |
| 8 | Qwen 3.5 397B | 95.2% |
| 9 | Llama 4 Maverick | 94.5% |
| 10 | DeepSeek V3.1 | 93.9% |
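Both math tables are graded by exact match on the final answer, so harnesses typically parse the last number (or boxed expression) out of the model's worked solution. A minimal sketch for the numeric case; the regex-based extraction is an assumption about a typical harness, not these leaderboards' exact method:

```python
import re

def final_number(solution: str) -> float | None:
    """Pull the last number out of a model's step-by-step solution."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", solution.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(solution: str, gold: float, tol: float = 1e-6) -> bool:
    """Exact-match grading on the final numeric answer."""
    pred = final_number(solution)
    return pred is not None and abs(pred - gold) < tol

# Example: a two-step word problem whose answer is 42.
print(is_correct("Each box holds 6 pencils and there are 7 boxes: 6 * 7 = 42.", 42))
```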
Human preference evaluations and conversational quality assessments.
Crowdsourced blind comparison (Chatbot Arena) where users chat with two anonymous models and vote for the better response. Elo ratings reflect aggregate human preferences.
| # | Model | Elo Rating |
|---|---|---|
| 1 | 🥇GPT-5.2 | 1,387 |
| 2 | 🥈Claude Opus 4.6 | 1,381 |
| 3 | 🥉Gemini 3 Pro | 1,362 |
| 4 | Grok 4.1 | 1,348 |
| 5 | Claude Sonnet 4.6 | 1,335 |
| 6 | o3 | 1,328 |
| 7 | Llama 4 Maverick | 1,312 |
| 8 | DeepSeek V3.1 | 1,298 |
| 9 | Qwen 3.5 397B | 1,285 |
| 10 | Mistral Large 3 | 1,271 |
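Published arena ratings are typically fit with a Bradley-Terry-style model over all votes, but the classic online Elo update shows how pairwise preferences become a rating scale. A minimal sketch; the K-factor of 4 is an arbitrary illustrative choice:

```python
def expected_win(r_a: float, r_b: float) -> float:
    """P(model A is preferred over model B) under the Elo/logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 4.0) -> tuple[float, float]:
    """Nudge both ratings after a single human vote (ties ignored for brevity)."""
    delta = k * ((1.0 if a_won else 0.0) - expected_win(r_a, r_b))
    return r_a + delta, r_b - delta

# Example: a 25-point gap (1,387 vs. 1,362) implies only a ~54% expected win rate.
print(expected_win(1387, 1362))
```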
Multi-turn conversation benchmark with 80 questions across writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities (MT-Bench).
| # | Model | Score (1-10) |
|---|---|---|
| 1 | 🥇Claude Opus 4.6 | 9.72 |
| 2 | 🥈GPT-5.2 | 9.68 |
| 3 | 🥉Gemini 3 Pro | 9.55 |
| 4 | o3 | 9.48 |
| 5 | Grok 4.1 | 9.41 |
| 6 | Claude Sonnet 4.6 | 9.35 |
| 7 | DeepSeek V3.1 | 9.22 |
| 8 | Qwen 3.5 397B | 9.15 |
| 9 | Llama 4 Maverick | 9.08 |
| 10 | Mistral Large 3 | 8.95 |
Benchmarks are one signal among many. Our composite leaderboard combines benchmark scores with Elo ratings, community sentiment, adoption metrics, pricing efficiency, and response speed for a holistic ranking.
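The exact weights and normalization behind the composite leaderboard are not given here, so the sketch below is purely illustrative: min-max normalize each signal so they share a scale, then take a weighted sum. All model names, signals, and weights in the example are hypothetical.

```python
def normalize(values: dict[str, float]) -> dict[str, float]:
    """Min-max normalize one signal so every model lands in [0, 1]."""
    lo, hi = min(values.values()), max(values.values())
    return {m: (v - lo) / (hi - lo) if hi > lo else 0.5 for m, v in values.items()}

def composite_scores(signals: dict[str, dict[str, float]],
                     weights: dict[str, float]) -> dict[str, float]:
    """Weighted sum of normalized signals (benchmarks, Elo, price, speed, ...)."""
    normed = {name: normalize(vals) for name, vals in signals.items()}
    models = next(iter(signals.values()))
    return {m: sum(weights[s] * normed[s][m] for s in weights) for m in models}

# Hypothetical two-signal, two-model example.
print(composite_scores(
    {"coding": {"Model A": 72.0, "Model B": 69.8},
     "arena_elo": {"Model A": 1381, "Model B": 1387}},
    {"coding": 0.6, "arena_elo": 0.4},
))
```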