AI models ranked by coding ability using SWE-bench Verified, HumanEval, and BigCodeBench scores; models without these benchmarks fall back to Arena Elo.

- Top model: Mistral Large (score: 92)
- Average score: 70.2 across all ranked models
- Models ranked: 58 with benchmark data
| # | Model | Developer | Score |
|---|---|---|---|
| 1 | Mistral Large | Mistral AI | 92 |
| 2 | Gemini 2.5 Pro Preview 05-06 (fallback) | Google | 91 |
| 3 | GPT-4o | OpenAI | 90.2 |
| 4 | Gemini 2.0 Flash | Google | 89.4 |
| 5 | Llama 3.3 70B Instruct | Meta | 88.4 |
| 6 | Claude 3.5 Haiku | Anthropic | 88.1 |
| 7 | o3 (fallback) | OpenAI | 88 |
| 8 | Qwen3.5 397B A17B (fallback) | Alibaba | 87.8 |
| 9 | Grok 3 (fallback) | xAI | 87.2 |
| 10 | GPT-4o-mini | OpenAI | 87.2 |
| 11 | GPT-4 Turbo | OpenAI | 87.1 |
| 12 | Claude Opus 4.5 | Anthropic | 87 |
| 13 | Llama 4 Maverick (fallback) | Meta | 86 |
| 14 | Qwen3 235B A22B Thinking 2507 (fallback) | Alibaba | 85.5 |
| 15 | Gemini 2.5 Flash (fallback) | Google | 85.2 |
| 16 | DeepSeek V3 | DeepSeek | 82.6 |
| 17 | Claude 3.7 Sonnet | Anthropic | 80.5 |
| 18 | Llama 3.1 70B Instruct | Meta | 80.5 |
| 19 | GPT-4o (2024-11-20) (fallback) | OpenAI | 80.2 |
| 20 | GPT-4.1 (fallback) | OpenAI | 80 |
| 21 | Qwen3 30B A3B Instruct 2507 (fallback) | Alibaba | 80 |
| 22 | DeepSeek V3 0324 (fallback) | DeepSeek | 79.3 |
| 23 | Mistral Medium 3 (fallback) | Mistral AI | 78.3 |
| 24 | Claude Opus 4.6 (fallback) | Anthropic | 77.7 |
| 25 | Qwen3 235B A22B (fallback) | Alibaba | 77.5 |
| 26 | Grok 3 Mini Beta (fallback) | xAI | 76.7 |
| 27 | MiniMax M1 (fallback) | MiniMax | 75.7 |
| 28 | Qwen3 32B (fallback) | Alibaba | 73.5 |
| 29 | Llama 3.3 Nemotron Super 49B V1.5 (fallback) | NVIDIA | 73.2 |
| 30 | o3 Mini High (fallback) | OpenAI | 73 |
| 31 | Claude Sonnet 4.6 (fallback) | Anthropic | 73 |
| 32 | QwQ 32B (fallback) | Alibaba | 71.8 |
| 33 | Gemini 2.0 Flash Lite (fallback) | Google | 71.7 |
| 34 | Qwen3 30B A3B (fallback) | Alibaba | 70 |
| 35 | Claude 3.5 Sonnet | Anthropic | 69.2 |
| 36 | o1 | OpenAI | 67.5 |
| 37 | GPT-4o (2024-05-13) (fallback) | OpenAI | 67 |
| 38 | GPT-4o-mini (2024-07-18) (fallback) | OpenAI | 64.7 |
| 39 | GPT-4o (2024-08-06) (fallback) | OpenAI | 64.2 |
| 40 | Llama 3.1 Nemotron 70B Instruct (fallback) | NVIDIA | 64 |
| 41 | Gemini 2.5 Pro | Google | 63.8 |
| 42 | Qwen2.5 72B Instruct (fallback) | Alibaba | 62 |
| 43 | Mistral Large 2407 (fallback) | Mistral AI | 61.5 |
| 44 | Nova Pro 1.0 (fallback) | Amazon | 60.2 |
| 45 | Qwen2.5 Coder 32B Instruct (fallback) | Alibaba | 55.7 |
| 46 | Command R+ (08-2024) (fallback) | Cohere | 55.5 |
| 47 | Nova Lite 1.0 (fallback) | Amazon | 55.5 |
| 48 | Llama 3 70B Instruct (fallback) | Meta | 54 |
| 49 | Phi 4 (fallback) | Microsoft | 53.7 |
| 50 | Nova Micro 1.0 (fallback) | Amazon | 52.3 |
| 51 | Claude 3 Haiku (fallback) | Anthropic | 50 |
| 52 | o3 Mini | OpenAI | 49.3 |
| 53 | R1 | DeepSeek | 49.2 |
| 54 | Llama 3.1 8B Instruct (fallback) | Meta | 48.8 |
| 55 | Llama 3 8B Instruct (fallback) | Meta | 45.2 |
| 56 | Llama 3.2 3B Instruct (fallback) | Meta | 36.2 |
| 57 | Llama 3.2 1B Instruct (fallback) | Meta | 27.8 |
| 58 | Mistral 7B Instruct v0.1 (fallback) | Mistral AI | 21.8 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. Models without any primary benchmark data fall back to Arena Elo (normalized to 0-100) and are marked "(fallback)" in the table. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
Based on our benchmark analysis, Mistral Large by Mistral AI is currently the #1 ranked model for coding, with a weighted score of 92/100.
Models are ranked using a weighted average of SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without primary benchmark data fall back to Arena Elo. All scores are normalized to a 0-100 scale.
We currently rank 58 models that have relevant benchmark data for coding tasks.