AI models ranked by coding ability using SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without these benchmarks fall back to Arena Elo.
- Top model: Grok 4.20 Beta (score: 99.3)
- Average score across all ranked models: 77.8
- Models ranked with benchmark data: 95
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Grok 4.20 Beta \* | xAI | 99.3 |
| 2 | Gemini 3.1 Pro Preview \* | Google | 98.7 |
| 3 | GPT-5.2 Chat \* | OpenAI | 96.8 |
| 4 | Grok 4.1 Fast \* | xAI | 95.5 |
| 5 | GPT-5.1 \* | OpenAI | 92.7 |
| 6 | Gemini 3 Flash Preview | Google | 92 |
| 7 | Mistral Large | Mistral AI | 92 |
| 8 | Qwen3.5 397B A17B \* | Alibaba | 91.7 |
| 9 | Claude Opus 4.1 \* | Anthropic | 91.5 |
| 10 | Grok 3 | xAI | 90.5 |
| 11 | GPT-4o | OpenAI | 90.2 |
| 12 | Claude Haiku 4.5 | Anthropic | 89.8 |
| 13 | Gemini 3.1 Flash Lite Preview \* | Google | 89.5 |
| 14 | Llama 4 Maverick | Meta | 89.5 |
| 15 | Gemini 2.0 Flash | Google | 89.4 |
| 16 | Llama 3.3 70B Instruct | Meta | 88.4 |
| 17 | Claude 3.5 Haiku | Anthropic | 88.1 |
| 18 | GPT-5 Chat \* | OpenAI | 87.8 |
| 19 | GPT-5.4 | OpenAI | 87.5 |
| 20 | DeepSeek V3.2 Exp \* | DeepSeek | 87.3 |
| 21 | DeepSeek V3.2 \* | DeepSeek | 87.2 |
| 22 | GPT-4o-mini | OpenAI | 87.2 |
| 23 | GPT-4 Turbo | OpenAI | 87.1 |
| 24 | Claude Opus 4.5 | Anthropic | 87 |
| 25 | Grok 4 Fast \* | xAI | 87 |
| 26 | Qwen3.5-122B-A10B \* | Alibaba | 86.5 |
| 27 | DeepSeek V3.1 \* | DeepSeek | 86.5 |
| 28 | DeepSeek V3.1 Terminus \* | DeepSeek | 86.2 |
| 29 | GPT-5.2 | OpenAI | 86.1 |
| 30 | Qwen3 VL 235B A22B Instruct \* | Alibaba | 86 |
| 31 | Qwen3.5-27B \* | Alibaba | 85 |
| 32 | DeepSeek V3 0324 | DeepSeek | 84.5 |
| 33 | GPT-5 | OpenAI | 84.2 |
| 34 | MiniMax M2.5 \* | MiniMax | 84 |
| 35 | Claude Opus 4.6 | Anthropic | 83.9 |
| 36 | Qwen3 Next 80B A3B Instruct \* | Alibaba | 83.7 |
| 37 | LongCat Flash Chat \* | Meituan | 83.5 |
| 38 | Qwen3.5-Flash \* | Alibaba | 83.3 |
| 39 | Qwen3.5-35B-A3B \* | Alibaba | 83 |
| 40 | Qwen3 VL 235B A22B Thinking \* | Alibaba | 82.7 |
| 41 | Phi 4 | Microsoft | 82.6 |
| 42 | DeepSeek V3 | DeepSeek | 82.6 |
| 43 | Claude Opus 4 | Anthropic | 82.1 |
| 44 | GPT-5 Mini \* | OpenAI | 81.8 |
| 45 | Step 3.5 Flash \* | StepFun | 81.5 |
| 46 | Claude 3.7 Sonnet (thinking) \* | Anthropic | 81.3 |
| 47 | o3 | OpenAI | 81.1 |
| 48 | Grok 4 | xAI | 80.9 |
| 49 | Claude 3.7 Sonnet | Anthropic | 80.5 |
| 50 | Llama 3.1 70B Instruct | Meta | 80.5 |
| 51 | Gemini 3 Pro Preview | Google | 80.3 |
| 52 | o4 Mini | OpenAI | 79.6 |
| 53 | Claude Sonnet 4.5 | Anthropic | 79.4 |
| 54 | Claude Sonnet 4.6 | Anthropic | 78.9 |
| 55 | Qwen3 Next 80B A3B Thinking \* | Alibaba | 78.2 |
| 56 | MiniMax M1 \* | MiniMax | 77.8 |
| 57 | o3 Mini High \* | OpenAI | 77.3 |
| 58 | Grok 3 Mini Beta \* | xAI | 76.3 |
| 59 | Claude Sonnet 4 | Anthropic | 75.9 |
| 60 | gpt-oss-120b \* | OpenAI | 75.8 |
| 61 | Command A \* | Cohere | 75.7 |
| 62 | MiniMax M2 \* | MiniMax | 74.7 |
| 63 | Qwen3 8B \* | Alibaba | 74.7 |
| 64 | GPT-4o (2024-05-13) \* | OpenAI | 74.3 |
| 65 | Llama 3.3 Nemotron Super 49B V1.5 \* | NVIDIA | 73.7 |
| 66 | GPT-5 Nano \* | OpenAI | 73 |
| 67 | Nova 2 Lite \* | Amazon | 73 |
| 68 | QwQ 32B \* | Alibaba | 72.7 |
| 69 | GPT-4o (2024-08-06) \* | OpenAI | 72.5 |
| 70 | Olmo 3.1 32B Instruct \* | Allen AI | 71.8 |
| 71 | GPT-4.1 | OpenAI | 70.4 |
| 72 | GPT-4o-mini (2024-07-18) \* | OpenAI | 69.7 |
| 73 | gpt-oss-20b \* | OpenAI | 69.7 |
| 74 | Gemma 2 27B | Google | 69.5 |
| 75 | Claude 3.5 Sonnet | Anthropic | 69.2 |
| 76 | Mistral Large 2407 \* | Mistral AI | 69 |
| 77 | Mercury \* | Inception | 68.2 |
| 78 | Olmo 3 32B Think \* | Allen AI | 67.8 |
| 79 | o1 | OpenAI | 67.5 |
| 80 | Qwen2.5 72B Instruct \* | Alibaba | 67.2 |
| 81 | Llama 3.1 Nemotron 70B Instruct \* | NVIDIA | 66.5 |
| 82 | Olmo 3.1 32B Think \* | Allen AI | 64.2 |
| 83 | Gemini 2.5 Pro | Google | 63.8 |
| 84 | Llama 3 70B Instruct \* | Meta | 62.7 |
| 85 | Gemini 2.5 Flash | Google | 62.6 |
| 86 | Qwen2.5 Coder 32B Instruct \* | Alibaba | 61.8 |
| 87 | Claude 3 Haiku \* | Anthropic | 60.2 |
| 88 | Command R+ (08-2024) \* | Cohere | 58.3 |
| 89 | R1 0528 | DeepSeek | 57.6 |
| 90 | Llama 3 8B Instruct \* | Meta | 53.8 |
| 91 | Llama 3.1 8B Instruct \* | Meta | 52 |
| 92 | o3 Mini | OpenAI | 49.3 |
| 93 | R1 | DeepSeek | 49.2 |
| 94 | Llama 3.2 3B Instruct \* | Meta | 44.5 |
| 95 | Llama 3.2 1B Instruct \* | Meta | 35.2 |

\* Score derived from Arena Elo fallback (no primary benchmark data).
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. Models without any primary benchmark data fall back to Arena Elo (normalized to 0-100) and are marked accordingly. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
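The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual code: the benchmark weights and the Elo range used for normalization are assumptions chosen for the example, since the page does not publish them.

```python
def coding_score(benchmarks, arena_elo=None, elo_range=(1000, 1500)):
    """Weighted average over available benchmarks, with weights
    re-normalized when some benchmarks are missing; falls back to
    Arena Elo linearly mapped onto a 0-100 scale.

    Returns (score, used_fallback).
    """
    # Hypothetical weights -- the leaderboard's real weighting is not published.
    weights = {"swe_bench_verified": 0.5, "humaneval": 0.25, "bigcodebench": 0.25}

    available = {k: v for k, v in benchmarks.items() if v is not None}
    if available:
        # Re-normalize weights across only the benchmarks that exist.
        total_w = sum(weights[k] for k in available)
        score = sum(weights[k] / total_w * available[k] for k in available)
        return score, False

    # No primary benchmark data: map Arena Elo into 0-100.
    lo, hi = elo_range
    return 100 * (arena_elo - lo) / (hi - lo), True
```

For example, a model with SWE-bench Verified 80 and BigCodeBench 90 but no HumanEval result would have its 0.5 and 0.25 weights re-normalized to 2/3 and 1/3, giving a score of about 83.3; a model with no benchmarks and an Arena Elo of 1250 would land at 50 under the assumed 1000-1500 range.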
Based on our analysis, Grok 4.20 Beta by xAI is currently the #1 ranked model for coding, with a score of 99.3/100 (derived from the Arena Elo fallback, as it lacks primary benchmark data).
Models are ranked using a weighted average of SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without primary benchmark data fall back to Arena Elo. All scores are normalized to a 0-100 scale.
We currently rank 95 models for coding, using primary benchmark data where available and Arena Elo otherwise.