AI models ranked by coding ability using SWE-bench Verified, HumanEval, and BigCodeBench scores; models without these benchmarks fall back to Arena Elo.

- Top model: Mistral Large (score: 92)
- Average score: 70.2 across all ranked models
- Models ranked: 58 with benchmark data
| # | Model | Developer | Score |
|---|---|---|---|
| 1 | Mistral Large | Mistral AI | 92 |
| 2 | Gemini 2.5 Pro Preview 05-06 (fallback) | Google | 91 |
| 3 | GPT-4o | OpenAI | 90.2 |
| 4 | Gemini 2.0 Flash | Google | 89.4 |
| 5 | Llama 3.3 70B Instruct | Meta | 88.4 |
| 6 | Claude 3.5 Haiku | Anthropic | 88.1 |
| 7 | o3 (fallback) | OpenAI | 88 |
| 8 | Qwen3.5 397B A17B (fallback) | Alibaba | 87.8 |
| 9 | Grok 3 (fallback) | xAI | 87.2 |
| 10 | GPT-4o-mini | OpenAI | 87.2 |
| 11 | GPT-4 Turbo | OpenAI | 87.1 |
| 12 | Claude Opus 4.5 | Anthropic | 87 |
| 13 | Llama 4 Maverick (fallback) | Meta | 86 |
| 14 | Qwen3 235B A22B Thinking 2507 (fallback) | Alibaba | 85.5 |
| 15 | Gemini 2.5 Flash (fallback) | Google | 85.2 |
| 16 | DeepSeek V3 | DeepSeek | 82.6 |
| 17 | Claude 3.7 Sonnet | Anthropic | 80.5 |
| 18 | Llama 3.1 70B Instruct | Meta | 80.5 |
| 19 | GPT-4o (2024-11-20) (fallback) | OpenAI | 80.2 |
| 20 | GPT-4.1 (fallback) | OpenAI | 80 |
| 21 | Qwen3 30B A3B Instruct 2507 (fallback) | Alibaba | 80 |
| 22 | DeepSeek V3 0324 (fallback) | DeepSeek | 79.3 |
| 23 | Mistral Medium 3 (fallback) | Mistral AI | 78.3 |
| 24 | Claude Opus 4.6 (fallback) | Anthropic | 77.7 |
| 25 | Qwen3 235B A22B (fallback) | Alibaba | 77.5 |
| 26 | Grok 3 Mini Beta (fallback) | xAI | 76.7 |
| 27 | MiniMax M1 (fallback) | MiniMax | 75.7 |
| 28 | Qwen3 32B (fallback) | Alibaba | 73.5 |
| 29 | Llama 3.3 Nemotron Super 49B V1.5 (fallback) | NVIDIA | 73.2 |
| 30 | o3 Mini High (fallback) | OpenAI | 73 |
| 31 | Claude Sonnet 4.6 (fallback) | Anthropic | 73 |
| 32 | QwQ 32B (fallback) | Alibaba | 71.8 |
| 33 | Gemini 2.0 Flash Lite (fallback) | Google | 71.7 |
| 34 | Qwen3 30B A3B (fallback) | Alibaba | 70 |
| 35 | Claude 3.5 Sonnet | Anthropic | 69.2 |
| 36 | o1 | OpenAI | 67.5 |
| 37 | GPT-4o (2024-05-13) (fallback) | OpenAI | 67 |
| 38 | GPT-4o-mini (2024-07-18) (fallback) | OpenAI | 64.7 |
| 39 | GPT-4o (2024-08-06) (fallback) | OpenAI | 64.2 |
| 40 | Llama 3.1 Nemotron 70B Instruct (fallback) | NVIDIA | 64 |
| 41 | Gemini 2.5 Pro | Google | 63.8 |
| 42 | Qwen2.5 72B Instruct (fallback) | Alibaba | 62 |
| 43 | Mistral Large 2407 (fallback) | Mistral AI | 61.5 |
| 44 | Nova Pro 1.0 (fallback) | Amazon | 60.2 |
| 45 | Qwen2.5 Coder 32B Instruct (fallback) | Alibaba | 55.7 |
| 46 | Command R+ (08-2024) (fallback) | Cohere | 55.5 |
| 47 | Nova Lite 1.0 (fallback) | Amazon | 55.5 |
| 48 | Llama 3 70B Instruct (fallback) | Meta | 54 |
| 49 | Phi 4 (fallback) | Microsoft | 53.7 |
| 50 | Nova Micro 1.0 (fallback) | Amazon | 52.3 |
| 51 | Claude 3 Haiku (fallback) | Anthropic | 50 |
| 52 | o3 Mini | OpenAI | 49.3 |
| 53 | R1 | DeepSeek | 49.2 |
| 54 | Llama 3.1 8B Instruct (fallback) | Meta | 48.8 |
| 55 | Llama 3 8B Instruct (fallback) | Meta | 45.2 |
| 56 | Llama 3.2 3B Instruct (fallback) | Meta | 36.2 |
| 57 | Llama 3.2 1B Instruct (fallback) | Meta | 27.8 |
| 58 | Mistral 7B Instruct v0.1 (fallback) | Mistral AI | 21.8 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. Models without any primary benchmark data fall back to Arena Elo (normalized to 0-100) and are marked "(fallback)" in the table. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
Based on our benchmark analysis, Mistral Large by Mistral AI is currently the #1 ranked model for coding, with a weighted score of 92/100.
Models are ranked using a weighted average of SWE-bench Verified, HumanEval, and BigCodeBench scores. Models without primary benchmark data fall back to Arena Elo. All scores are normalized to a 0-100 scale.
We currently rank 58 models that have relevant benchmark data for coding tasks.