Human preference rating from 6M+ crowdsourced blind head-to-head comparisons. Users chat with two anonymous models and pick the better response.
Why it matters: The most trusted 'vibes-based' benchmark — reflects real human preferences, not just academic metrics. Widely considered the most meaningful overall ranking.
Top Model
1,446
Gemini 2.5 Pro Preview 05-06
Average Score
1,298
Across 65 models
Models Tested
65
Metric: Elo rating
Human Baseline
—
Range: 900–1600
All models with a reported Arena Elo score, ranked from highest to lowest rating.
Arena Elo is not a fixed test set: it is a relative skill rating computed from blind, pairwise human votes. Each head-to-head win, loss, or tie nudges both models' ratings, placing every model on the same scale. This makes scores directly comparable across models and helps developers gauge how a model's responses fare against the field in open-ended chat.
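To make the mechanics concrete, here is a minimal sketch of a standard logistic Elo update applied to a single head-to-head vote. The K-factor and the 400-point scale are conventional Elo defaults, not the exact parameters LMArena uses (its published method is closer to the Bradley-Terry model fit over all votes at once):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """Update both ratings after one comparison.

    outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (outcome - e_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return r_a_new, r_b_new
```

Note that rating points are zero-sum: whatever A gains, B loses, which is why the ratings in the table cluster in a band (here roughly 900-1600) rather than growing without bound.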
Gemini 2.5 Pro Preview 05-06 currently holds the top score on the Arena Elo benchmark. See our full rankings table above for the complete leaderboard with 65 models.
We update benchmark data from multiple sources including HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While Arena Elo is an important indicator, real-world performance depends on many factors, including pricing, latency, context window, and specific task requirements. We recommend using our composite score, which weights multiple benchmarks and practical factors.
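A composite score of this kind is typically a weighted average over normalized metrics. The sketch below is purely illustrative: the metric names and weights are assumptions, not the site's actual formula.

```python
def composite_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of whichever normalized (0-1) metrics are present.

    Missing metrics are skipped and the remaining weights renormalized,
    so models lacking a benchmark are not penalized to zero for it.
    """
    keys = [k for k in metrics if k in weights]
    total_w = sum(weights[k] for k in keys)
    return sum(metrics[k] * weights[k] for k in keys) / total_w

# Hypothetical weighting: quality metrics dominate, practical factors count too.
weights = {"arena_elo": 0.4, "mmlu": 0.3, "latency": 0.15, "price": 0.15}
```

The renormalization step is the key design choice: without it, a model missing one benchmark would be silently dragged down as if it had scored zero there.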