
Knowledge benchmark

MMLU Professional

A harder version of MMLU with reasoning-focused questions and 10 answer choices per question instead of 4. It contains 12,000+ questions across 14 domains.

Why it matters: MMLU-Pro differentiates top models better than standard MMLU because scores are 16-33% lower, which leaves more room between the strongest models. It tests reasoning in addition to knowledge.
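One reason the larger answer set matters is that it lowers the score a model can reach by guessing. The sketch below illustrates this, assuming uniform random guessing and that every question has exactly the stated number of options. The function names `chance_accuracy` and `accuracy` are illustrative and not part of any official MMLU-Pro tooling.

```python
from fractions import Fraction


def chance_accuracy(num_choices: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a multiple-choice question."""
    if num_choices < 1:
        raise ValueError("num_choices must be at least 1")
    return Fraction(1, num_choices)


def accuracy(predictions, answers) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Compare the guessing floor for 4 options (MMLU) and 10 options (MMLU-Pro).
for name, num_choices in (("MMLU", 4), ("MMLU-Pro", 10)):
    print(f"{name}: chance accuracy = {float(chance_accuracy(num_choices)):.0%}")
```

Under these assumptions the guessing floor falls from 25% to 10%, so a low score on MMLU-Pro is less likely to be inflated by lucky guesses.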

Top Model: DeepSeek R1 (84%)
Average Score: 44.1% across 19 models
Models Tested: 19
Metric: accuracy
Human Baseline: not established
Range: 0%–100%

Model Rankings

All models with a reported MMLU-Pro score, ranked by highest accuracy.

Rank  Model                       Provider   Score
1     DeepSeek R1                 DeepSeek   84%
2     DeepSeek V3                 DeepSeek   75.9%
3     Claude 3.7 Sonnet           Anthropic  74%
4     GPT-4o                      OpenAI     72.6%
5     Qwen 2.5 72B                Alibaba    71.1%
6     Llama 3.3 70B               Meta       68.9%
7     Qwen2.5 72B Instruct        Alibaba    51.4%
8     Phi 4                       Microsoft  47.6%
9     Qwen2.5 Coder 32B Instruct  Alibaba    37.9%
10    Qwen2.5 7B Instruct         Alibaba    36.5%
11    Command R+                  Cohere     33.2%
12    Gemma 2 9B                  Google     33.1%
13    Llama 3.1 8B Instruct       Meta       30.4%
14    Command R7B (12-2024)       Cohere     28.6%
15    Qwen2.5 Coder 7B Instruct   Alibaba    26.2%
16    Llama 3.2 3B Instruct       Meta       23.7%
17    Runway Gen-3 Alpha          Runway     22.3%
18    Llama 3 8B Instruct         Meta       17.8%
19    QwQ 32B                     Alibaba    2.2%
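The ranking and the reported average can be reproduced from the scores listed above. The following is a minimal sketch, assuming the average is a simple unweighted mean of the 19 accuracy values. The page does not state how the average is computed, and the helper names are illustrative.

```python
from statistics import mean

# (model, provider, accuracy in percent), copied from the rankings above.
SCORES = [
    ("DeepSeek R1", "DeepSeek", 84.0),
    ("DeepSeek V3", "DeepSeek", 75.9),
    ("Claude 3.7 Sonnet", "Anthropic", 74.0),
    ("GPT-4o", "OpenAI", 72.6),
    ("Qwen 2.5 72B", "Alibaba", 71.1),
    ("Llama 3.3 70B", "Meta", 68.9),
    ("Qwen2.5 72B Instruct", "Alibaba", 51.4),
    ("Phi 4", "Microsoft", 47.6),
    ("Qwen2.5 Coder 32B Instruct", "Alibaba", 37.9),
    ("Qwen2.5 7B Instruct", "Alibaba", 36.5),
    ("Command R+", "Cohere", 33.2),
    ("Gemma 2 9B", "Google", 33.1),
    ("Llama 3.1 8B Instruct", "Meta", 30.4),
    ("Command R7B (12-2024)", "Cohere", 28.6),
    ("Qwen2.5 Coder 7B Instruct", "Alibaba", 26.2),
    ("Llama 3.2 3B Instruct", "Meta", 23.7),
    ("Runway Gen-3 Alpha", "Runway", 22.3),
    ("Llama 3 8B Instruct", "Meta", 17.8),
    ("QwQ 32B", "Alibaba", 2.2),
]


def rank_by_accuracy(rows):
    """Return rows sorted from highest to lowest accuracy."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


def average_accuracy(rows):
    """Unweighted mean accuracy, rounded to one decimal place."""
    return round(mean(score for _, _, score in rows), 1)


ranked = rank_by_accuracy(SCORES)
top_model, _, top_score = ranked[0]
print(f"Top model: {top_model} ({top_score}%)")
print(f"Average across {len(SCORES)} models: {average_accuracy(SCORES)}%")
```

With an unweighted mean, the listed scores sum to 837.4 over 19 models, which rounds to the 44.1% average shown in the summary.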

About MMLU-Pro

Full Name: MMLU Professional
Category: Knowledge
Metric: accuracy (%)
Score Range: 0%–100%
Human Baseline: Not established
Status: Active

Frequently Asked Questions

MMLU-Pro is a standardized multiple-choice benchmark that measures AI model accuracy on reasoning-focused questions across 14 domains, with 10 answer choices per question. It provides comparable scores across different models, helping developers choose the right model for their needs.

DeepSeek R1 currently holds the top score on the MMLU-Pro benchmark. See our full rankings table above for the complete leaderboard with 19 models.

We draw benchmark data from multiple sources, including the Hugging Face Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

MMLU-Pro should not be the only factor when choosing a model. While it is an important indicator, real-world suitability depends on many factors, including pricing, latency, context window, and specific task requirements. We recommend using our composite score, which weighs multiple benchmarks and practical factors.
