Last updated: just now

Instruction benchmark

Instruction Following Evaluation

Tests whether models follow explicit, verifiable constraints like 'write in more than 400 words' or 'mention AI at least 3 times'. All instructions have objectively verifiable criteria.

Why it matters: Measures instruction-following precision, critical for production applications. Models that score well here are more reliable in structured tasks.

Top Model

92.3%

Claude 3.7 Sonnet

Average Score

70.6%

Across 24 models

Models Tested

Metric: prompt-level accuracy

Human Baseline

—

Range: 0%–100%

Model Rankings

All models with a reported IFEval score, ranked by highest prompt-level accuracy.

RankModelScorePerformance

Claude 3.7 SonnetAnthropic

92.3%

92%

92.3%

Llama 3.3 70BMeta

92.1%

92%

92.1%

Claude 3.5 SonnetAnthropic

88.1%

88%

88.1%

Llama 3.1 405BMeta

87.5%

88%

87.5%

DeepSeek V3DeepSeek

87.1%

87%

87.1%

o1OpenAI

86.5%

87%

86.5%

Qwen2.5 72B InstructAlibaba

86.4%

86%

86.4%

GPT-4oOpenAI

84.3%

84%

84.3%

Llama 3.1 70BMeta

83.6%

84%

83.6%

#10

Qwen 2.5 72BAlibaba

83.5%

84%

83.5%

#11

Mistral Large 2Mistral AI

82.4%

82%

82.4%

#12

GPT-4o miniOpenAI

80.4%

80%

80.4%

#13

Command R7B (12-2024)Cohere

77.1%

77%

77.1%

#14

Command R+Cohere

76.6%

77%

76.6%

#15

Qwen2.5 7B InstructAlibaba

75.8%

76%

75.8%

#16

Qwen2.5 Coder 32B InstructAlibaba

72.7%

73%

72.7%

#17

Llama 3.1 8B InstructMeta

72.0%

72%

72.0%

#18

Llama 3.2 3B InstructMeta

68.5%

68%

68.5%

#19

Qwen2.5 Coder 7B InstructAlibaba

61.5%

61%

61.5%

#20

Gemma 2 9BGoogle

58.8%

59%

58.8%

#21

QwQ 32BAlibaba

39.8%

40%

39.8%

#22

Runway Gen-3 AlphaRunway

28.0%

28%

28.0%

#23

Llama 3 8B InstructMeta

24.0%

24%

24.0%

#24

Phi 4Microsoft

5.8%

About IFEval

Full Name: Instruction Following Evaluation
Category: Instruction
Metric: prompt-level accuracy (%)
Score Range: 0%–100%
Human Baseline: Not established
Status: Active

Frequently Asked Questions

IFEval is a standardized evaluation that measures AI model performance on specific tasks. It provides comparable scores across different models, helping developers choose the right model for their needs.

Claude 3.7 Sonnet currently holds the top score on the IFEval benchmark. See our full rankings table above for the complete leaderboard with 24 models.

We update benchmark data from multiple sources including HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

No. While IFEval is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.

All Benchmarks|Instruction Benchmarks|Compare Models

Instruction Following Evaluation

Tests whether models follow explicit, verifiable constraints like 'write in more than 400 words' or 'mention AI at least 3 times'. All instructions have objectively verifiable criteria.

Why it matters: Measures instruction-following precision, critical for production applications. Models that score well here are more reliable in structured tasks.

Model Rankings

All models with a reported IFEval score, ranked by highest prompt-level accuracy.