AI models ranked by instruction-following accuracy using the IFEval benchmark.
- **Top model:** Claude 3.7 Sonnet (score: 92.3)
- **Models ranked across all tasks:** 71
- **Models with benchmark data for this task:** 20
| # | Model | Vendor | Score |
|---|---|---|---|
| 1 | Claude 3.7 Sonnet | Anthropic | 92.3 |
| 2 | Llama 3.3 70B Instruct | Meta | 92.1 |
| 3 | Claude 3.5 Sonnet | Anthropic | 88.1 |
| 4 | DeepSeek V3 | DeepSeek | 87.1 |
| 5 | o1 | OpenAI | 86.5 |
| 6 | Qwen2.5 72B Instruct | Alibaba | 86.4 |
| 7 | GPT-4o | OpenAI | 84.3 |
| 8 | Llama 3.1 70B Instruct | Meta | 83.6 |
| 9 | Mistral Large | Mistral AI | 82.4 |
| 10 | GPT-4o-mini | OpenAI | 80.4 |
| 11 | Command R7B (12-2024) | Cohere | 77.1 |
| 12 | Qwen2.5 7B Instruct | Alibaba | 75.9 |
| 13 | Qwen2.5 Coder 32B Instruct | Alibaba | 72.7 |
| 14 | Llama 3.1 8B Instruct | Meta | 72.1 |
| 15 | Llama 3.2 3B Instruct | Meta | 68.5 |
| 16 | Qwen2.5 Coder 7B Instruct | Alibaba | 61.5 |
| 17 | Gemma 2 9B | Google | 58.8 |
| 18 | QwQ 32B | Alibaba | 39.8 |
| 19 | Llama 3 8B Instruct | Meta | 24 |
| 20 | Phi 4 | Microsoft | 5.9 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
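The aggregation described above can be sketched in a few lines: compute a weighted average over whichever benchmarks a model has results for, re-normalizing the weights so they still sum to 1 over the available set. The benchmark names and weight values below are illustrative assumptions, not the site's actual configuration.

```python
# Sketch of the scoring rule: weighted average over available benchmarks,
# with weights re-normalized when some benchmarks are missing.
# Benchmark names and weights here are hypothetical examples.

def weighted_score(results, weights):
    """results: {benchmark: score or None}; weights: {benchmark: weight}.

    Missing benchmarks (None) are dropped, and the remaining weights
    are re-normalized so they sum to 1 before averaging.
    """
    available = {b: s for b, s in results.items() if s is not None}
    total_weight = sum(weights[b] for b in available)
    return sum(weights[b] * s for b, s in available.items()) / total_weight

# Assumed two-benchmark weighting for illustration only.
weights = {"ifeval_strict": 0.6, "ifeval_loose": 0.4}

# A model with both results gets the plain weighted average;
# a model missing one benchmark is scored on the other alone.
full = weighted_score({"ifeval_strict": 90.0, "ifeval_loose": 80.0}, weights)
partial = weighted_score({"ifeval_strict": 92.3, "ifeval_loose": None}, weights)
print(full, partial)
```

With both benchmarks present, `full` is the ordinary weighted average (86.0 here); with one missing, the re-normalization makes `partial` equal the single available score.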
Based on our benchmark analysis, Claude 3.7 Sonnet by Anthropic is currently the #1 ranked model for instruction following, with a weighted score of 92.3/100.
Models are ranked using a weighted average of IFEval benchmark scores. All scores are normalized to a 0-100 scale.
We currently rank 20 models that have relevant benchmark data for instruction-following tasks.