AI models ranked by reasoning ability using GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam scores.
- Top-ranked model: GPT-4o (score: 96.4)
- Average score across all ranked models: 33.3
- Models with benchmark data: 17
| # | Model | Vendor | Score |
|---|---|---|---|
| 1 | GPT-4o | OpenAI | 96.4 |
| 2 | GPT-4o-mini | OpenAI | 96.4 |
| 3 | Llama 3.1 70B Instruct | Meta | 94.8 |
| 4 | Gemini 2.5 Pro | Google | 35.2 |
| 5 | Claude Opus 4.5 | Anthropic | 32.1 |
| 6 | Qwen2.5 72B Instruct | Alibaba | 31.7 |
| 7 | Phi 4 | Microsoft | 31.6 |
| 8 | Qwen2.5 Coder 32B Instruct | Alibaba | 26.2 |
| 9 | o3 Mini | OpenAI | 20.3 |
| 10 | Gemma 2 9B | Google | 18.4 |
| 11 | Command R7B (12-2024) | Cohere | 17.2 |
| 12 | Qwen2.5 7B Instruct | Alibaba | 15.3 |
| 13 | Llama 3.1 8B Instruct | Meta | 15.2 |
| 14 | Qwen2.5 Coder 7B Instruct | Alibaba | 13.5 |
| 15 | Llama 3.2 3B Instruct | Meta | 12.2 |
| 16 | Llama 3 8B Instruct | Meta | 7.5 |
| 17 | QwQ 32B | Alibaba | 1.9 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
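The re-normalization step is the part most easily gotten wrong, so here is a minimal Python sketch of the scoring rule described above. The equal benchmark weights are an assumption for illustration (the actual weights are not published here), and the function name `weighted_score` is hypothetical.

```python
# Hypothetical equal weights -- the actual benchmark weighting is not published.
WEIGHTS = {
    "GPQA": 0.25,
    "ARC-Challenge": 0.25,
    "BIG-Bench Hard": 0.25,
    "Humanity's Last Exam": 0.25,
}

def weighted_score(results: dict[str, float]) -> float:
    """Weighted average over a model's available benchmarks (0-100 scale).

    Missing benchmarks are skipped, and the weights of the remaining
    benchmarks are re-normalized so they still sum to 1.
    """
    available = {b: s for b, s in results.items() if b in WEIGHTS}
    if not available:
        raise ValueError("no benchmark data for this model")
    total_weight = sum(WEIGHTS[b] for b in available)
    return sum(WEIGHTS[b] * s for b, s in available.items()) / total_weight

# Example: a model with only two of the four benchmarks reported.
# (0.25 * 50.0 + 0.25 * 70.0) / (0.25 + 0.25) = 60.0
print(weighted_score({"GPQA": 50.0, "BIG-Bench Hard": 70.0}))  # -> 60.0
```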
Based on our benchmark analysis, GPT-4o by OpenAI is currently the #1 ranked model for reasoning, with a weighted score of 96.4/100.
Models are ranked using a weighted average of their GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam benchmark scores. All scores are normalized to a 0-100 scale.
We currently rank 17 models that have relevant benchmark data for reasoning tasks.