Analyzes how confident we are in each model's ranking. Rank spreads show the range of positions a model could realistically hold, confidence levels indicate ranking precision, and stability states reflect consistency over time.
High-level summary of ranking confidence across all 293 models.
| Metric | Value | Note |
|---|---|---|
| High Confidence | 293 | 100.0% of models |
| Medium Confidence | 0 | 0.0% of models |
| Low Confidence | 0 | 0.0% of models |
| Avg Rank Spread | 4.0 | positions of uncertainty |
Breakdown of models by confidence level, showing the count and share of models at each level.
| Confidence Level | Count | % |
|---|---|---|
| High | 293 | 100.0% |
| Medium | 0 | 0.0% |
| Low | 0 | 0.0% |
Models with the tightest rank spreads. These are the rankings we are most confident about.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5.4 Pro | 90.9 | 1 | ±2 |
| 2 | GPT-5.2 Pro | 89.9 | 2 | ±3 |
| 3 | GPT-5 Pro | 89.9 | 3 | ±4 |
| 4 | o3 Pro | 81.6 | 4 | ±4 |
| 5 | Claude Opus 4.1 | 81.1 | 5 | ±4 |
| 6 | o1-pro | 77.2 | 6 | ±4 |
| 7 | Claude Opus 4 | 75.5 | 7 | ±4 |
| 8 | o3 Deep Research | 74.0 | 8 | ±4 |
| 9 | Claude Opus 4.6 | 70.5 | 9 | ±4 |
| 10 | Claude Opus 4.5 | 70.0 | 10 | ±4 |
| 11 | GPT-5.4 | 69.7 | 11 | ±4 |
| 12 | Claude Sonnet 4.5 | 69.1 | 12 | ±4 |
| 13 | Qwen3 VL 30B A3B Thinking | 68.6 | 13 | ±4 |
| 14 | Qwen3 VL 235B A22B Thinking | 68.6 | 14 | ±4 |
| 15 | GPT-5.2 | 68.4 | 15 | ±4 |
| 16 | Gemini 3.1 Pro Preview Custom Tools | 68.2 | 16 | ±4 |
| 17 | Gemini 3.1 Pro Preview | 68.2 | 17 | ±4 |
| 18 | Gemini 3 Pro Preview | 68.2 | 18 | ±4 |
| 19 | Claude Sonnet 4.6 | 68.0 | 19 | ±4 |
| 20 | GPT-5.1 | 67.4 | 20 | ±4 |
Models with the widest rank spreads. These models could be ranked very differently under slight changes.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5 Pro | 89.9 | 3 | ±4 |
| 2 | o3 Pro | 81.6 | 4 | ±4 |
| 3 | Claude Opus 4.1 | 81.1 | 5 | ±4 |
| 4 | o1-pro | 77.2 | 6 | ±4 |
| 5 | Claude Opus 4 | 75.5 | 7 | ±4 |
| 6 | o3 Deep Research | 74.0 | 8 | ±4 |
| 7 | Claude Opus 4.6 | 70.5 | 9 | ±4 |
| 8 | Claude Opus 4.5 | 70.0 | 10 | ±4 |
| 9 | GPT-5.4 | 69.7 | 11 | ±4 |
| 10 | Claude Sonnet 4.5 | 69.1 | 12 | ±4 |
| 11 | Qwen3 VL 30B A3B Thinking | 68.6 | 13 | ±4 |
| 12 | Qwen3 VL 235B A22B Thinking | 68.6 | 14 | ±4 |
| 13 | GPT-5.2 | 68.4 | 15 | ±4 |
| 14 | Gemini 3.1 Pro Preview Custom Tools | 68.2 | 16 | ±4 |
| 15 | Gemini 3.1 Pro Preview | 68.2 | 17 | ±4 |
| 16 | Gemini 3 Pro Preview | 68.2 | 18 | ±4 |
| 17 | Claude Sonnet 4.6 | 68.0 | 19 | ±4 |
| 18 | GPT-5.1 | 67.4 | 20 | ±4 |
| 19 | GPT-5.3-Codex | 66.8 | 21 | ±4 |
| 20 | GPT-5.2-Codex | 66.8 | 22 | ±4 |
Cross-tabulation of confidence level and stability state. The best combination is high confidence + stable; the worst is low confidence + fragile.
| Confidence | Stable | Held | Fragile | Preliminary |
|---|---|---|---|---|
| High | 293 | 0 | 0 | 0 |
| Medium | 0 | 0 | 0 | 0 |
| Low | 0 | 0 | 0 | 0 |
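A cross-tabulation like the one above can be produced by counting (confidence, stability) pairs. The sketch below is illustrative only: the record layout (dicts with `confidence` and `stability` keys) and the function name `cross_tab` are assumptions, not the site's actual data model.

```python
from collections import Counter

# Row and column labels as they appear in the table above.
CONFIDENCE = ("High", "Medium", "Low")
STABILITY = ("Stable", "Held", "Fragile", "Preliminary")

def cross_tab(models):
    """Count models per (confidence, stability) pair.

    `models` is a hypothetical list of dicts with "confidence" and
    "stability" keys; combinations that never occur are reported as 0.
    """
    counts = Counter((m["confidence"], m["stability"]) for m in models)
    return {c: {s: counts[(c, s)] for s in STABILITY} for c in CONFIDENCE}

if __name__ == "__main__":
    demo = [{"confidence": "High", "stability": "Stable"}] * 3
    for level, row in cross_tab(demo).items():
        print(level, row)
```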
Visual representation of ranking uncertainty for the top 30 models. The bar shows the possible rank range at 90% confidence; the marker shows the actual rank.
How ranking confidence is determined and what the metrics mean.
Rank spread is computed via bootstrap resampling of the scoring pipeline. By running thousands of simulations with slight variations, we determine the range of ranks each model could realistically hold. The spread represents the 90% confidence interval: in roughly 9 out of 10 simulated runs, the model's rank falls within this range.
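As a rough illustration of the idea, the sketch below bootstraps per-model rank intervals from per-item scores. It is a minimal example under stated assumptions, not the actual scoring pipeline: the input layout (a dict mapping each model to a list of item scores), the use of a simple mean as the aggregate, and the function name `bootstrap_rank_intervals` are all hypothetical.

```python
import random

def bootstrap_rank_intervals(scores, n_boot=2000, level=0.90, seed=0):
    """Estimate a rank interval per model by bootstrap resampling.

    scores: hypothetical dict of model name -> list of per-item scores.
    Returns a dict of model name -> (lowest_rank, highest_rank) covering
    the central `level` share of the resampled ranks.
    """
    rng = random.Random(seed)
    models = list(scores)
    ranks = {m: [] for m in models}

    for _ in range(n_boot):
        # Resample each model's items with replacement and re-aggregate.
        means = {}
        for m in models:
            items = scores[m]
            sample = [rng.choice(items) for _ in items]
            means[m] = sum(sample) / len(sample)
        # Rank 1 is the best (highest) resampled mean.
        ordered = sorted(models, key=lambda m: means[m], reverse=True)
        for position, m in enumerate(ordered, start=1):
            ranks[m].append(position)

    # Take the empirical percentiles of each model's resampled ranks.
    tail = (1.0 - level) / 2.0
    intervals = {}
    for m in models:
        r = sorted(ranks[m])
        lo = r[int(tail * (n_boot - 1))]
        hi = r[int((1.0 - tail) * (n_boot - 1))]
        intervals[m] = (lo, hi)
    return intervals

if __name__ == "__main__":
    demo = {"A": [90, 91, 92], "B": [50, 51, 52], "C": [10, 11, 12]}
    print(bootstrap_rank_intervals(demo, n_boot=200))
```

With well-separated scores, as in the demo data, every resample yields the same ordering and each interval collapses to a single rank; overlapping score distributions produce wider intervals.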
Confidence level is derived from the rank spread width. Models with tight spreads (small uncertainty) receive high confidence, meaning their ranking position is reliable. Wider spreads indicate medium or low confidence, where the model's position could shift significantly with different weighting or data updates.
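The mapping from spread to confidence level could look something like the sketch below. The cutoffs `high_max` and `medium_max` are hypothetical placeholders, since the actual thresholds are not stated here; the only constraint taken from the data above is that a spread of ±4 counts as high confidence.

```python
def confidence_level(spread, high_max=5, medium_max=15):
    """Map a rank spread (the N in ±N positions) to a confidence label.

    The cutoffs are illustrative assumptions, not published thresholds.
    """
    if spread < 0:
        raise ValueError("spread must be non-negative")
    if spread <= high_max:
        return "High"
    if spread <= medium_max:
        return "Medium"
    return "Low"

if __name__ == "__main__":
    for s in (2, 4, 10, 30):
        print(s, confidence_level(s))
```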
Stability state is a classification based on the consistency of performance metrics over time. “Stable” models show consistent rankings, “held” models maintain position with some variance, “fragile” models are prone to rank shifts, and “preliminary” models lack enough data history to assess stability.
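One way to picture such a classification is to look at how much a model's rank has varied across past snapshots. The sketch below is an assumption-laden illustration: the input (a list of historical ranks), the use of standard deviation as the consistency measure, and the cutoffs `min_history`, `stable_max`, and `held_max` are all hypothetical, not the rule actually used.

```python
from statistics import pstdev

def stability_state(rank_history, min_history=4, stable_max=1.0, held_max=3.0):
    """Classify stability from a list of historical ranks.

    All thresholds are illustrative assumptions. Models with too few
    snapshots are "Preliminary"; otherwise the label depends on the
    population standard deviation of the recorded ranks.
    """
    if len(rank_history) < min_history:
        return "Preliminary"
    variation = pstdev(rank_history)
    if variation <= stable_max:
        return "Stable"
    if variation <= held_max:
        return "Held"
    return "Fragile"

if __name__ == "__main__":
    print(stability_state([5, 5, 5, 5]))
    print(stability_state([5, 7, 9, 6, 8]))
    print(stability_state([1, 20, 3, 40]))
    print(stability_state([5]))
```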