Analyzes how confident we are in each model's ranking. Rank spreads show the range of positions a model could realistically hold, confidence levels indicate ranking precision, and stability states reflect consistency over time.
High-level summary of ranking confidence across all 300 models.
- High Confidence: 300 (100.0% of models)
- Medium Confidence: 0 (0.0% of models)
- Low Confidence: 0 (0.0% of models)
- Avg Rank Spread: 4.0 positions of uncertainty
Breakdown of models by confidence level, with counts and percentages.
| Confidence Level | Count | % |
|---|---|---|
| High | 300 | 100.0% |
| Medium | 0 | 0.0% |
| Low | 0 | 0.0% |
Models with the tightest rank spreads. These are the rankings we are most confident about.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5.4 Pro | 94.0 | 1 | ±2 |
| 2 | GPT-5.4 | 94.0 | 2 | ±3 |
| 3 | GPT-5.4 Mini | 93.3 | 3 | ±4 |
| 4 | GPT-5.2 Pro | 92.7 | 4 | ±4 |
| 5 | GPT-5.2 | 92.7 | 5 | ±4 |
| 6 | Claude Opus 4.6 | 92.1 | 6 | ±4 |
| 7 | GPT-5 Pro | 91.9 | 7 | ±4 |
| 8 | o3 Deep Research | 91.5 | 8 | ±4 |
| 9 | Claude Opus 4.5 | 90.4 | 9 | ±4 |
| 10 | Gemini 3 Pro Preview | 90.3 | 10 | ±4 |
| 11 | GPT-5 | 90.2 | 11 | ±4 |
| 12 | Gemini 3 Flash Preview | 89.4 | 12 | ±4 |
| 13 | Claude Sonnet 4.6 | 89.2 | 13 | ±4 |
| 14 | Claude Sonnet 4.5 | 89.0 | 14 | ±4 |
| 15 | o3 Pro | 87.7 | 15 | ±4 |
| 16 | Grok 4.1 Fast | 86.9 | 16 | ±4 |
| 17 | Grok 4 | 85.8 | 17 | ±4 |
| 18 | Grok 4.20 Beta | 85.7 | 18 | ±4 |
| 19 | o3 | 85.7 | 19 | ±4 |
| 20 | Gemini 3.1 Pro Preview | 85.5 | 20 | ±4 |
Models with the widest rank spreads. These positions are the most sensitive to slight changes in scoring or data.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5.4 Mini | 93.3 | 3 | ±4 |
| 2 | GPT-5.2 Pro | 92.7 | 4 | ±4 |
| 3 | GPT-5.2 | 92.7 | 5 | ±4 |
| 4 | Claude Opus 4.6 | 92.1 | 6 | ±4 |
| 5 | GPT-5 Pro | 91.9 | 7 | ±4 |
| 6 | o3 Deep Research | 91.5 | 8 | ±4 |
| 7 | Claude Opus 4.5 | 90.4 | 9 | ±4 |
| 8 | Gemini 3 Pro Preview | 90.3 | 10 | ±4 |
| 9 | GPT-5 | 90.2 | 11 | ±4 |
| 10 | Gemini 3 Flash Preview | 89.4 | 12 | ±4 |
| 11 | Claude Sonnet 4.6 | 89.2 | 13 | ±4 |
| 12 | Claude Sonnet 4.5 | 89.0 | 14 | ±4 |
| 13 | o3 Pro | 87.7 | 15 | ±4 |
| 14 | Grok 4.1 Fast | 86.9 | 16 | ±4 |
| 15 | Grok 4 | 85.8 | 17 | ±4 |
| 16 | Grok 4.20 Beta | 85.7 | 18 | ±4 |
| 17 | o3 | 85.7 | 19 | ±4 |
| 18 | Gemini 3.1 Pro Preview | 85.5 | 20 | ±4 |
| 19 | GPT-5.1 | 85.2 | 21 | ±4 |
| 20 | MiMo-V2-Omni | 85.0 | 22 | ±4 |
Cross-tabulation of confidence level and stability state. The best combination is high confidence + stable; the worst is low confidence + fragile.
| Confidence | Stable | Held | Fragile | Preliminary |
|---|---|---|---|---|
| High | 135 | 0 | 159 | 6 |
| Medium | 0 | 0 | 0 | 0 |
| Low | 0 | 0 | 0 | 0 |
Visual representation of ranking uncertainty for the top 30 models. The bar shows the possible rank range at 90% confidence; the marker shows the actual rank.
How ranking confidence is determined and what the metrics mean.
Computed via bootstrap resampling of the scoring pipeline. By running thousands of simulations with slight variations, we determine the range of ranks each model could realistically hold. The spread represents the 90% confidence interval: in 9 out of 10 cases, the model's true rank falls within this range.
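The bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the jitter model (`noise`), simulation count (`n_boot`), and function names are assumptions.

```python
import random

def bootstrap_rank_spread(scores, n_boot=2000, noise=0.5, ci=0.90):
    """Estimate each model's rank interval by re-ranking under small
    score perturbations. `noise` and `n_boot` are illustrative values,
    not the pipeline's published parameters."""
    models = list(scores)
    ranks = {m: [] for m in models}
    for _ in range(n_boot):
        # Perturb each score slightly, then re-rank (descending score).
        jittered = {m: s + random.gauss(0, noise) for m, s in scores.items()}
        ordered = sorted(models, key=lambda m: -jittered[m])
        for rank, m in enumerate(ordered, start=1):
            ranks[m].append(rank)
    # Take the central `ci` interval of each model's simulated ranks.
    lo_q, hi_q = (1 - ci) / 2, 1 - (1 - ci) / 2
    spreads = {}
    for m, rs in ranks.items():
        rs.sort()
        lo = rs[int(lo_q * len(rs))]
        hi = rs[min(int(hi_q * len(rs)), len(rs) - 1)]
        spreads[m] = (lo, hi)
    return spreads
```

A spread reported as "rank 3, ±4" corresponds to an interval like (1, 7) from a routine of this shape: in 9 out of 10 simulations the model's rank fell inside that window.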
Derived from the rank spread width. Models with tight spreads (small uncertainty) receive high confidence, meaning their ranking position is reliable. Wider spreads indicate medium or low confidence, where the model's position could shift significantly with different weighting or data updates.
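The mapping from interval width to confidence label might look like the sketch below. The cutoffs are illustrative assumptions; the report does not publish its exact thresholds.

```python
def confidence_level(rank_low, rank_high):
    """Map a 90% rank interval to a confidence label.
    Threshold values are assumed for illustration only."""
    width = rank_high - rank_low
    if width <= 8:       # e.g. a ±4-position spread
        return "High"
    if width <= 20:
        return "Medium"
    return "Low"
```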
A stability classification based on the consistency of performance metrics over time. “Stable” models show consistent rankings, “held” models maintain position with some variance, “fragile” models are prone to rank shifts, and “preliminary” models lack enough data history to assess stability.
Ranking confidence is calculated using bootstrap resampling: a statistical technique that re-runs the ranking process thousands of times with slight variations to see how stable each model's position is. Models with narrow rank spreads have high confidence, while those with wide spreads have uncertain rankings.
Rank spread is the range between a model's best and worst possible rank across bootstrap simulations. A rank spread of 2 means the model might move 1 position up or down, while a spread of 20 means its true ranking is quite uncertain.
Low confidence usually means the model scores are clustered closely together with many competitors, making the exact ordering sensitive to small measurement differences. Models in the middle of the leaderboard tend to have wider rank spreads than those at the very top or bottom.
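The effect of score clustering can be seen in a small simulation: two models separated by a large score gap almost never swap order under jitter, while near-tied models flip constantly. The noise level and trial count here are illustrative assumptions.

```python
import random

def flip_rate(gap, noise=0.5, trials=2000):
    """Fraction of trials in which two models separated by `gap` score
    points swap order under Gaussian score jitter (illustrative sketch)."""
    random.seed(42)  # deterministic for reproducibility
    flips = 0
    for _ in range(trials):
        a = gap + random.gauss(0, noise)  # higher-scored model
        b = random.gauss(0, noise)        # lower-scored model
        if b > a:
            flips += 1
    return flips / trials
```

With a 5-point gap the swap rate is effectively zero, while a 0.1-point gap swaps in nearly half of the trials, which is why densely packed mid-leaderboard models show wider rank spreads.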