Analyzes how confident we are in each model's ranking. Rank spreads show the range of positions a model could realistically hold, confidence levels indicate ranking precision, and stability states reflect consistency over time.
High-level summary of ranking confidence across all 300 models.
- High Confidence: 300 (100.0% of models)
- Medium Confidence: 0 (0.0% of models)
- Low Confidence: 0 (0.0% of models)
- Avg Rank Spread: 4.0 positions of uncertainty
Breakdown of models by confidence level, with counts and percentages.
| Confidence Level | Count | % |
|---|---|---|
| High | 300 | 100.0% |
| Medium | 0 | 0.0% |
| Low | 0 | 0.0% |
Models with the tightest rank spreads. These are the rankings we are most confident about.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5.4 Pro | 94.0 | 1 | ±2 |
| 2 | GPT-5.4 | 94.0 | 2 | ±3 |
| 3 | GPT-5.4 Mini | 93.3 | 3 | ±4 |
| 4 | GPT-5.2 Pro | 92.7 | 4 | ±4 |
| 5 | GPT-5.2 | 92.7 | 5 | ±4 |
| 6 | Claude Opus 4.6 | 92.1 | 6 | ±4 |
| 7 | GPT-5 Pro | 91.9 | 7 | ±4 |
| 8 | o3 Deep Research | 91.5 | 8 | ±4 |
| 9 | Claude Opus 4.5 | 90.4 | 9 | ±4 |
| 10 | Gemini 3 Pro Preview | 90.3 | 10 | ±4 |
| 11 | GPT-5 | 90.2 | 11 | ±4 |
| 12 | Gemini 3 Flash Preview | 89.4 | 12 | ±4 |
| 13 | Claude Sonnet 4.6 | 89.2 | 13 | ±4 |
| 14 | Claude Sonnet 4.5 | 89.0 | 14 | ±4 |
| 15 | o3 Pro | 87.7 | 15 | ±4 |
| 16 | Grok 4.1 Fast | 86.9 | 16 | ±4 |
| 17 | Grok 4 | 85.8 | 17 | ±4 |
| 18 | Grok 4.20 Beta | 85.7 | 18 | ±4 |
| 19 | o3 | 85.7 | 19 | ±4 |
| 20 | Gemini 3.1 Pro Preview | 85.5 | 20 | ±4 |
Models with the widest rank spreads. These positions are the most sensitive to slight changes in scoring or data.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5.4 Mini | 93.3 | 3 | ±4 |
| 2 | GPT-5.2 Pro | 92.7 | 4 | ±4 |
| 3 | GPT-5.2 | 92.7 | 5 | ±4 |
| 4 | Claude Opus 4.6 | 92.1 | 6 | ±4 |
| 5 | GPT-5 Pro | 91.9 | 7 | ±4 |
| 6 | o3 Deep Research | 91.5 | 8 | ±4 |
| 7 | Claude Opus 4.5 | 90.4 | 9 | ±4 |
| 8 | Gemini 3 Pro Preview | 90.3 | 10 | ±4 |
| 9 | GPT-5 | 90.2 | 11 | ±4 |
| 10 | Gemini 3 Flash Preview | 89.4 | 12 | ±4 |
| 11 | Claude Sonnet 4.6 | 89.2 | 13 | ±4 |
| 12 | Claude Sonnet 4.5 | 89.0 | 14 | ±4 |
| 13 | o3 Pro | 87.7 | 15 | ±4 |
| 14 | Grok 4.1 Fast | 86.9 | 16 | ±4 |
| 15 | Grok 4 | 85.8 | 17 | ±4 |
| 16 | Grok 4.20 Beta | 85.7 | 18 | ±4 |
| 17 | o3 | 85.7 | 19 | ±4 |
| 18 | Gemini 3.1 Pro Preview | 85.5 | 20 | ±4 |
| 19 | GPT-5.1 | 85.2 | 21 | ±4 |
| 20 | MiMo-V2-Omni | 85.0 | 22 | ±4 |
Cross-tabulation of confidence level and stability state. The best combination is high confidence + stable; the worst is low confidence + fragile.
| Confidence | Stable | Held | Fragile | Preliminary |
|---|---|---|---|---|
| High | 135 | 0 | 159 | 6 |
| Medium | 0 | 0 | 0 | 0 |
| Low | 0 | 0 | 0 | 0 |
Visual representation of ranking uncertainty for the top 30 models. The bar shows the possible rank range at 90% confidence; the marker shows the actual rank.
How ranking confidence is determined and what the metrics mean.
Computed via bootstrap resampling of the scoring pipeline. By running thousands of simulations with slight variations, we determine the range of ranks each model could realistically hold. The spread represents the 90% confidence interval: in 9 out of 10 cases, the model's true rank falls within this range.
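The bootstrap procedure described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the jitter model (`noise`), simulation count (`n_boot`), and function names are assumptions.

```python
import random

def bootstrap_rank_spread(scores, n_boot=2000, noise=0.5, ci=0.90):
    """Estimate each model's rank interval by re-ranking under small
    score perturbations. `noise` and `n_boot` are illustrative values,
    not the pipeline's published parameters."""
    models = list(scores)
    ranks = {m: [] for m in models}
    for _ in range(n_boot):
        # Perturb each score slightly, then re-rank (descending score).
        jittered = {m: s + random.gauss(0, noise) for m, s in scores.items()}
        ordered = sorted(models, key=lambda m: -jittered[m])
        for rank, m in enumerate(ordered, start=1):
            ranks[m].append(rank)
    # Take the central `ci` interval of each model's simulated ranks.
    lo_q, hi_q = (1 - ci) / 2, 1 - (1 - ci) / 2
    spreads = {}
    for m, rs in ranks.items():
        rs.sort()
        lo = rs[int(lo_q * len(rs))]
        hi = rs[min(int(hi_q * len(rs)), len(rs) - 1)]
        spreads[m] = (lo, hi)
    return spreads
```

A spread reported as "rank 3, ±4" corresponds to an interval like (1, 7) from a routine of this shape: in 9 out of 10 simulations the model's rank fell inside that window.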
Derived from the rank spread width. Models with tight spreads (small uncertainty) receive high confidence, meaning their ranking position is reliable. Wider spreads indicate medium or low confidence, where the model's position could shift significantly with different weighting or data updates.
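The mapping from interval width to confidence label might look like the sketch below. The cutoffs are illustrative assumptions; the report does not publish its exact thresholds.

```python
def confidence_level(rank_low, rank_high):
    """Map a 90% rank interval to a confidence label.
    Threshold values are assumed for illustration only."""
    width = rank_high - rank_low
    if width <= 8:       # e.g. a ±4-position spread
        return "High"
    if width <= 20:
        return "Medium"
    return "Low"
```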
A stability classification based on the consistency of performance metrics over time. “Stable” models show consistent rankings, “held” models maintain position with some variance, “fragile” models are prone to rank shifts, and “preliminary” models lack enough data history to assess stability.
Ranking confidence is calculated using bootstrap resampling: a statistical technique that re-runs the ranking process thousands of times with slight variations to see how stable each model's position is. Models with narrow rank spreads have high confidence, while those with wide spreads have uncertain rankings.
Rank spread is the range between a model's best and worst possible rank across bootstrap simulations. A rank spread of 2 means the model might move 1 position up or down, while a spread of 20 means its true ranking is quite uncertain.
Low confidence usually means the model scores are clustered closely together with many competitors, making the exact ordering sensitive to small measurement differences. Models in the middle of the leaderboard tend to have wider rank spreads than those at the very top or bottom.
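The effect of score clustering can be seen in a small simulation: two models separated by a large score gap almost never swap order under jitter, while near-tied models flip constantly. The noise level and trial count here are illustrative assumptions.

```python
import random

def flip_rate(gap, noise=0.5, trials=2000):
    """Fraction of trials in which two models separated by `gap` score
    points swap order under Gaussian score jitter (illustrative sketch)."""
    random.seed(42)  # deterministic for reproducibility
    flips = 0
    for _ in range(trials):
        a = gap + random.gauss(0, noise)  # higher-scored model
        b = random.gauss(0, noise)        # lower-scored model
        if b > a:
            flips += 1
    return flips / trials
```

With a 5-point gap the swap rate is effectively zero, while a 0.1-point gap swaps in nearly half of the trials, which is why densely packed mid-leaderboard models show wider rank spreads.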