Ranking Confidence Explorer

This explorer shows how confident we are in each model's ranking. Rank spreads give the range of positions a model could realistically hold, confidence levels summarize ranking precision, and stability states reflect consistency over time.

Confidence Overview

High-level summary of ranking confidence across all 293 models.

High Confidence

293

100.0% of models

Medium Confidence

0

0.0% of models

Low Confidence

0

0.0% of models

Avg Rank Spread

4.0

positions of uncertainty

Confidence Distribution

Breakdown of models by confidence level with averages for score, spread, and rank.

Confidence Level	Count	%	Avg Score	Avg Spread	Avg Rank
High	293	100.0%	49.1	4.0	147
Medium	0	0.0%	—	—	—
Low	0	0.0%	—	—	—
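
The breakdown above is a group-by over the per-model rows. A minimal sketch of that aggregation follows, assuming each model is a record with `score`, `spread`, `rank`, and `confidence` fields. The record layout and the function name `summarize_by_level` are illustrative assumptions, not the site's actual schema. The sample rows are the first three entries of the "Most Precisely Ranked" table.

```python
from statistics import mean

def summarize_by_level(rows, levels=("High", "Medium", "Low")):
    """Count models per confidence level and average score, spread, and rank.

    rows: list of dicts with keys "score", "spread", "rank", "confidence"
    (a hypothetical record layout, not the site's schema).
    """
    total = len(rows)
    summary = {}
    for level in levels:
        group = [r for r in rows if r["confidence"] == level]
        # Empty groups keep None averages, matching the blank cells in the table.
        stats = {"count": len(group), "pct": 0.0,
                 "avg_score": None, "avg_spread": None, "avg_rank": None}
        if group:
            stats["pct"] = 100.0 * len(group) / total
            stats["avg_score"] = mean(r["score"] for r in group)
            stats["avg_spread"] = mean(r["spread"] for r in group)
            stats["avg_rank"] = mean(r["rank"] for r in group)
        summary[level] = stats
    return summary

# First three rows of the "Most Precisely Ranked" table.
rows = [
    {"score": 90.9, "spread": 2, "rank": 1, "confidence": "High"},
    {"score": 89.9, "spread": 3, "rank": 2, "confidence": "High"},
    {"score": 89.9, "spread": 4, "rank": 3, "confidence": "High"},
]
summary = summarize_by_level(rows)
```

Levels with no models report a count of 0 and leave the averages undefined rather than reporting them as zero, which is how the Medium and Low rows appear in the table.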

Most Precisely Ranked

Models with the tightest rank spreads. These are the rankings we are most confident about.

#	Model	Provider	Score	Rank	Spread	Confidence	State
1	GPT-5.4 Pro	OpenAI	90.9	1	±2	High	Stable
2	GPT-5.2 Pro	OpenAI	89.9	2	±3	High	Stable
3	GPT-5 Pro	OpenAI	89.9	3	±4	High	Stable
4	o3 Pro	OpenAI	81.6	4	±4	High	Stable
5	Claude Opus 4.1	Anthropic	81.1	5	±4	High	Stable
6	o1-pro	OpenAI	77.2	6	±4	High	Stable
7	Claude Opus 4	Anthropic	75.5	7	±4	High	Stable
8	o3 Deep Research	OpenAI	74.0	8	±4	High	Stable
9	Claude Opus 4.6	Anthropic	70.5	9	±4	High	Stable
10	Claude Opus 4.5	Anthropic	70.0	10	±4	High	Stable
11	GPT-5.4	OpenAI	69.7	11	±4	High	Stable
12	Claude Sonnet 4.5	Anthropic	69.1	12	±4	High	Stable
13	Qwen3 VL 30B A3B Thinking	Alibaba	68.6	13	±4	High	Stable
14	Qwen3 VL 235B A22B Thinking	Alibaba	68.6	14	±4	High	Stable
15	GPT-5.2	OpenAI	68.4	15	±4	High	Stable
16	Gemini 3.1 Pro Preview Custom Tools	Google	68.2	16	±4	High	Stable
17	Gemini 3.1 Pro Preview	Google	68.2	17	±4	High	Stable
18	Gemini 3 Pro Preview	Google	68.2	18	±4	High	Stable
19	Claude Sonnet 4.6	Anthropic	68.0	19	±4	High	Stable
20	GPT-5.1	OpenAI	67.4	20	±4	High	Stable

Most Uncertain Rankings

Models with the widest rank spreads. These models could be ranked very differently under slight changes.

#	Model	Provider	Score	Rank	Spread	Confidence	State
1	GPT-5 Pro	OpenAI	89.9	3	±4	High	Stable
2	o3 Pro	OpenAI	81.6	4	±4	High	Stable
3	Claude Opus 4.1	Anthropic	81.1	5	±4	High	Stable
4	o1-pro	OpenAI	77.2	6	±4	High	Stable
5	Claude Opus 4	Anthropic	75.5	7	±4	High	Stable
6	o3 Deep Research	OpenAI	74.0	8	±4	High	Stable
7	Claude Opus 4.6	Anthropic	70.5	9	±4	High	Stable
8	Claude Opus 4.5	Anthropic	70.0	10	±4	High	Stable
9	GPT-5.4	OpenAI	69.7	11	±4	High	Stable
10	Claude Sonnet 4.5	Anthropic	69.1	12	±4	High	Stable
11	Qwen3 VL 30B A3B Thinking	Alibaba	68.6	13	±4	High	Stable
12	Qwen3 VL 235B A22B Thinking	Alibaba	68.6	14	±4	High	Stable
13	GPT-5.2	OpenAI	68.4	15	±4	High	Stable
14	Gemini 3.1 Pro Preview Custom Tools	Google	68.2	16	±4	High	Stable
15	Gemini 3.1 Pro Preview	Google	68.2	17	±4	High	Stable
16	Gemini 3 Pro Preview	Google	68.2	18	±4	High	Stable
17	Claude Sonnet 4.6	Anthropic	68.0	19	±4	High	Stable
18	GPT-5.1	OpenAI	67.4	20	±4	High	Stable
19	GPT-5.3-Codex	OpenAI	66.8	21	±4	High	Stable
20	GPT-5.2-Codex	OpenAI	66.8	22	±4	High	Stable

State × Confidence Matrix

Cross-tabulation of confidence level and stability state. The best combination is high confidence + stable; the worst is low confidence + fragile.

Confidence	Stable	Held	Fragile	Preliminary
High	293	0	0	0
Medium	0	0	0	0
Low	0	0	0	0
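
A cross-tabulation like the one above can be built by counting (confidence, state) pairs. The sketch below uses the row and column labels from the matrix. The function name `cross_tabulate` and the input format are assumptions made for illustration.

```python
from collections import Counter

CONFIDENCE_LEVELS = ["High", "Medium", "Low"]
STATES = ["Stable", "Held", "Fragile", "Preliminary"]

def cross_tabulate(records):
    """Return a matrix of counts: rows are confidence levels, columns are states.

    records: iterable of (confidence, state) pairs, one per model
    (a hypothetical input format).
    """
    counts = Counter(records)
    # Counter returns 0 for pairs that never occur, so empty cells need no special case.
    return [[counts[(c, s)] for s in STATES] for c in CONFIDENCE_LEVELS]

# All 293 models are High confidence and Stable, as in the matrix above.
matrix = cross_tabulate([("High", "Stable")] * 293)
```

With the current data every model falls in the High/Stable cell, so the first row is `[293, 0, 0, 0]` and the other rows are all zeros.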

Rank Spread Visualization

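Visual representation of ranking uncertainty for the top 30 models. The bar shows the possible rank range at 90% confidence; the marker shows the actual rank. The bar's width equals the Spread value in the tables above: for example, o3 Pro at rank 4 with a spread of 4 spans ranks 2–6.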

Rank	Model	Range (90%)
1	GPT-5.4 Pro	1–3
2	GPT-5.2 Pro	1–4
3	GPT-5 Pro	1–5
4	o3 Pro	2–6
5	Claude Opus 4.1	3–7
6	o1-pro	4–8
7	Claude Opus 4	5–9
8	o3 Deep Research	6–10
9	Claude Opus 4.6	7–11
10	Claude Opus 4.5	8–12
11	GPT-5.4	9–13
12	Claude Sonnet 4.5	10–14
13	Qwen3 VL 30B A3B Thinking	11–15
14	Qwen3 VL 235B A22B Thinking	12–16
15	GPT-5.2	13–17
16	Gemini 3.1 Pro Preview Custom Tools	14–18
17	Gemini 3.1 Pro Preview	15–19
18	Gemini 3 Pro Preview	16–20
19	Claude Sonnet 4.6	17–21
20	GPT-5.1	18–22
21	GPT-5.3-Codex	19–23
22	GPT-5.2-Codex	20–24
23	GPT-5	21–25
24	Gemini 3 Flash Preview	22–26
25	o4 Mini Deep Research	23–27
26	GPT-5.1-Codex-Max	24–28
27	Gemini 3.1 Flash Lite Preview	25–29
28	Gemini 2.5 Pro	26–30
29	Gemini 2.5 Flash Lite Preview 09-2025	27–31
30	o1	28–32

Axis: Rank 1 to Rank 32

What Affects Confidence

How ranking confidence is determined and what the metrics mean.

Rank Spread

Computed via bootstrap resampling of the scoring pipeline. By running thousands of simulations with slight variations, we determine the range of ranks each model could realistically hold. The spread is the width of the 90% interval of those simulated ranks: in roughly 9 out of 10 resampled runs, the model's rank falls within this range.
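
To make the idea concrete, here is a minimal sketch of a percentile bootstrap over ranks. It assumes each model has one score per benchmark and that the resampling unit is the benchmark; the actual pipeline's inputs and resampling scheme are not described here, so the function name `bootstrap_rank_intervals`, its parameters, and the toy data are all illustrative assumptions.

```python
import random

def bootstrap_rank_intervals(scores, n_boot=2000, level=0.90, seed=0):
    """Estimate a rank interval per model by resampling benchmarks.

    scores: dict mapping model name -> list of per-benchmark scores,
            all lists the same length (hypothetical input layout).
    Returns a dict mapping model name -> (low_rank, high_rank).
    """
    rng = random.Random(seed)
    models = list(scores)
    n_items = len(next(iter(scores.values())))
    rank_samples = {m: [] for m in models}

    for _ in range(n_boot):
        # Draw benchmark indices with replacement; every model sees the same draw.
        idx = [rng.randrange(n_items) for _ in range(n_items)]
        means = {m: sum(scores[m][i] for i in idx) / n_items for m in models}
        # Rank 1 goes to the highest mean score in this resample.
        ordered = sorted(models, key=lambda m: means[m], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            rank_samples[m].append(rank)

    # Percentile interval: cut (1 - level) / 2 of the samples from each tail.
    alpha = (1.0 - level) / 2.0
    intervals = {}
    for m, ranks in rank_samples.items():
        ranks.sort()
        low = ranks[int(alpha * (n_boot - 1))]
        high = ranks[int((1.0 - alpha) * (n_boot - 1))]
        intervals[m] = (low, high)
    return intervals

# Toy data: model "A" beats the others on every benchmark.
toy_scores = {
    "A": [90, 91, 92, 93],
    "B": [70, 72, 71, 69],
    "C": [71, 70, 72, 70],
}
intervals = bootstrap_rank_intervals(toy_scores)
spreads = {m: high - low for m, (low, high) in intervals.items()}
```

Because "A" wins on every benchmark, every resample puts it first and its interval collapses to (1, 1), a spread of 0. "B" and "C" trade places depending on which benchmarks are drawn, so their intervals can be wider.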

Confidence Level

Derived from the rank spread width. Models with tight spreads (small uncertainty) receive high confidence, meaning their ranking position is reliable. Wider spreads indicate medium or low confidence, where the model's position could shift significantly with different weighting or data updates.
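
The mapping from spread width to confidence level can be sketched as a simple threshold rule. The cutoffs are not stated on this page, so the values `high_max=5` and `medium_max=15` below are placeholder assumptions chosen only so that a spread of 4 maps to High, as it does in the tables above.

```python
def confidence_level(spread, high_max=5, medium_max=15):
    """Map a rank-spread width to a confidence label.

    The thresholds are illustrative assumptions, not the site's published cutoffs.
    """
    if spread < 0:
        raise ValueError("spread must be non-negative")
    if spread <= high_max:
        return "High"
    if spread <= medium_max:
        return "Medium"
    return "Low"

labels = [confidence_level(s) for s in (2, 4, 10, 20)]
```

Under these assumed cutoffs, spreads of 2 and 4 are labeled High, 10 is Medium, and 20 is Low.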

State

A stability classification based on the consistency of performance metrics over time. “Stable” models show consistent rankings, “held” models maintain position with some variance, “fragile” models are prone to rank shifts, and “preliminary” models lack enough data history to assess stability.
