Which AI models are the most consistent over time? This report analyzes rank changes, state classifications, and sparkline volatility across 300 tracked models to produce a stability score from 0 to 100.
| Tier | Models |
|---|---|
| Rock Solid | 45 |
| Consistent | 70 |
| Variable | 56 |
| Volatile | 129 |
Top 20 models with the highest stability scores. These models maintain consistent rankings with minimal volatility.
| # | Model | Provider | Score | Stability | 24h Δ | 7d Δ |
|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 94.0 | 100 | 0 | -1 |
| 2 | Llemma 7B | EleutherAI | 47.5 | 100 | 0 | -1 |
| 3 | GPT-3.5 Turbo (older v0613) | OpenAI | 38.0 | 100 | -1 | 0 |
| 4 | Inflection 3 Productivity | Inflection | 36.8 | 100 | +1 | 0 |
| 5 | Mellum | JetBrains | 32.6 | 100 | +1 | 0 |
| 6 | QwQ 32B | Alibaba | 47.1 | 100 | 0 | -3 |
| 7 | Seed 1.6 Flash | ByteDance | 85.0 | 100 | 0 | -3 |
| 8 | Olmo 3 7B Instruct | Allen AI | 69.0 | 99 | 0 | -3 |
| 9 | Mistral Large 2411 | Mistral AI | 49.9 | 99 | 0 | -3 |
| 10 | Gemma 3n 4B | Google | 46.3 | 98 | 0 | +3 |
| 11 | Command R+ (08-2024) | Cohere | 47.8 | 98 | 0 | +3 |
| 12 | Llama 3.2 3B Instruct (free) | Meta | 35.3 | 98 | -2 | 0 |
| 13 | GPT-5.4 Pro | OpenAI | 94.0 | 97 | 0 | +3 |
| 14 | Granite 4.0 Micro | IBM | 55.1 | 96 | -2 | -1 |
| 15 | GPT-3.5 Turbo 16k | OpenAI | 39.9 | 96 | -1 | +2 |
| 16 | Nemotron Nano 12B 2 VL | NVIDIA | 72.6 | 96 | +1 | +2 |
| 17 | Nemotron 3 Super (free) | NVIDIA | 84.1 | 96 | +1 | +2 |
| 18 | Llama 3.1 405B (base) | Meta | 38.7 | 96 | -2 | -1 |
| 19 | GPT-4o (2024-05-13) | OpenAI | 52.7 | 96 | -2 | -1 |
| 20 | Qwen2.5 VL 32B Instruct | Alibaba | 56.8 | 95 | +1 | -3 |
Bottom 20 models with the lowest stability scores. These models show significant ranking fluctuations or inconsistent states.
| # | Model | Provider | Score | Stability | 24h Δ | 7d Δ |
|---|---|---|---|---|---|---|
| 1 | Devstral 2 2512 | Mistral AI | 67.7 | 35 | -12 | +10 |
| 2 | MiniMax M1 | MiniMax | 68.5 | 35 | -10 | +15 |
| 3 | Claude Sonnet 4 | Anthropic | 79.9 | 35 | -10 | +19 |
| 4 | UI-TARS 7B | ByteDance | 62.7 | 35 | +15 | +8 |
| 5 | GPT-4 Turbo | OpenAI | 60.5 | 35 | -10 | +10 |
| 6 | Molmo2 8B | Allen AI | 67.6 | 35 | -9 | +15 |
| 7 | Mistral Large 3 2512 | Mistral AI | 73.5 | 35 | +19 | +9 |
| 8 | MiniMax M2.5 (free) | MiniMax | 83.4 | 35 | +11 | +11 |
| 9 | LFM2.5-1.2B-Thinking (free) | Liquid AI | 59.0 | 35 | +7 | +12 |
| 10 | LFM2-24B-A2B | Liquid AI | 53.2 | 35 | -7 | +9 |
| 11 | gpt-oss-20b (free) | OpenAI | 73.9 | 35 | +21 | +17 |
| 12 | o4 Mini Deep Research | OpenAI | 85.0 | 36 | -5 | +16 |
| 13 | GPT-5 Nano | OpenAI | 75.7 | 36 | -5 | +22 |
| 14 | Seed 1.6 | ByteDance | 85.0 | 36 | -7 | +10 |
| 15 | Seed-2.0-Mini | ByteDance | 85.0 | 36 | -7 | +7 |
| 16 | Seed-2.0-Lite | ByteDance | 85.0 | 36 | -8 | +11 |
| 17 | GPT-4.1 Nano | OpenAI | 80.8 | 36 | +12 | +14 |
| 18 | GPT-5.3 Chat | OpenAI | 85.0 | 36 | +30 | +16 |
| 19 | Qwen Plus 0728 | Alibaba | 77.0 | 36 | -8 | +18 |
| 20 | GPT Audio Mini | OpenAI | 68.4 | 36 | +10 | +10 |
Aggregated stability metrics per provider. Providers are ranked by their average stability score across all models.
| Provider | Models | Avg Stability |
|---|---|---|
| eleutherai | 1 | 100.0 |
| JetBrains | 1 | 100.0 |
| Inflection | 2 | 96.5 |
| IBM | 1 | 96.1 |
| Windsurf | 1 | 82.8 |
| essentialai | 1 | 77.8 |
| AI21 Labs | 1 | 75.7 |
| Microsoft | 1 | 74.4 |
| Cohere | 4 | 74.3 |
| Allen AI | 7 | 71.2 |
| aion-labs | 3 | 69.5 |
| Vercel | 1 | 69.5 |
| Meta | 13 | 65.4 |
| Moonshot AI | 4 | 65.1 |
| Amazon | 5 | 64.8 |
| Google | 23 | 63.0 |
| OpenAI | 59 | 62.7 |
| Anthropic | 13 | 60.0 |
| MiniMax | 8 | 60.0 |
| Perplexity | 5 | 59.6 |
| Mistral AI | 25 | 59.4 |
| NVIDIA | 11 | 59.4 |
| Baidu | 5 | 56.9 |
| Alibaba | 51 | 55.8 |
| arcee-ai | 7 | 54.0 |
| xAI | 10 | 50.8 |
| Meituan | 1 | 50.6 |
| Liquid AI | 5 | 49.4 |
| DeepSeek | 11 | 49.4 |
| ByteDance | 5 | 48.3 |
| Xiaomi | 3 | 46.2 |
| Kuaishou | 1 | 45.1 |
| Cursor | 2 | 45.1 |
| Inception | 3 | 38.2 |
| Upstage | 1 | 38.2 |
| Tencent | 1 | 37.5 |
| StepFun | 2 | 37.4 |
| deepcogito | 1 | 36.0 |
| Writer | 1 | 35.9 |
How stability scores are distributed across all 300 tracked models.
Our stability scoring system uses three key signals to measure how consistently a model performs over time.
Rank changes are the most direct measure of stability. Models lose up to 25 points for 24-hour rank changes (5 points per position moved) and up to 21 points for 7-day changes (3 points per position). Models that hold their rank tightly score higher.
Each model has a state reflecting its overall reliability. Models in a "stable" state receive a 10-point bonus, while "fragile" models are penalized 15 points. This captures systemic reliability beyond simple rank movement.
The 14-day sparkline data reveals hidden volatility. We compute the standard deviation of the sparkline and subtract up to 20 points. Even models that end where they started can be penalized if they oscillated wildly along the way.
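As a minimal sketch of the volatility signal: compute the standard deviation of a 14-day sparkline of rank positions and cap the resulting penalty at 20. The report states only the cap, so the scale (1 penalty point per unit of standard deviation) and the sample sparkline below are assumptions for illustration.

```python
from statistics import pstdev

# Hypothetical 14-day sparkline of daily rank positions.
sparkline = [12, 12, 14, 11, 13, 12, 15, 10, 12, 13, 11, 12, 14, 12]

# Population standard deviation of the sparkline.
std = pstdev(sparkline)

# Assumed scaling: 1 penalty point per unit of standard deviation,
# capped at 20 as described above.
volatility_penalty = min(20.0, std)
```

A model that ends the window at the same rank it started can still accumulate a sizable penalty here if its daily positions swung widely in between.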
The stability score starts at 100 and is reduced based on three factors: 24-hour rank changes (up to -25 points, at 5 per position moved), 7-day rank changes (up to -21 points, at 3 per position), and sparkline volatility measured by standard deviation (up to -20 points). Models in a "stable" state get a +10 bonus, while "fragile" models lose 15 points.
Models are classified into four tiers based on their stability score: "Rock Solid" (85-100) means extremely consistent performance with minimal fluctuation. "Consistent" (70-84) means generally reliable with minor variations. "Variable" (50-69) shows noticeable ranking fluctuations. "Volatile" (below 50) indicates significant instability and unpredictable performance.
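The tier boundaries above map directly to a threshold lookup; a straightforward rendering:

```python
def stability_tier(score: float) -> str:
    """Classify a stability score into the four tiers defined above."""
    if score >= 85:
        return "Rock Solid"   # extremely consistent, minimal fluctuation
    if score >= 70:
        return "Consistent"   # generally reliable, minor variations
    if score >= 50:
        return "Variable"     # noticeable ranking fluctuations
    return "Volatile"         # significant instability
```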
Stability indicates how predictably a model will perform over time. A highly rated but volatile model may deliver inconsistent results, which is problematic for production applications requiring reliable output quality. Stable models provide more predictable performance, making them safer choices for mission-critical workloads even if they do not always hold the top rank.