Which AI models are the most consistent over time? This report analyzes rank changes, state classifications, and sparkline volatility across 293 tracked models to produce a stability score from 0 to 100.
| State | Models |
|---|---|
| Rock Solid | 293 |
| Consistent | 0 |
| Variable | 0 |
| Volatile | 0 |
Top 20 models with the highest stability scores. These models maintain consistent rankings with minimal volatility.
| # | Model | Provider | Score | Stability | 24h | 7d |
|---|---|---|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 90.9 | 100 | 0 | 0 |
| 2 | GPT-5.2 Pro | OpenAI | 89.9 | 100 | 0 | 0 |
| 3 | GPT-5 Pro | OpenAI | 89.9 | 100 | 0 | 0 |
| 4 | o3 Pro | OpenAI | 81.6 | 100 | 0 | 0 |
| 5 | Claude Opus 4.1 | Anthropic | 81.1 | 100 | 0 | 0 |
| 6 | o1-pro | OpenAI | 77.2 | 100 | 0 | 0 |
| 7 | Claude Opus 4 | Anthropic | 75.5 | 100 | 0 | 0 |
| 8 | o3 Deep Research | OpenAI | 74.0 | 100 | 0 | 0 |
| 9 | Claude Opus 4.6 | Anthropic | 70.5 | 100 | 0 | 0 |
| 10 | Claude Opus 4.5 | Anthropic | 70.0 | 100 | 0 | 0 |
| 11 | GPT-5.4 | OpenAI | 69.7 | 100 | 0 | 0 |
| 12 | Claude Sonnet 4.5 | Anthropic | 69.1 | 100 | 0 | 0 |
| 13 | Qwen3 VL 30B A3B Thinking | Alibaba | 68.6 | 100 | 0 | 0 |
| 14 | Qwen3 VL 235B A22B Thinking | Alibaba | 68.6 | 100 | 0 | 0 |
| 15 | GPT-5.2 | OpenAI | 68.4 | 100 | 0 | 0 |
| 16 | Gemini 3.1 Pro Preview Custom Tools | Google | 68.2 | 100 | 0 | 0 |
| 17 | Gemini 3.1 Pro Preview | Google | 68.2 | 100 | 0 | 0 |
| 18 | Gemini 3 Pro Preview | Google | 68.2 | 100 | 0 | 0 |
| 19 | Claude Sonnet 4.6 | Anthropic | 68.0 | 100 | 0 | 0 |
| 20 | GPT-5.1 | OpenAI | 67.4 | 100 | 0 | 0 |
Bottom 20 models with the lowest stability scores. These models show significant ranking fluctuations or inconsistent states.
| # | Model | Provider | Score | Stability | 24h | 7d |
|---|---|---|---|---|---|---|
| 1 | Mistral 7B Instruct v0.1 | Mistral AI | 17.2 | 100 | 0 | 0 |
| 2 | LlamaGuard 2 8B | Meta | 20.1 | 100 | 0 | 0 |
| 3 | Gemma 2 9B | Google | 21.3 | 100 | 0 | 0 |
| 4 | GPT-3.5 Turbo Instruct | OpenAI | 25.6 | 100 | 0 | 0 |
| 5 | Llama 3.2 1B Instruct | Meta | 25.9 | 100 | 0 | 0 |
| 6 | WizardLM-2 8x22B | Microsoft | 26.1 | 100 | 0 | 0 |
| 7 | Llama 3.2 3B Instruct | Meta | 26.2 | 100 | 0 | 0 |
| 8 | Llama 3 70B Instruct | Meta | 27.7 | 100 | 0 | 0 |
| 9 | Gemma 2 27B | Google | 29.0 | 100 | 0 | 0 |
| 10 | GPT-3.5 Turbo (older v0613) | OpenAI | 29.2 | 100 | 0 | 0 |
| 11 | Mistral Large | Mistral AI | 29.7 | 100 | 0 | 0 |
| 12 | Qwen2.5-VL 7B Instruct | Alibaba | 29.7 | 100 | 0 | 0 |
| 13 | Inflection 3 Productivity | Inflection | 29.7 | 100 | 0 | 0 |
| 14 | Inflection 3 Pi | Inflection | 29.7 | 100 | 0 | 0 |
| 15 | Mixtral 8x22B Instruct | Mistral AI | 30.2 | 100 | 0 | 0 |
| 16 | Llama 3.1 405B (base) | Meta | 30.2 | 100 | 0 | 0 |
| 17 | GPT-3.5 Turbo | OpenAI | 30.5 | 100 | 0 | 0 |
| 18 | Llama Guard 3 8B | Meta | 30.5 | 100 | 0 | 0 |
| 19 | GPT-3.5 Turbo 16k | OpenAI | 31.1 | 100 | 0 | 0 |
| 20 | Qwen2.5 Coder 32B Instruct | Alibaba | 31.1 | 100 | 0 | 0 |
Aggregated stability metrics per provider. Providers are ranked by their average stability score across all models.
| Provider | Models | Avg Stability |
|---|---|---|
| OpenAI | 59 | 100.0 |
| Anthropic | 13 | 100.0 |
| Alibaba | 51 | 100.0 |
| Google | 24 | 100.0 |
| NVIDIA | 8 | 100.0 |
| xAI | 8 | 100.0 |
| ByteDance | 4 | 100.0 |
| Perplexity | 5 | 100.0 |
| Amazon | 5 | 100.0 |
| Moonshot AI | 5 | 100.0 |
| StepFun | 2 | 100.0 |
| MiniMax | 6 | 100.0 |
| arcee-ai | 7 | 100.0 |
| Xiaomi | 1 | 100.0 |
| DeepSeek | 12 | 100.0 |
| Mistral AI | 25 | 100.0 |
| Inception | 3 | 100.0 |
| Meta | 17 | 100.0 |
| Baidu | 5 | 100.0 |
| Kuaishou | 1 | 100.0 |
| Meituan | 1 | 100.0 |
| AI21 Labs | 1 | 100.0 |
| Allen AI | 7 | 100.0 |
| Tencent | 1 | 100.0 |
| Upstage | 1 | 100.0 |
| Liquid AI | 5 | 100.0 |
| aion-labs | 3 | 100.0 |
| Writer | 1 | 100.0 |
| deepcogito | 1 | 100.0 |
| Cohere | 4 | 100.0 |
| essentialai | 1 | 100.0 |
| IBM | 1 | 100.0 |
| Microsoft | 2 | 100.0 |
| eleutherai | 1 | 100.0 |
| Inflection | 2 | 100.0 |
How stability scores are distributed across all 293 tracked models.
Our stability scoring system uses three key signals to measure how consistently a model performs over time.
The most direct measure of stability. Models lose up to 25 points for large 24-hour rank changes (5 points per rank position moved) and up to 21 points for 7-day changes (3 points per position). Models that hold their rank tightly score higher.
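The rank-change penalty above can be sketched directly from the stated caps. This is a minimal illustration, assuming the penalty is linear in absolute rank movement and capped per window; the function name is ours, not the system's:

```python
def rank_change_penalty(delta_24h: int, delta_7d: int) -> int:
    """Points deducted for rank movement, per the stated caps:
    5 points per rank moved in 24h (max 25), 3 points per rank
    moved over 7 days (max 21)."""
    p24 = min(abs(delta_24h) * 5, 25)  # 24-hour window, capped at 25
    p7 = min(abs(delta_7d) * 3, 21)    # 7-day window, capped at 21
    return p24 + p7

# A model that moved 2 ranks in 24h and 4 ranks over 7d:
# min(10, 25) + min(12, 21) = 22 points deducted
```

A model that never moves loses nothing from this signal, which is why every entry in the tables above shows 0 in both the 24h and 7d columns.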
Each model has a state reflecting its overall reliability. Models in a "stable" state receive a 10-point bonus, while "fragile" models are penalized 15 points. This captures systemic reliability beyond simple rank movement.
The 14-day sparkline data reveals hidden volatility. We compute the standard deviation of the sparkline and subtract up to 20 points. Even models that end where they started can be penalized if they oscillated wildly along the way.
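Putting the three signals together, a hedged sketch of the full score might look like the following. The state-to-adjustment mapping and the clamping to the 0-100 range are assumptions inferred from the prose, not a confirmed implementation:

```python
import statistics

# Assumed mapping from the prose: "stable" earns +10, "fragile" costs 15.
STATE_ADJUSTMENT = {"stable": +10, "fragile": -15}

def stability_score(delta_24h: int, delta_7d: int,
                    state: str, sparkline: list[float]) -> float:
    """Combine rank-change penalties, state adjustment, and sparkline
    volatility into a 0-100 stability score (illustrative sketch)."""
    score = 100.0
    # Signal 1: rank-change penalties (5/rank capped at 25; 3/rank capped at 21)
    score -= min(abs(delta_24h) * 5, 25)
    score -= min(abs(delta_7d) * 3, 21)
    # Signal 2: state bonus or penalty
    score += STATE_ADJUSTMENT.get(state, 0)
    # Signal 3: sparkline volatility, subtracting up to 20 points of stddev
    if len(sparkline) >= 2:
        score -= min(statistics.stdev(sparkline), 20)
    return max(0.0, min(100.0, score))
```

For example, a model that held its rank in a "stable" state with a flat 14-day sparkline would score the full 100, while the same model in a "fragile" state would score 85 even without moving a single rank.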