Detect when AI models may be getting worse. This tracker flags models with declining rankings, fragile states, and sustained performance drops across 293 tracked models.
Models at Risk: 0
Declining (7d): 0
Fragile: 0
Sustained Decline: 0
0 models showing signs of degradation, ranked by risk score. Higher risk scores indicate more concerning performance trends.
No models at risk
All tracked models are performing within expected parameters.
293 models with no decline and a stable ranking state. These models are performing consistently.
| # | Model | Provider | Score | 24h rank change | 7d rank change | State |
|---|---|---|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 90.9 | 0 | 0 | stable |
| 2 | GPT-5.2 Pro | OpenAI | 89.9 | 0 | 0 | stable |
| 3 | GPT-5 Pro | OpenAI | 89.9 | 0 | 0 | stable |
| 4 | o3 Pro | OpenAI | 81.6 | 0 | 0 | stable |
| 5 | Claude Opus 4.1 | Anthropic | 81.1 | 0 | 0 | stable |
| 6 | o1-pro | OpenAI | 77.2 | 0 | 0 | stable |
| 7 | Claude Opus 4 | Anthropic | 75.5 | 0 | 0 | stable |
| 8 | o3 Deep Research | OpenAI | 74.0 | 0 | 0 | stable |
| 9 | Claude Opus 4.6 | Anthropic | 70.5 | 0 | 0 | stable |
| 10 | Claude Opus 4.5 | Anthropic | 70.0 | 0 | 0 | stable |
| 11 | GPT-5.4 | OpenAI | 69.7 | 0 | 0 | stable |
| 12 | Claude Sonnet 4.5 | Anthropic | 69.1 | 0 | 0 | stable |
| 13 | Qwen3 VL 30B A3B Thinking | Alibaba | 68.6 | 0 | 0 | stable |
| 14 | Qwen3 VL 235B A22B Thinking | Alibaba | 68.6 | 0 | 0 | stable |
| 15 | GPT-5.2 | OpenAI | 68.4 | 0 | 0 | stable |
| 16 | Gemini 3.1 Pro Preview Custom Tools | Google | 68.2 | 0 | 0 | stable |
| 17 | Gemini 3.1 Pro Preview | Google | 68.2 | 0 | 0 | stable |
| 18 | Gemini 3 Pro Preview | Google | 68.2 | 0 | 0 | stable |
| 19 | Claude Sonnet 4.6 | Anthropic | 68.0 | 0 | 0 | stable |
| 20 | GPT-5.1 | OpenAI | 67.4 | 0 | 0 | stable |
Showing top 20 of 293 stable models.
Our degradation detection system uses multiple signals to identify models that may be declining in quality or reliability.
Declining (7d): Models whose 7-day rank change is below -2, meaning they have fallen more than two positions in a week. A drop of that size suggests the model may be losing ground to competitors or experiencing performance issues.
Fragile: Models classified as "fragile" by our scoring system. These models have inconsistent performance metrics or borderline scores that could shift significantly with small changes in evaluation data.
Sustained Decline: Models declining on both the 24-hour and 7-day timeframes. When a model is losing rank on both the short and medium-term windows, it indicates a persistent downward trend rather than a temporary fluctuation.
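The three signals above can be sketched as simple boolean checks. This is only an illustration, not the tracker's actual code: the `ModelSnapshot` type, its field names, and the `degradation_signals` function are hypothetical, and it assumes a negative rank change means the model lost positions.

```python
from dataclasses import dataclass


@dataclass
class ModelSnapshot:
    """Hypothetical record for one tracked model (field names are assumptions)."""
    name: str
    change_24h: int  # rank change over 24 hours; negative = lost positions (assumed)
    change_7d: int   # rank change over 7 days; negative = lost positions (assumed)
    state: str       # "stable" or "fragile"


def degradation_signals(m: ModelSnapshot) -> dict:
    """Return which of the three described signals fire for a model."""
    return {
        # 7-day rank change worse than -2 positions
        "declining_7d": m.change_7d < -2,
        # classified as fragile by the scoring system
        "fragile": m.state == "fragile",
        # losing rank on both the 24-hour and 7-day windows
        "sustained_decline": m.change_24h < 0 and m.change_7d < 0,
    }


def is_at_risk(m: ModelSnapshot) -> bool:
    """Assumption: a model counts as 'at risk' if any signal fires."""
    return any(degradation_signals(m).values())
```

Under these assumptions, every row in the table above (0 change on both windows, state "stable") triggers no signal, which matches the reported count of 0 models at risk.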
The degradation risk score combines these signals: the 7-day rank decline weighted 2x, the 24-hour rank decline weighted 1x, plus 5 bonus points for a fragile state. Higher scores indicate greater risk of meaningful performance degradation.
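As a rough sketch of that weighting, the score could be computed as below. The function name `risk_score` and its parameters are hypothetical. It also assumes that a negative rank change means a decline and that improvements contribute nothing rather than lowering the score; the page does not state either point.

```python
def risk_score(change_24h: int, change_7d: int, state: str) -> int:
    """Illustrative degradation risk score under the stated assumptions."""
    # Convert signed rank changes into non-negative decline magnitudes.
    # Assumption: gains in rank are clamped to zero instead of offsetting risk.
    decline_7d = max(0, -change_7d)
    decline_24h = max(0, -change_24h)

    # 5 bonus points when the model is in the fragile state.
    fragile_bonus = 5 if state == "fragile" else 0

    # 7-day decline weighted 2x, 24-hour decline weighted 1x.
    return 2 * decline_7d + 1 * decline_24h + fragile_bonus


# A fragile model that fell 4 ranks over 7 days and 1 rank over 24 hours:
print(risk_score(change_24h=-1, change_7d=-4, state="fragile"))  # 2*4 + 1 + 5 = 14
```

A stable model with no rank change scores 0 under this sketch, consistent with the empty at-risk list above.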