Not all rank changes are meaningful. Some are random noise. This page uses statistical analysis to tell you which model score movements are real trends vs. normal fluctuation, so you know which changes to pay attention to.
| Models Analyzed | Significant Changes | Noise (Not Significant) | Both Timeframes |
|---|---|---|---|
| 300 | 32 | 268 | 125 |
32 models whose recent scores deviate far enough from their historical average to be considered a real change rather than noise, sorted by how extreme the change is. The z-score measures how unusual a change is: values beyond ±1.96 mean a deviation that large would occur by random chance less than 5% of the time.
A model changing rank in 24 hours could be a blip. But if it is also moving over 7 days, that is a real trend. Models flagged on both timeframes are the most important to watch: they represent confirmed, sustained performance shifts.
Some models have naturally stable scores - even a small rank change for these models is meaningful. Others have volatile scores that bounce around - they need a bigger shift before you should care. CV% (coefficient of variation) tells you how volatile each model is. Higher = noisier.
Most volatile models (highest CV%)

| Model | Provider | Score | CV% |
|---|---|---|---|
| WizardLM-2 8x22B | Microsoft | 32.2 | 5.5% |
| Mixtral 8x22B Instruct | Mistral AI | 37.1 | 5.4% |
| Llama 3.2 3B Instruct (free) | Meta | 35.2 | 5.4% |
| autofixer-01 | Vercel | 38.8 | 4.8% |
| GPT-4 | OpenAI | 39.0 | 4.8% |
| Mixtral 8x7B Instruct | Mistral AI | 42.4 | 4.7% |
| Claude 3 Haiku | Anthropic | 43.0 | 4.7% |
| GPT-3.5 Turbo (older v0613) | OpenAI | 38.0 | 4.3% |
| Olmo 2 32B Instruct | Allen AI | 44.5 | 4.2% |
| GPT-4 Turbo (older v1106) | OpenAI | 42.7 | 4.2% |
| Llama Guard 3 8B | Meta | 42.9 | 3.9% |
| Qwen2.5 Coder 7B Instruct | Alibaba | 42.9 | 3.8% |
| Command R7B (12-2024) | Cohere | 44.7 | 3.6% |
| GPT-4o (2024-08-06) | OpenAI | 55.6 | 3.6% |
| LFM2-8B-A1B | Liquid AI | 53.2 | 3.6% |
| SWE-1.5 | Windsurf | 49.2 | 3.6% |
| Devstral Medium | Mistral AI | 62.6 | 3.4% |
| Maestro Reasoning | arcee-ai | 55.6 | 3.4% |
| QwQ 32B | Alibaba | 47.0 | 3.3% |
| Mistral Small 3 | Mistral AI | 59.5 | 3.2% |
Most stable models (lowest CV%)

| Model | Provider | Score | CV% |
|---|---|---|---|
| Grok 4 Fast | xAI | 83.3 | 0.4% |
| Sonar Pro Search | Perplexity | 85.0 | 0.5% |
| Gemini 3.1 Pro Preview | Google | 85.5 | 0.5% |
| Gemini 3.1 Pro Preview Custom Tools | Google | 85.0 | 0.5% |
| Nova 2 Lite | Amazon | 72.7 | 0.5% |
| Nova Premier 1.0 | Amazon | 77.8 | 0.6% |
| Gemini 2.5 Pro | Google | 84.8 | 0.6% |
| Solar Pro 3 | Upstage | 72.5 | 0.6% |
| Grok 3 Mini | xAI | 76.2 | 0.6% |
| Gemini 2.5 Flash Lite | Google | 81.4 | 0.6% |
| Claude 3.7 Sonnet | Anthropic | 77.1 | 0.6% |
| Gemini 2.5 Flash | Google | 80.1 | 0.6% |
| Claude Opus 4 | Anthropic | 81.7 | 0.6% |
| Nemotron 3 Super (free) | NVIDIA | 84.1 | 0.6% |
| Claude Opus 4.6 | Anthropic | 92.1 | 0.6% |
| Qwen3 Coder Plus | Alibaba | 78.6 | 0.6% |
| Qwen3.5 397B A17B | Alibaba | 81.8 | 0.6% |
| Claude Opus 4.5 | Anthropic | 90.4 | 0.6% |
| Qwen3 Coder Next | Alibaba | 76.7 | 0.6% |
| Gemma 3 4B | Google | 56.2 | 0.7% |
Understanding the statistical methodology behind our significance analysis helps you distinguish real performance shifts from random fluctuations.
We use z-scores with a 95% confidence threshold (|z| > 1.96). A z-score measures how many standard deviations a model's current score is from its historical baseline. Only changes exceeding 1.96 standard deviations are flagged as statistically significant.
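The flagging rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the site's actual implementation: the function names are invented, and it assumes the sample standard deviation is used.

```python
import statistics

def z_score(current: float, history: list[float]) -> float:
    """How many standard deviations the current score sits from its historical mean."""
    baseline = statistics.mean(history)
    sd = statistics.stdev(history)  # sample standard deviation (an assumption)
    return (current - baseline) / sd

def is_significant(current: float, history: list[float], threshold: float = 1.96) -> bool:
    """Flag only changes beyond the 95% confidence threshold (|z| > 1.96)."""
    return abs(z_score(current, history)) > threshold
```

For a model whose recent scores hover tightly around 50, a jump to 52 clears the threshold, while 50.5 does not.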
The baseline is computed as the arithmetic mean of each model's 14-day sparkline data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.
Each model's 95% confidence interval is calculated as baseline ± 1.96 × standard deviation. Scores falling outside this range indicate a statistically meaningful change. The "Confidence" column shows the ± threshold value.
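The interval test is equivalent to the z-score test but phrased as a band around the baseline. A minimal sketch (hypothetical names; assumes the sample standard deviation over the 14-day history):

```python
import statistics

def confidence_interval(history: list[float], z: float = 1.96) -> tuple[float, float]:
    """95% band around the baseline: mean ± 1.96 × standard deviation."""
    baseline = statistics.mean(history)
    margin = z * statistics.stdev(history)  # the "Confidence" column's ± threshold
    return (baseline - margin, baseline + margin)

def outside_interval(current: float, history: list[float]) -> bool:
    """A score outside the band indicates a statistically meaningful change."""
    lo, hi = confidence_interval(history)
    return current < lo or current > hi
```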
Daily (24h) and weekly (7d) rank changes are analyzed separately. Daily significance requires a rank shift of more than 3 positions; weekly requires more than 5. Models significant on both timeframes represent the strongest, most reliable signals.
The coefficient of variation (CV%) measures relative volatility. High-CV models have naturally noisy scores and require larger absolute changes to be significant. Low-CV models are more predictable, so even small deviations may represent real shifts.
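The CV% values in the tables above follow from a one-line formula. A sketch (hypothetical name; assumes the sample standard deviation):

```python
import statistics

def cv_percent(history: list[float]) -> float:
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(history) / statistics.mean(history)
```

Because it is normalized by the mean, CV% lets you compare volatility across models whose scores sit at very different levels.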
Statistical significance indicates whether a model's rank change represents a real performance shift or is just random noise. We use z-scores with a 95% confidence threshold (|z| > 1.96), meaning a change is only flagged as significant if there is less than a 5% chance it occurred by random variation.
A z-score measures how many standard deviations a model's current score deviates from its historical baseline. It is calculated as (current score - baseline mean) / standard deviation. Values above +1.96 indicate significant improvement, while values below -1.96 indicate significant decline.
The CV% measures a model's relative score volatility. A high CV% means the model's performance fluctuates a lot, requiring larger changes to be statistically significant. A low CV% means the model is very consistent, so even small deviations may represent meaningful shifts. This helps distinguish inherently noisy models from truly changing ones.