How we rank AI models. This page explains every step in plain language -- no math required to follow along. We believe rankings should be transparent, so nothing here is hidden.
Every model on AI Models Map receives a composite score from 0 to 100 called the ModelCap Score. This score combines multiple independent signals into a single number that balances technical quality with real-world adoption.
Think of it like a credit score for AI models: no single factor determines the outcome. A model needs to perform well across several dimensions to rank highly.
How it works, in brief:
Every model is evaluated on seven independent dimensions. Each signal is collected from multiple sources to avoid single-source bias.
How good is the model at its job? We look at standardized benchmark scores across knowledge, coding, and reasoning tasks, Elo ratings from head-to-head arenas, and third-party evaluation results.
How much interest is there in this model? We track search volume, waitlist sizes, developer documentation traffic, and trending mentions across developer communities.
Is the model being used in production? We track API call volumes (from providers that publish this data), package downloads for model-specific libraries, and active user counts.
How affordable is the model? We compare per-token prices, per-image prices, per-minute video costs, and free-tier availability. Lower cost per unit of quality pushes this signal higher.
How reliable and safe is the model? We look at uptime records, safety benchmark scores (like TrustLLM), compliance certifications (SOC 2, HIPAA), and whether the provider has a public responsible-AI policy.
What are people saying about this model? We analyze sentiment from social platforms, developer communities, and forums. Positive discussions push this signal up; widespread complaints push it down.
What do domain experts think? This signal incorporates structured reviews from verified AI researchers, engineers with published work, and members of our advisory council.
The full mathematical formula for computing a model's score in category c at time t:

S_c(m, t) = 100 · Σ_s w_{c,s} · σ(z_s(m, t))

Where S_c is the ModelCap Score for category c, m is the model, t is the current time, w_{c,s} is the weight of signal s in category c, z_s is the clipped z-score for signal s, and σ is the sigmoid that maps each z-score into the 0-1 range. The weighted sum Σ_s w_{c,s} · σ(z_s) is Z_c, the weighted z-score composite.
Each raw signal value passes through a normalization pipeline -- outlier clipping, sigmoid mapping, and confidence weighting, detailed in the normalization section below -- before being combined into the final score.
Not every signal matters equally for every type of model. A coding model should be judged more on benchmark performance, while an image generation model might lean more on community sentiment and trust. Each top-level category has its own weight table:
| Signal | Coding | Image Generation | Video Generation |
|---|---|---|---|
| Demand | 14% | 18% | 16% |
| Usage | 28% | 22% | 21% |
| Quality | 30% | 26% | 28% |
| Cost | 14% | 12% | 10% |
| Trust | 12% | 14% | 15% |
| Social | 2% | 6% | 8% |
| Expert | 0% | 2% | 2% |
| Total | 100% | 100% | 100% |
Notice that coding models weight Quality (30%) and Usage (28%) most heavily, while social signals are capped at just 2%. This prevents viral hype from overriding technical merit. For image and video models, social sentiment gets more weight because community reception is a stronger quality signal in creative domains.
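The weight table above translates directly into code. This sketch uses the percentages from the table; the dict layout and function name are illustrative, not the production schema:

```python
# Per-category signal weights from the table above, in percent.
# Keys and structure are illustrative, not the production schema.
WEIGHTS = {
    "coding":           {"demand": 14, "usage": 28, "quality": 30, "cost": 14, "trust": 12, "social": 2, "expert": 0},
    "image_generation": {"demand": 18, "usage": 22, "quality": 26, "cost": 12, "trust": 14, "social": 6, "expert": 2},
    "video_generation": {"demand": 16, "usage": 21, "quality": 28, "cost": 10, "trust": 15, "social": 8, "expert": 2},
}

def category_weights(category: str) -> dict[str, float]:
    """Return a category's signal weights as fractions summing to 1.0."""
    table = WEIGHTS[category]
    assert sum(table.values()) == 100, "weights must total 100%"
    return {signal: pct / 100 for signal, pct in table.items()}
```

Asserting that each column totals 100% guards against a future weight edit silently unbalancing a category.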
Every signal has a half-life controlling how quickly old data fades via Exponential Moving Average (EMA) decay. After one half-life, a data point contributes half as much to the score. This ensures rankings naturally adapt to change.
| Signal | EMA Half-life | Why this duration |
|---|---|---|
| Demand | 7 days | Interest spikes and fades fast |
| Social | 7 days | Conversation moves quickly |
| Usage | 14 days | Adoption takes 1-2 weeks to shift |
| Cost / Reliability | 30 days | Price and uptime changes are deliberate |
| Quality | 60 days | Benchmarks only update when new evals run |
| Expert | 90 days | Expert reviews are rare and high-value |
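Half-life decay reduces to a single expression: a data point's weight halves once per half-life elapsed. A minimal sketch (the function name is ours):

```python
def decay_weight(age_days: float, half_life_days: float) -> float:
    """Weight of a data point under EMA half-life decay.
    After one half-life the weight is 0.5, after two it is 0.25, and so on."""
    return 0.5 ** (age_days / half_life_days)

# A 7-day-old Demand data point (half-life 7 days) counts half as much
# as a fresh one; a 14-day-old one counts a quarter as much.
```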
We collect data from multiple independent sources across benchmarks, community platforms, usage metrics, and expert assessments. Sources are validated continuously and cross-referenced to ensure accuracy and prevent single-source bias.
Raw signal values come in wildly different units: Elo ratings are 800-1500, API calls are in the millions, and sentiment is -1 to 1. We normalize everything to a 0-1 scale so signals can be compared and combined fairly.
The normalization process works in three steps:

1. **Z-score standardization.** We calculate how many standard deviations each value is from the mean across all models. Extreme outliers are clipped to prevent any single data point from overwhelming the score.
2. **Sigmoid mapping.** The z-score is passed through a sigmoid function that maps it smoothly to the 0-1 range. This creates meaningful separation between models without extreme jumps.
3. **Confidence weighting.** Each data point has a confidence score reflecting source reliability and data freshness. Low-confidence data has less influence on the final score.
The final ModelCap Score is these normalized, weighted signals added together and scaled to 0-100.
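The three normalization steps and the weighted sum can be combined in one short sketch. Function names are ours, the ±2 clip range is the one stated in the anti-gaming section, and pulling low-confidence values toward the neutral 0.5 is our reading of "less influence":

```python
import math

def normalize(z: float, confidence: float = 1.0) -> float:
    """Clip a z-score to [-2, +2], map it through a sigmoid to (0, 1),
    then shrink it toward the neutral 0.5 in proportion to its confidence."""
    z = max(-2.0, min(2.0, z))
    sig = 1.0 / (1.0 + math.exp(-z))             # sigmoid -> (0, 1)
    return 0.5 + confidence * (sig - 0.5)        # low confidence pulls toward 0.5

def modelcap_score(z_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of normalized signals, scaled to 0-100."""
    return 100.0 * sum(weights[s] * normalize(z) for s, z in z_scores.items())
```

With weights summing to 1, a model that is exactly average on every signal lands at 50, and no combination of inputs can escape the 0-100 range.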
Every ranked model has a lifecycle state that communicates data maturity. The state affects how the score should be interpreted:
**Preliminary** -- Model has fewer than 5 signals or less than 7 days of data. Score is provisional and may change significantly.

**Stable** -- Model has sufficient data from multiple sources. Score is reliable within the confidence interval shown.

**Fragile** -- Score has high variance or conflicting signals. The rank spread is wide and may shift with new data.

**Held** -- Score is frozen pending investigation. This happens when a signal reports anomalous values.
Models typically progress from Preliminary to Stable as data accumulates. A stable model can become Fragile if its signals degrade or conflict, and may be Held if an anomaly is detected.
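The Preliminary threshold (fewer than 5 signals or under 7 days of data) is concrete enough to sketch; the Fragile and Held checks below are illustrative placeholders, since their exact thresholds are not published:

```python
def lifecycle_state(signal_count: int, days_of_data: float,
                    rank_is_stable: bool = True,
                    anomaly_detected: bool = False) -> str:
    """Classify a model's lifecycle state. The Preliminary rule is the one
    stated in the text; the other two checks are illustrative placeholders."""
    if anomaly_detected:
        return "Held"            # frozen pending manual review
    if signal_count < 5 or days_of_data < 7:
        return "Preliminary"     # too little data for a reliable score
    if not rank_is_stable:
        return "Fragile"         # high variance or conflicting signals
    return "Stable"
```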
We take ranking integrity seriously. Our anti-gaming system operates in four complementary layers:
Every incoming data point is validated against expected ranges and schema. Values outside of statistical norms are flagged. Z-scores are clipped to [-2, +2] to prevent extreme outliers from dominating.
No single signal can move a score by more than a set threshold. Correlated spikes across independent signals confirm real changes; isolated spikes are dampened. Social signals are hard-capped at 8% influence.
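The dampening rule amounts to a clamp on per-update movement. A sketch of that idea; the threshold value here is illustrative, not the production one:

```python
def dampen(previous: float, proposed: float, max_delta: float = 3.0) -> float:
    """Limit how far a single update can move a signal's contribution.
    max_delta is an illustrative threshold, not the real one."""
    delta = proposed - previous
    delta = max(-max_delta, min(max_delta, delta))  # clamp the movement
    return previous + delta
```

An isolated 10-point spike is trimmed to the cap, while small moves pass through unchanged; a real change confirmed by correlated signals simply reaches its new level over several updates.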
Weekly automated audits re-compute scores with bootstrapped samples, dropped sources, and perturbed weights. If a model's rank is not stable under these perturbations, it is flagged as Fragile.
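One way to picture the perturbation audit: recompute a model's rank many times with slightly jittered weights, and flag it Fragile if its rank wanders. A toy sketch under those assumptions (names, jitter size, and trial count are ours):

```python
import random

def rank_of(target: str, scores: dict[str, float]) -> int:
    """1-based rank of target, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1

def is_rank_stable(target: str, raw: dict[str, dict[str, float]],
                   weights: dict[str, float], trials: int = 200,
                   jitter: float = 0.05, seed: int = 0) -> bool:
    """Recompute ranks with randomly perturbed weights; the rank is
    stable if it never moves across all trials."""
    rng = random.Random(seed)
    base = {m: sum(weights[s] * v for s, v in sig.items()) for m, sig in raw.items()}
    base_rank = rank_of(target, base)
    for _ in range(trials):
        pw = {s: w * (1 + rng.uniform(-jitter, jitter)) for s, w in weights.items()}
        scores = {m: sum(pw[s] * v for s, v in sig.items()) for m, sig in raw.items()}
        if rank_of(target, scores) != base_rank:
            return False
    return True
```

A model far ahead of its neighbors survives any 5% weight jitter; one separated by a hair-thin margin flips ranks under perturbation and gets flagged.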
Every data pipeline has health monitoring with circuit breakers. If a source starts returning suspicious data (sudden 10x spikes, all-zero values, etc.), it is automatically disabled and the model enters Held state until manual review.
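A breaker check of the kind described (sudden 10x spikes, all-zero values) might look like this sketch; the function name, window sizes, and thresholds are illustrative:

```python
def source_looks_suspicious(history: list[float], latest: float) -> bool:
    """Trip the breaker on a sudden 10x spike or an all-zero feed.
    Thresholds are illustrative, not the production values."""
    if latest == 0 and all(v == 0 for v in history[-5:]):
        return True              # all-zero values: the source is likely broken
    baseline = sorted(history)[len(history) // 2] if history else 0  # median
    if baseline > 0 and latest >= 10 * baseline:
        return True              # sudden 10x spike over the trailing baseline
    return False
```

Comparing against a trailing median rather than the mean keeps one earlier outlier from inflating the baseline and masking a spike.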
We do not pretend our rankings are perfectly precise. Every score comes with a confidence level (0-1) that reflects how much data we have and how consistent it is. When you see a confidence of 0.92, it means we have strong, agreeing signals. A confidence of 0.6 means the data is sparse or conflicting.
Each ranked model also has a rank spread -- the range of positions it could reasonably hold given the confidence interval. If a model is ranked #2 with a spread of #1-#3, it means the difference between it and its neighbors is within the margin of uncertainty.
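Rank spread can be derived from score intervals: a model's spread covers every position it could occupy if each score shifted within its margin of uncertainty. A small sketch of that idea (names and margins are illustrative):

```python
def rank_spread(target: str, scores: dict[str, float],
                margin: dict[str, float]) -> tuple[int, int]:
    """Best and worst 1-based rank the target could hold if every model's
    score moved within +/- its margin of uncertainty."""
    lo, hi = scores[target] - margin[target], scores[target] + margin[target]
    # Best case: only models whose whole interval sits above target's interval beat it.
    best = 1 + sum(1 for m, s in scores.items()
                   if m != target and s - margin[m] > hi)
    # Worst case: any model whose interval overlaps or exceeds target's could beat it.
    worst = 1 + sum(1 for m, s in scores.items()
                    if m != target and s + margin[m] > lo)
    return best, worst
```

With scores of 80, 78, 76, and 60 and a margin of ±2, the second-place model gets a spread of #1-#3, matching the example above; with zero margins the spread collapses to its exact rank.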
This is our way of being honest: when scores are close, we tell you they are close. We never over-claim precision we do not have.
Most AI leaderboards focus on a single dimension -- usually benchmark scores. That is useful but incomplete. A model can score perfectly on MMLU and still be too expensive for production use, or too unreliable for enterprise deployment.
AI Models Map is different in four ways:
We combine seven signals across quality, usage, demand, cost, trust, social, and expert opinion. No single metric can dominate.
Rankings update hourly, not monthly. When a model drops its prices or a new benchmark comes out, you see the effect within hours.
Coding models are ranked with different weights than image generators. The same model can rank differently in different categories.
Every weight, every signal, and every scoring step is documented here. You can see exactly why a model ranks where it does.
Every score in our system is tagged with a formula version number. When we update weights, add new signals, or change the normalization method, we increment the version. This means you can always tell which formula produced a given score, and historical scores remain valid within their version.
Formula changes are announced in advance and documented in our methodology changelog. Platform-wide updates are tracked in the platform changelog.
If something is unclear or you want to suggest improvements to our methodology, we want to hear from you.