Methodology

How we rank AI models. This page explains every step in plain language -- no math required. We believe rankings should be transparent, so nothing here is hidden.

The big picture

Every model on AI Models Map receives a composite score from 0 to 100 called the ModelCap Score. This score combines multiple independent signals into a single number that balances technical quality with real-world adoption.

Think of it like a credit score for AI models: no single factor determines the outcome. A model needs to perform well across several dimensions to rank highly.

How it works, in three sentences:

We collect data from multiple independent sources every hour -- benchmarks, usage metrics, community sentiment, pricing, and expert reviews.
Each data point is normalized to a 0-1 scale and weighted by category (coding models are scored differently from image generators).
Older data fades out automatically so rankings always reflect the current state of the world, not last year's news.

What we measure: the 7 signals

Every model is evaluated on seven independent dimensions. Each signal is collected from multiple sources to avoid single-source bias.

Quality

How good is the model at its job? We look at standardized benchmark scores across knowledge, coding, and reasoning tasks, Elo ratings from head-to-head arenas, and third-party evaluation results.

Sources

Benchmark suites, head-to-head arenas, third-party evaluations

Half-life

60 days

Why this half-life

Benchmark scores change slowly -- they only move when new evaluations are published.

Demand

How much interest is there in this model? We track search volume, waitlist sizes, developer documentation traffic, and trending mentions across developer communities.

Sources

Search interest trackers, developer communities, tech forums

Half-life

7 days

Why this half-life

Interest spikes fast when a new model launches and fades quickly. The short half-life captures this.

Usage

Is the model being used in production? We track API call volumes (from providers that publish this data), package downloads for model-specific libraries, and active user counts.

Sources

Provider telemetry (where public), package downloads, repository activity

Half-life

14 days

Why this half-life

Usage is a moderate-speed signal -- it takes a few weeks for teams to adopt or abandon a model.

Cost

How affordable is the model? We compare per-token prices, per-image prices, per-minute video costs, and free-tier availability. Lower cost per unit of quality pushes this signal higher.

Sources

Official provider pricing, collected daily

Half-life

30 days

Why this half-life

Pricing changes are deliberate and infrequent, but when they happen, we want them to matter quickly.

Trust

How reliable and safe is the model? We look at uptime records, safety benchmark scores (like TrustLLM), compliance certifications (SOC 2, HIPAA), and whether the provider has a public responsible-AI policy.

Sources

Uptime monitoring, safety benchmarks, compliance records

Half-life

30 days

Why this half-life

Trust is earned slowly and lost quickly. The 30-day decay balances both.

Social

What are people saying about this model? We analyze sentiment from social platforms, developer communities, and forums. Positive discussions push this signal up; widespread complaints push it down.

Sources

Social platforms, developer forums, community discussions

Half-life

7 days

Why this half-life

Social sentiment is noisy and moves fast. We give it a short half-life and a hard weight cap (max 8%) so it never dominates.

Expert

What do domain experts think? This signal incorporates structured reviews from verified AI researchers, engineers with published work, and members of our advisory council.

Sources

Verified expert reviews, advisory council assessments

Half-life

90 days

Why this half-life

Expert opinions are high-quality but rare. The long half-life ensures each review contributes for months.

The scoring formula

The full mathematical formula for computing a model's score in category c at time t:

S_c(m,t) = 100 × sigmoid(Z_c(m,t))

Where S_c is the ModelCap Score for category c, m is the model, t is the current time, and Z_c is the weighted z-score composite.

Data processing pipeline

Each raw signal value passes through this pipeline before being combined into the final score:

raw_value

Raw signal from source

log1p

Log transform to compress outliers

robust_scale

Median/IQR scaling

EMA decay

Time-weighted freshness

z-normalize

Standard deviation units

clip

Bound to [-2, +2]

weighted sum

Category-specific weights

sigmoid

Map to 0-1 range

0-100

Final ModelCap Score

How categories change the weights

Not every signal matters equally for every type of model. A coding model should be judged more on benchmark performance, while an image generation model might lean more on community sentiment and trust. Each top-level category has its own weight table:

Signal	Coding	Image Generation	Video Generation
Demand	14%	18%	16%
Usage	28%	22%	21%
Quality	30%	26%	28%
Cost	14%	12%	10%
Trust	12%	14%	15%
Social	2%	6%	8%
Expert	0%	2%	2%
Total	100%	100%	100%

Notice that coding models weight Quality (30%) and Usage (28%) most heavily, while social signals are capped at just 2%. This prevents viral hype from overriding technical merit. For image and video models, social sentiment gets more weight because community reception is a stronger quality signal in creative domains.

Time decay: EMA half-lives per signal

Every signal has a half-life controlling how quickly old data fades via Exponential Moving Average (EMA) decay. After one half-life, a data point contributes half as much to the score. This ensures rankings naturally adapt to change.

Signal	EMA Half-life	Why this duration
Demand	7 days	Interest spikes and fades fast
Social	7 days	Conversation moves quickly
Usage	14 days	Adoption takes 1-2 weeks to shift
Cost / Reliability	30 days	Price and uptime changes are deliberate
Quality	60 days	Benchmarks only update when new evals run
Expert	90 days	Expert reviews are rare and high-value

Data collection

We collect data from multiple independent sources across benchmarks, community platforms, usage metrics, and expert assessments. Sources are validated continuously and cross-referenced to ensure accuracy and prevent single-source bias.

How raw data becomes a score

Raw signal values come in wildly different units: Elo ratings are 800-1500, API calls are in the millions, and sentiment is -1 to 1. We normalize everything to a 0-1 scale so signals can be compared and combined fairly.

The normalization process works in three steps:

Z-score standardization

We calculate how many standard deviations each value is from the mean across all models. Extreme outliers are clipped to prevent any single data point from overwhelming the score.

Sigmoid mapping

The z-score is passed through a sigmoid function that maps it smoothly to the 0-1 range. This creates meaningful separation between models without extreme jumps.

Confidence weighting

Each data point has a confidence score reflecting source reliability and data freshness. Low-confidence data has less influence on the final score.

The final ModelCap Score is these normalized, weighted signals added together and scaled to 0-100.

Model lifecycle states

Every ranked model has a lifecycle state that communicates data maturity. The state affects how the score should be interpreted:

Preliminary

Model has fewer than 5 signals or less than 7 days of data. Score is provisional and may change significantly.

Stable

Model has sufficient data from multiple sources. Score is reliable within the confidence interval shown.

Fragile

Score has high variance or conflicting signals. The rank spread is wide and may shift with new data.

Held

Score is frozen pending investigation. This happens when a signal reports anomalous values.

Models typically progress from Preliminary to Stable as data accumulates. A stable model can become Fragile if its signals degrade or conflict, and may be Held if an anomaly is detected.

Anti-gaming measures: 4 layers

We take ranking integrity seriously. Our anti-gaming system operates in four complementary layers:

Input validation

Every incoming data point is validated against expected ranges and schema. Values outside of statistical norms are flagged. Z-scores are clipped to [-2, +2] to prevent extreme outliers from dominating.

Multi-signal cross-validation

No single signal can move a score by more than a set threshold. Correlated spikes across independent signals confirm real changes; isolated spikes are dampened. Social signals are hard-capped at 8% influence.

Robustness audit

Weekly automated audits re-compute scores with bootstrapped samples, dropped sources, and perturbed weights. If a model's rank is not stable under these perturbations, it is flagged as Fragile.

Per-source kill switches

Every data pipeline has health monitoring with circuit breakers. If a source starts returning suspicious data (sudden 10x spikes, all-zero values, etc.), it is automatically disabled and the model enters Held state until manual review.

Confidence and rank spreads

We do not pretend our rankings are perfectly precise. Every score comes with a confidence level (0-1) that reflects how much data we have and how consistent it is. When you see a confidence of 0.92, it means we have strong, agreeing signals. A confidence of 0.6 means the data is sparse or conflicting.

Each ranked model also has a rank spread -- the range of positions it could reasonably hold given the confidence interval. If a model is ranked #2 with a spread of #1-#3, it means the difference between it and its neighbors is within the margin of uncertainty.

This is our way of being honest: when scores are close, we tell you they are close. We never over-claim precision we do not have.

How is this different from other leaderboards?

Most AI leaderboards focus on a single dimension -- usually benchmark scores. That is useful but incomplete. A model can score perfectly on MMLU and still be too expensive for production use, or too unreliable for enterprise deployment.

AI Models Map is different in four ways:

Multi-dimensional

We combine 7 signals across quality, usage, demand, cost, trust, social, and expert opinion. No single metric can dominate.

Real-time

Rankings update hourly, not monthly. When a model drops its prices or a new benchmark comes out, you see the effect within hours.

Category-specific

Coding models are ranked with different weights than image generators. The same model can rank differently in different categories.

Transparent

Every weight, every signal, and every scoring step is documented here. You can see exactly why a model ranks where it does.

Formula versioning

Every score in our system is tagged with a formula version number. When we update weights, add new signals, or change the normalization method, we increment the version. This means you can always tell which formula produced a given score, and historical scores remain valid within their version.

Formula changes are announced in advance and documented in our methodology changelog. Platform-wide updates are tracked in the platform changelog.

Questions?

If something is unclear or you want to suggest improvements to our methodology, we want to hear from you.