Every model on AI Models Map receives a composite score from 0 to 100, computed from six weighted dimensions. The score is designed to be transparent, reproducible, and useful for real-world model selection. No black boxes, no pay-to-rank.
331+ models scored
51+ providers tracked
25 free models
Hourly refresh cadence
Each model's final score is a weighted sum of six normalized dimensions. Every dimension is scored 0–100 independently, then multiplied by its weight to produce the composite. Here is the breakdown:
Measures the breadth of a model's feature set: vision, function calling, streaming, JSON mode, reasoning, web search, and image output. Models with more capabilities score higher.
Evaluates cost efficiency based on input and output token pricing. Free models score highest; expensive models are penalized. Reflects real API pricing from OpenRouter.
Scores the model's context window size relative to the field. Larger context windows enable processing of longer documents, entire codebases, and complex multi-turn conversations.
Rewards recently released models. Newer models benefit from the latest research and training techniques. This signal decays over time, reflecting the fast pace of AI development.
Measures maximum output token length. Models that can generate longer responses score higher, which matters for code generation, long-form content, and detailed analysis tasks.
Assesses multimodal flexibility by counting supported input and output modalities. Models that handle text, images, audio, and video across both directions score highest.
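For concreteness, here is a minimal sketch of how one dimension, versatility, could be derived by counting supported modalities. The field names and the simple count-based scaling are illustrative assumptions, not the exact implementation.

```typescript
// Illustrative sketch: scoring the versatility dimension by counting modalities.
// Field names and the count-based scale are assumptions, not the site's exact code.
type Modality = "text" | "image" | "audio" | "video";

interface ModelModalities {
  inputModalities: Modality[];
  outputModalities: Modality[];
}

// Four possible modalities in each direction -> eight total "slots".
const MAX_MODALITY_SLOTS = 8;

function versatilityScore(m: ModelModalities): number {
  const supported = m.inputModalities.length + m.outputModalities.length;
  // Normalize the count to 0-100; a text-only model (1 in, 1 out) scores 25.
  return Math.round((supported / MAX_MODALITY_SLOTS) * 100);
}
```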
All model data is sourced from the OpenRouter API, which aggregates models from 51+ providers including OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, and more.
A cron job fetches updated model data every hour. New models, pricing changes, and capability updates are reflected within the next cycle.
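In practice the refresh boils down to pulling the model catalog from OpenRouter. The sketch below assumes the public /api/v1/models listing and a trimmed-down response shape; both should be verified against OpenRouter's current documentation.

```typescript
// Sketch of the hourly refresh pulling the model catalog from OpenRouter.
// The response shape (data[], string-valued pricing fields) follows OpenRouter's
// public /api/v1/models listing, but treat it as an assumption.
interface OpenRouterModel {
  id: string;
  name: string;
  context_length: number;
  pricing: { prompt: string; completion: string };
}

async function fetchModelCatalog(): Promise<OpenRouterModel[]> {
  const res = await fetch("https://openrouter.ai/api/v1/models");
  if (!res.ok) throw new Error(`OpenRouter request failed: ${res.status}`);
  const body = (await res.json()) as { data: OpenRouterModel[] };
  return body.data;
}
```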
Coverage spans coding, image generation, video generation, and multimodal models; 155 are open source and 25 are free to use.
Each model is scored on its canonical capabilities and best available pricing. Duplicate listings across providers are deduplicated to the most competitive offering.
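A minimal sketch of that deduplication step, assuming each listing carries a canonical model slug and a numeric prompt price; keeping the cheapest listing per slug is one way to read "most competitive offering".

```typescript
// Sketch: collapse duplicate listings of the same model to the cheapest offering.
// The canonicalSlug field and the price unit are illustrative assumptions.
interface Listing {
  canonicalSlug: string;       // model id without provider-specific suffixes
  promptPricePerToken: number; // USD per input token
}

function dedupeToCheapest(listings: Listing[]): Listing[] {
  const best = new Map<string, Listing>();
  for (const l of listings) {
    const current = best.get(l.canonicalSlug);
    if (!current || l.promptPricePerToken < current.promptPricePerToken) {
      best.set(l.canonicalSlug, l);
    }
  }
  return [...best.values()];
}
```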
Performance scores incorporate results from established AI benchmarks. Each benchmark tests a different dimension of model intelligence, from general knowledge to specialized coding and reasoning tasks.
MMLU (Massive Multitask Language Understanding) - tests knowledge across 57 subjects from STEM to humanities.
HumanEval - code generation benchmark measuring functional correctness of synthesized programs from docstrings.
SWE-bench - real-world software engineering tasks from GitHub issues, testing end-to-end coding ability.
GPQA - graduate-level questions in physics, biology, and chemistry requiring expert-level reasoning.
GSM8K - grade school math word problems testing multi-step mathematical reasoning.
Competition-level mathematics problems requiring advanced problem-solving.
Benchmark scores are normalized to a 0–100 scale and aggregated with task-specific weights. Explore all benchmarks on the benchmarks page.
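As an illustration, the aggregation can be sketched as a weighted average over whichever benchmark results a model reports; the weights below are placeholders, not the published ones.

```typescript
// Sketch: normalize raw benchmark results to 0-100 and combine them with
// task-specific weights. These weights are placeholders for illustration only.
type BenchmarkName = "mmlu" | "humaneval" | "swebench" | "gpqa" | "gsm8k";

const BENCHMARK_WEIGHTS: Record<BenchmarkName, number> = {
  mmlu: 0.25,
  humaneval: 0.2,
  swebench: 0.25,
  gpqa: 0.15,
  gsm8k: 0.15,
};

// Raw results are assumed to be accuracy / pass-rate fractions in [0, 1].
function performanceScore(results: Partial<Record<BenchmarkName, number>>): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [name, weight] of Object.entries(BENCHMARK_WEIGHTS) as [BenchmarkName, number][]) {
    const raw = results[name];
    if (raw === undefined) continue; // skip benchmarks the model has no result for
    weighted += raw * 100 * weight;  // scale to 0-100
    totalWeight += weight;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```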
Each model produces six SignalScore objects that represent different facets of quality and value. These signals feed into the composite score and are individually visible on every model page.
Composite of benchmark results (MMLU, HumanEval, SWE-bench, GPQA, GSM8K) weighted by task relevance. Measures raw intelligence and problem-solving ability.
How much capability you get per dollar. Combines pricing tier with performance to identify models that deliver the best value at each price point.
Feature breadth score from vision, function calling, streaming, JSON mode, reasoning, web search, and image output support.
Normalized context window size relative to the maximum in the category. Rewards models that can process more information in a single request.
Time-decayed score based on release date. Recently launched models score higher, reflecting the rapid pace of AI advancement.
Overall value proposition combining all signals. Identifies models that strike the best balance across performance, price, features, and recency.
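A minimal type sketch of how these six signals might hang off a model record; the signal keys and SignalScore fields are assumptions inferred from the descriptions above, not the exact schema.

```typescript
// Sketch of per-model signal data; names and fields are assumptions inferred
// from the descriptions above, not the site's exact schema.
type SignalKind =
  | "performance"  // benchmark composite
  | "value"        // capability per dollar
  | "capability"   // feature breadth
  | "context"      // normalized context window
  | "recency"      // time-decayed release-date score
  | "overall";     // blended value proposition

interface SignalScore {
  kind: SignalKind;
  score: number;        // 0-100, normalized against the current model population
  explanation?: string; // optional rationale surfaced on the model page
}

interface ScoredModel {
  id: string;
  compositeScore: number; // 0-100 weighted blend of the six dimensions
  signals: SignalScore[]; // six entries, one per SignalKind
}
```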
CompositeScore = (Capabilities × 0.25) + (Pricing × 0.25) + (Context × 0.15) + (Recency × 0.15) + (Output × 0.10) + (Versatility × 0.10)
Each dimension is independently normalized to 0–100 before weighting. The final composite is also clamped to 0–100. All normalization uses min-max scaling against the current model population, meaning scores are relative to the field—not absolute.
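Putting the weights and the normalization together, the composite calculation can be sketched as follows; the min-max helper and clamping mirror the description above, but the code itself is an illustrative assumption, not the production implementation.

```typescript
// Sketch: min-max normalize each raw dimension against the current model
// population, apply the published weights, and clamp the result to 0-100.
const WEIGHTS = {
  capabilities: 0.25,
  pricing: 0.25,
  context: 0.15,
  recency: 0.15,
  output: 0.1,
  versatility: 0.1,
} as const;

type Dimension = keyof typeof WEIGHTS;
type RawDimensions = Record<Dimension, number>;

function minMaxNormalize(value: number, min: number, max: number): number {
  if (max === min) return 50; // degenerate field: treat every model as average
  return ((value - min) / (max - min)) * 100;
}

function compositeScore(raw: RawDimensions, field: RawDimensions[]): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    const values = field.map((m) => m[dim]);
    const normalized = minMaxNormalize(raw[dim], Math.min(...values), Math.max(...values));
    total += normalized * WEIGHTS[dim];
  }
  return Math.min(100, Math.max(0, Math.round(total)));
}
```

Because every dimension is scaled against the current population, a model's composite can shift when new models enter the field, even if its own data has not changed.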
Scores are recalculated every hour using live data from the OpenRouter API. When a new model is released or pricing changes, the update is reflected within the next refresh cycle.
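A sketch of wiring that refresh into an hourly schedule, assuming a node-cron style scheduler; a hosted platform cron would work the same way.

```typescript
// Sketch: trigger the score refresh at the top of every hour.
// Assumes a node-cron style scheduler; the actual deployment may differ.
import cron from "node-cron";

async function refreshScores(): Promise<void> {
  // Fetch the OpenRouter catalog, dedupe listings, recompute dimension scores
  // and composites, then persist them (see the earlier sketches on this page).
  console.log(`Score refresh started at ${new Date().toISOString()}`);
}

cron.schedule("0 * * * *", () => {
  void refreshScores();
});
```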
Capabilities and pricing are the two most impactful factors for real-world model selection. A model needs strong features to be useful, and competitive pricing to be practical. Equal weighting ensures neither dominates - a cheap model with poor capabilities will not outrank a well-featured model at a fair price.
Yes. Free models receive the maximum score on the pricing dimension, which carries 25% of the composite weight. If they also have strong capabilities, a decent context window, and a recent release date, they can and do outrank more expensive models. This is by design - pricing accessibility matters.
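One way to realize this behavior is to give free models the top of the pricing scale outright and rank paid models by inverse blended cost; the blend and scaling below are illustrative choices, not the published formula.

```typescript
// Sketch: pricing dimension where free models get the maximum score and paid
// models are ranked by a blended per-token cost. The blend and scaling are
// illustrative assumptions, not the exact published formula.
interface ModelPricing {
  promptUsdPerMTok: number;     // USD per million input tokens
  completionUsdPerMTok: number; // USD per million output tokens
}

function pricingScore(p: ModelPricing, field: ModelPricing[]): number {
  const blended = (m: ModelPricing) => m.promptUsdPerMTok + m.completionUsdPerMTok;
  if (blended(p) === 0) return 100; // free models top the pricing dimension

  const paid = field.map(blended).filter((c) => c > 0);
  const cheapest = Math.min(...paid);
  const priciest = Math.max(...paid);
  if (priciest === cheapest) return 50;

  // Cheaper models score closer to 100, the most expensive closer to 0.
  return ((priciest - blended(p)) / (priciest - cheapest)) * 100;
}
```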
Each model is scored based on its canonical capabilities and the best available pricing across providers. The OpenRouter API aggregates availability from multiple endpoints, so our data reflects the most competitive offering for each model.