Every model on AI Models Map receives a composite score from 0 to 100, computed from six weighted dimensions. The score is designed to be transparent, reproducible, and useful for real-world model selection. No black boxes, no pay-to-rank.
331+ models scored
51+ providers tracked
25 free models
Hourly refresh cadence
Each model's final score is a weighted sum of six normalized dimensions. Every dimension is scored 0–100 independently, then multiplied by its weight to produce the composite. Here is the breakdown:
Measures the breadth of a model's feature set: vision, function calling, streaming, JSON mode, reasoning, web search, and image output. Models with more capabilities score higher.
Evaluates cost efficiency based on input and output token pricing. Free models score highest; expensive models are penalized. Reflects real API pricing from OpenRouter.
Scores the model's context window size relative to the field. Larger context windows enable processing of longer documents, entire codebases, and complex multi-turn conversations.
Rewards recently released models. Newer models benefit from the latest research and training techniques. This signal decays over time, reflecting the fast pace of AI development.
Measures maximum output token length. Models that can generate longer responses score higher, which matters for code generation, long-form content, and detailed analysis tasks.
Assesses multimodal flexibility by counting supported input and output modalities. Models that handle text, images, audio, and video across both directions score highest.
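For concreteness, here is a minimal sketch of how one dimension, versatility, could be derived by counting supported modalities. The field names and the simple count-based scaling are illustrative assumptions, not the exact implementation.

```typescript
// Illustrative sketch: scoring the versatility dimension by counting modalities.
// Field names and the count-based scale are assumptions, not the site's exact code.
type Modality = "text" | "image" | "audio" | "video";

interface ModelModalities {
  inputModalities: Modality[];
  outputModalities: Modality[];
}

// Four possible modalities in each direction -> eight total "slots".
const MAX_MODALITY_SLOTS = 8;

function versatilityScore(m: ModelModalities): number {
  const supported = m.inputModalities.length + m.outputModalities.length;
  // Normalize the count to 0-100; a text-only model (1 in, 1 out) scores 25.
  return Math.round((supported / MAX_MODALITY_SLOTS) * 100);
}
```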
All model data is sourced from the OpenRouter API, which aggregates models from 51+ providers including OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, and more.
A cron job fetches updated model data every hour. New models, pricing changes, and capability updates are reflected within the next cycle.
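In practice the refresh boils down to pulling the model catalog from OpenRouter. The sketch below assumes the public /api/v1/models listing and a trimmed-down response shape; both should be verified against OpenRouter's current documentation.

```typescript
// Sketch of the hourly refresh pulling the model catalog from OpenRouter.
// The response shape (data[], string-valued pricing fields) follows OpenRouter's
// public /api/v1/models listing, but treat it as an assumption.
interface OpenRouterModel {
  id: string;
  name: string;
  context_length: number;
  pricing: { prompt: string; completion: string };
}

async function fetchModelCatalog(): Promise<OpenRouterModel[]> {
  const res = await fetch("https://openrouter.ai/api/v1/models");
  if (!res.ok) throw new Error(`OpenRouter request failed: ${res.status}`);
  const body = (await res.json()) as { data: OpenRouterModel[] };
  return body.data;
}
```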
Coverage spans coding, image generation, video generation, and multimodal models; 155 are open source and 25 are free to use.
Each model is scored on its canonical capabilities and best available pricing. Duplicate listings across providers are deduplicated to the most competitive offering.
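A minimal sketch of that deduplication step, assuming each listing carries a canonical model slug and a numeric prompt price; keeping the cheapest listing per slug is one way to read "most competitive offering".

```typescript
// Sketch: collapse duplicate listings of the same model to the cheapest offering.
// The canonicalSlug field and the price unit are illustrative assumptions.
interface Listing {
  canonicalSlug: string;       // model id without provider-specific suffixes
  promptPricePerToken: number; // USD per input token
}

function dedupeToCheapest(listings: Listing[]): Listing[] {
  const best = new Map<string, Listing>();
  for (const l of listings) {
    const current = best.get(l.canonicalSlug);
    if (!current || l.promptPricePerToken < current.promptPricePerToken) {
      best.set(l.canonicalSlug, l);
    }
  }
  return [...best.values()];
}
```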
Performance scores incorporate results from established AI benchmarks. Each benchmark tests a different dimension of model intelligence, from general knowledge to specialized coding and reasoning tasks.
MMLU (Massive Multitask Language Understanding) - tests knowledge across 57 subjects from STEM to humanities.
HumanEval - code generation benchmark measuring functional correctness of synthesized programs from docstrings.
SWE-bench - real-world software engineering tasks from GitHub issues, testing end-to-end coding ability.
GPQA - graduate-level questions in physics, biology, and chemistry requiring expert-level reasoning.
GSM8K - grade school math word problems testing multi-step mathematical reasoning.
Competition-level mathematics problems requiring advanced problem-solving.
Benchmark scores are normalized to a 0–100 scale and aggregated with task-specific weights. Explore all benchmarks on the benchmarks page.
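As an illustration, the aggregation can be sketched as a weighted average over whichever benchmark results a model reports; the weights below are placeholders, not the published ones.

```typescript
// Sketch: normalize raw benchmark results to 0-100 and combine them with
// task-specific weights. These weights are placeholders for illustration only.
type BenchmarkName = "mmlu" | "humaneval" | "swebench" | "gpqa" | "gsm8k";

const BENCHMARK_WEIGHTS: Record<BenchmarkName, number> = {
  mmlu: 0.25,
  humaneval: 0.2,
  swebench: 0.25,
  gpqa: 0.15,
  gsm8k: 0.15,
};

// Raw results are assumed to be accuracy / pass-rate fractions in [0, 1].
function performanceScore(results: Partial<Record<BenchmarkName, number>>): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [name, weight] of Object.entries(BENCHMARK_WEIGHTS) as [BenchmarkName, number][]) {
    const raw = results[name];
    if (raw === undefined) continue; // skip benchmarks the model has no result for
    weighted += raw * 100 * weight;  // scale to 0-100
    totalWeight += weight;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}
```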
Each model produces six SignalScore objects that represent different facets of quality and value. These signals feed into the composite score and are individually visible on every model page.
Composite of benchmark results (MMLU, HumanEval, SWE-bench, GPQA, GSM8K) weighted by task relevance. Measures raw intelligence and problem-solving ability.
How much capability you get per dollar. Combines pricing tier with performance to identify models that deliver the best value at each price point.
Feature breadth score from vision, function calling, streaming, JSON mode, reasoning, web search, and image output support.
Normalized context window size relative to the maximum in the category. Rewards models that can process more information in a single request.
Time-decayed score based on release date. Recently launched models score higher, reflecting the rapid pace of AI advancement.
Overall value proposition combining all signals. Identifies models that strike the best balance across performance, price, features, and recency.
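A minimal type sketch of how these six signals might hang off a model record; the signal keys and SignalScore fields are assumptions inferred from the descriptions above, not the exact schema.

```typescript
// Sketch of per-model signal data; names and fields are assumptions inferred
// from the descriptions above, not the site's exact schema.
type SignalKind =
  | "performance"  // benchmark composite
  | "value"        // capability per dollar
  | "capability"   // feature breadth
  | "context"      // normalized context window
  | "recency"      // time-decayed release-date score
  | "overall";     // blended value proposition

interface SignalScore {
  kind: SignalKind;
  score: number;        // 0-100, normalized against the current model population
  explanation?: string; // optional rationale surfaced on the model page
}

interface ScoredModel {
  id: string;
  compositeScore: number; // 0-100 weighted blend of the six dimensions
  signals: SignalScore[]; // six entries, one per SignalKind
}
```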
CompositeScore = (Capabilities × 0.25) + (Pricing × 0.25) + (Context × 0.15) + (Recency × 0.15) + (Output × 0.10) + (Versatility × 0.10)
Each dimension is independently normalized to 0–100 before weighting. The final composite is also clamped to 0–100. All normalization uses min-max scaling against the current model population, meaning scores are relative to the field—not absolute.
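Putting the weights and the normalization together, the composite calculation can be sketched as follows; the min-max helper and clamping mirror the description above, but the code itself is an illustrative assumption, not the production implementation.

```typescript
// Sketch: min-max normalize each raw dimension against the current model
// population, apply the published weights, and clamp the result to 0-100.
const WEIGHTS = {
  capabilities: 0.25,
  pricing: 0.25,
  context: 0.15,
  recency: 0.15,
  output: 0.1,
  versatility: 0.1,
} as const;

type Dimension = keyof typeof WEIGHTS;
type RawDimensions = Record<Dimension, number>;

function minMaxNormalize(value: number, min: number, max: number): number {
  if (max === min) return 50; // degenerate field: treat every model as average
  return ((value - min) / (max - min)) * 100;
}

function compositeScore(raw: RawDimensions, field: RawDimensions[]): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    const values = field.map((m) => m[dim]);
    const normalized = minMaxNormalize(raw[dim], Math.min(...values), Math.max(...values));
    total += normalized * WEIGHTS[dim];
  }
  return Math.min(100, Math.max(0, Math.round(total)));
}
```

Because every dimension is scaled against the current population, a model's composite can shift when new models enter the field, even if its own data has not changed.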
Scores are recalculated every hour using live data from the OpenRouter API. When a new model is released or pricing changes, the update is reflected within the next refresh cycle.
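A sketch of wiring that refresh into an hourly schedule, assuming a node-cron style scheduler; a hosted platform cron would work the same way.

```typescript
// Sketch: trigger the score refresh at the top of every hour.
// Assumes a node-cron style scheduler; the actual deployment may differ.
import cron from "node-cron";

async function refreshScores(): Promise<void> {
  // Fetch the OpenRouter catalog, dedupe listings, recompute dimension scores
  // and composites, then persist them (see the earlier sketches on this page).
  console.log(`Score refresh started at ${new Date().toISOString()}`);
}

cron.schedule("0 * * * *", () => {
  void refreshScores();
});
```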
Capabilities and pricing are the two most impactful factors for real-world model selection. A model needs strong features to be useful, and competitive pricing to be practical. Equal weighting ensures neither dominates - a cheap model with poor capabilities will not outrank a well-featured model at a fair price.
Yes. Free models receive the maximum score on the pricing dimension, which carries 25% of the composite weight. If they also have strong capabilities, a decent context window, and a recent release date, they can and do outrank more expensive models. This is by design - pricing accessibility matters.
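One way to realize this behavior is to give free models the top of the pricing scale outright and rank paid models by inverse blended cost; the blend and scaling below are illustrative choices, not the published formula.

```typescript
// Sketch: pricing dimension where free models get the maximum score and paid
// models are ranked by a blended per-token cost. The blend and scaling are
// illustrative assumptions, not the exact published formula.
interface ModelPricing {
  promptUsdPerMTok: number;     // USD per million input tokens
  completionUsdPerMTok: number; // USD per million output tokens
}

function pricingScore(p: ModelPricing, field: ModelPricing[]): number {
  const blended = (m: ModelPricing) => m.promptUsdPerMTok + m.completionUsdPerMTok;
  if (blended(p) === 0) return 100; // free models top the pricing dimension

  const paid = field.map(blended).filter((c) => c > 0);
  const cheapest = Math.min(...paid);
  const priciest = Math.max(...paid);
  if (priciest === cheapest) return 50;

  // Cheaper models score closer to 100, the most expensive closer to 0.
  return ((priciest - blended(p)) / (priciest - cheapest)) * 100;
}
```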
Each model is scored based on its canonical capabilities and the best available pricing across providers. The OpenRouter API aggregates availability from multiple endpoints, so our data reflects the most competitive offering for each model.