Analyzes the score-per-context-token ratio across 297 AI models to identify those that make the best use of their context window, output capacity, and cost.
Key efficiency metrics across all analyzed models.
Avg overall efficiency: 7.1% (normalized across all models).
Top 50 models ranked by score per million context tokens.
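As a rough illustration of how such a ranking can be computed, here is a minimal Python sketch; the model records, field order, and values are hypothetical examples, not taken from the report's dataset.

```python
# Hypothetical (name, score, context_tokens) records; the real dataset
# and its field names may differ.
models = [
    ("model-a", 63, 32_768),
    ("model-b", 71, 131_072),
    ("model-c", 81, 1_048_576),
]

# Rank by score per million context tokens (higher is more efficient).
ranked = sorted(
    models,
    key=lambda m: m[1] / (m[2] / 1_000_000),
    reverse=True,
)

for name, score, ctx in ranked[:50]:
    print(f"{name}: {score / (ctx / 1_000_000):,.1f} score per M context tokens")
```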
Efficiency breakdown across context window tiers.
Are bigger context windows correlated with higher scores?
| Tier | Avg Context (tokens) | Avg Score | Avg Efficiency (score/M tokens) |
|---|---|---|---|
| Small | 11K | 46 | 5448.1 |
| Medium | 50K | 55 | 1317.1 |
| Large | 191K | 71 | 413.8 |
| Mega | 1.1M | 81 | 74.8 |
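The tier averages above can be reproduced with a simple grouping pass. The sketch below assumes hypothetical tier cut-offs and model records, since the report does not state its exact boundaries.

```python
from statistics import mean

# Hypothetical (name, context_tokens, score) records.
MODELS = [
    ("model-a", 8_192, 46),
    ("model-b", 131_072, 71),
    ("model-c", 1_048_576, 81),
]

# Assumed tier boundaries; the report's exact cut-offs may differ.
TIERS = [
    ("Small", 0, 32_000),
    ("Medium", 32_000, 128_000),
    ("Large", 128_000, 512_000),
    ("Mega", 512_000, float("inf")),
]

def efficiency(score: float, context_tokens: int) -> float:
    """Score per million context tokens, as used in the tier table."""
    return score / (context_tokens / 1_000_000)

for tier, lo, hi in TIERS:
    rows = [(ctx, s) for _, ctx, s in MODELS if lo <= ctx < hi]
    if not rows:
        continue
    print(
        f"{tier}: avg context {mean(c for c, _ in rows):,.0f}, "
        f"avg score {mean(s for _, s in rows):.0f}, "
        f"avg efficiency {mean(efficiency(s, c) for c, s in rows):,.1f}"
    )
```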
Top 20 models by output efficiency (score per 1K output tokens). Models with 16K+ output tokens are highlighted.
| Model | Provider | Score | Max Output (tokens) | Output Eff. (score/1K output tokens) |
|---|---|---|---|---|
| Inflection 3 Pi | Inflection | 37 | 1K | 35.9 |
| Inflection 3 Productivity | Inflection | 37 | 1K | 35.9 |
| UI-TARS 7B | ByteDance | 63 | 2K | 30.6 |
| Gemma 2 27B | Google | 60 | 2K | 29.2 |
| MiniMax M2-her | MiniMax | 59 | 2K | 29.0 |
| Gemma 3n 2B (free) | Google | 58 | 2K | 28.4 |
| Gemma 3n 4B (free) | Google | 56 | 2K | 27.1 |
| Jamba Large 1.7 | AI21 Labs | 71 | 4K | 17.4 |
| GPT-4 Turbo | OpenAI | 61 | 4K | 14.8 |
| GPT-4o (2024-05-13) | OpenAI | 53 | 4K | 12.9 |
| Command R (08-2024) | Cohere | 48 | 4K | 11.9 |
| Command R+ (08-2024) | Cohere | 48 | 4K | 11.9 |
| Llemma 7b | EleutherAI | 48 | 4K | 11.6 |
| Nova Lite 1.0 | Amazon | 58 | 5K | 11.4 |
| Nova Pro 1.0 | Amazon | 58 | 5K | 11.4 |
| Command R7B (12-2024) | Cohere | 45 | 4K | 11.2 |
| Sonar Pro Search | Perplexity | 85 | 8K | 10.6 |
| Claude 3 Haiku | Anthropic | 43 | 4K | 10.5 |
| GPT-4 Turbo Preview | OpenAI | 43 | 4K | 10.4 |
| GPT-4 Turbo (older v1106) | OpenAI | 43 | 4K | 10.4 |
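Output efficiency here is the composite score divided by the output limit in thousands of tokens. A minimal sketch of that calculation, using hypothetical records and field names:

```python
# Hypothetical model records; field names are assumptions, not the
# report's schema.
models = [
    {"name": "example-small", "score": 37, "max_output_tokens": 1_024},
    {"name": "example-large", "score": 85, "max_output_tokens": 8_192},
]

# Output efficiency: score per 1,000 output tokens.
for m in models:
    m["output_eff"] = m["score"] / (m["max_output_tokens"] / 1_000)

top = sorted(models, key=lambda m: m["output_eff"], reverse=True)[:20]
for m in top:
    print(f'{m["name"]}: {m["output_eff"]:.1f} score per 1K output tokens')
```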
Auto-generated observations from the efficiency data.
Context Sweet Spot
Models in the Small context tier have the highest average efficiency, at 5,448.1 score per million context tokens across 16 models.
Output Matters
Models with 16K+ output tokens score 35% higher on average than models with smaller output limits.
Compact High Performers
No models achieve top-20 scores with a context window under 128K tokens.
Efficiency is measured as the score-per-context-token ratio: how much ranking score a model achieves relative to its context window size. Models that score highly with smaller context windows are considered more efficient than those requiring massive context to achieve similar results.
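In code, that ratio can be expressed as score per million context tokens; the function name and the two example models below are illustrative assumptions, not part of the report's pipeline.

```python
def context_efficiency(score: float, context_tokens: int) -> float:
    """Ranking score per million context tokens (higher = more efficient)."""
    return score / (context_tokens / 1_000_000)

# Two hypothetical models with the same score but different context windows:
print(round(context_efficiency(70, 32_768), 1))     # 2136.2
print(round(context_efficiency(70, 1_048_576), 1))  # 66.8
```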
Cost efficiency combines quality (composite score) with pricing. The most cost-efficient models achieve high benchmark scores while maintaining low per-token API costs. Free and budget-tier models that perform well are the most cost-efficient options.
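One way to express this is score per dollar of blended token price. The sketch below assumes a 75/25 input/output token mix and hypothetical prices; neither figure comes from the report.

```python
def cost_efficiency(score: float, input_price: float, output_price: float,
                    input_share: float = 0.75) -> float:
    """Score per dollar of blended per-million-token price.

    input_price / output_price are USD per million tokens; input_share is
    an assumed input/output mix, not a figure from the report.
    """
    blended = input_share * input_price + (1 - input_share) * output_price
    return float("inf") if blended == 0 else score / blended

# Hypothetical pricing: a budget model vs. a premium model.
print(cost_efficiency(58, 0.10, 0.40))   # high score per dollar
print(cost_efficiency(81, 3.00, 15.00))  # much lower score per dollar
```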
Not necessarily. Our efficiency analysis shows diminishing returns beyond certain context sizes. Models with 128K tokens often score similarly to those with 1M+ tokens, meaning the extra context capacity adds cost without proportional quality gains for most use cases.