AI models ranked by response latency and throughput. LLMs compared by time-to-first-token (ms) and output speed (tokens/sec). Data sourced from provider benchmarks and updated regularly.
Coding and language models sorted by time-to-first-token latency (lowest first).
AI model latency varies dramatically based on model size, architecture, and whether reasoning/chain-of-thought is enabled.
< 100ms
The fastest models available, optimized for real-time interactions. Small parameter counts and aggressive quantization deliver sub-100ms first-token latency. Includes models like Claude Haiku, Gemini Flash, and compact Qwen variants.
Best for: Autocomplete, code completion, chat UIs, real-time assistants
100ms — 500ms
Strong balance of speed and capability. These mid-sized models handle complex tasks while maintaining snappy response times. Includes models like Claude Sonnet, Codestral, and Llama Scout.
Best for: Production chatbots, coding assistants, search augmentation
500ms — 2s
Full-capability frontier models that prioritize output quality over raw speed. Most flagship models from major providers fall here. Includes GPT-5, Gemini Pro, and Grok.
Best for: Complex analysis, long-form generation, document processing
> 2s
Reasoning-heavy models that take extra time to think through problems. Chain-of-thought and extended reasoning add latency but significantly improve accuracy on hard tasks. Includes o3, DeepSeek R1, and Claude Opus.
Best for: Math, research, multi-step reasoning, agentic workflows
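The four tiers above can be sketched as a simple lookup. This is illustrative only: the tier names and the "Best for" summaries in the comments are paraphrases of the categories described above, not an official taxonomy.

```python
def latency_tier(ttft_ms: float) -> str:
    """Map a time-to-first-token measurement (in milliseconds) to a tier."""
    if ttft_ms < 100:
        return "real-time"   # autocomplete, code completion, chat UIs
    elif ttft_ms < 500:
        return "balanced"    # production chatbots, coding assistants
    elif ttft_ms < 2000:
        return "frontier"    # complex analysis, long-form generation
    else:
        return "reasoning"   # math, research, multi-step agentic work
```

For example, a measured TTFT of 80ms lands in the real-time tier, while 1,400ms lands in the frontier tier.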
The fastest models (sub-100ms) like Claude Haiku and Gemini Flash are optimized for speed with smaller parameter counts. They score well on simple tasks but trail frontier models on complex reasoning, multi-step coding, and nuanced analysis. For most production chatbot and autocomplete workloads, this trade-off is worthwhile.
Models in the 200-500ms range, like Claude Sonnet, Codestral, and Llama 4 Scout, deliver 85-95% of frontier quality at 3-5x the speed. These are the workhorses for production applications where both quality and responsiveness matter.
Models like o3 and DeepSeek R1 use chain-of-thought reasoning that adds 2-10x latency. This is by design: the extra "thinking time" produces significantly better results on math, logic, and multi-step problems. Use these when accuracy matters more than speed.
If you are processing thousands of requests in parallel, tokens per second (throughput) is a better metric than time-to-first-token. A model with high throughput but moderate latency will finish large batch jobs faster than an ultra-low-latency model with lower throughput.
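A back-of-the-envelope estimate makes this concrete. With all numbers hypothetical, total batch wall-clock time is dominated by tokens divided by throughput, so a slower-to-start but higher-throughput model finishes first:

```python
import math

def batch_seconds(n_requests: int, tokens_per_request: int,
                  ttft_s: float, tokens_per_sec: float,
                  parallelism: int) -> float:
    """Rough wall-clock estimate for a parallel batch job:
    requests run in waves of size `parallelism`, and each request
    takes TTFT plus generation time."""
    per_request = ttft_s + tokens_per_request / tokens_per_sec
    waves = math.ceil(n_requests / parallelism)
    return waves * per_request

# Hypothetical models: low-latency vs. high-throughput,
# 10,000 requests of 500 output tokens, 100 concurrent requests.
low_latency   = batch_seconds(10_000, 500, ttft_s=0.08, tokens_per_sec=60,  parallelism=100)
high_throughput = batch_seconds(10_000, 500, ttft_s=0.60, tokens_per_sec=180, parallelism=100)
```

Under these assumed numbers, the high-throughput model completes the batch in roughly 40% of the time despite a TTFT seven times higher, because per-request time is almost entirely generation time.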
The same model can have different latency depending on which API provider you use and current load. Peak hours (US business hours) typically see 20-50% higher latency. Consider using multiple providers or routing services for consistent performance.
Even high-latency models feel responsive when streaming tokens. A model with 1.2s first-token latency but high throughput can feel faster than a 300ms model with low throughput. Enable streaming for all user-facing applications.
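A quick check with hypothetical numbers shows why. With streaming, the time until a reader sees the first n tokens is TTFT plus n divided by throughput:

```python
def time_to_n_tokens(ttft_s: float, tokens_per_sec: float, n: int = 50) -> float:
    """Seconds until n streamed tokens are visible on screen."""
    return ttft_s + n / tokens_per_sec

# Hypothetical: 1.2s TTFT at 200 tok/s vs. 0.3s TTFT at 25 tok/s
slow_start = time_to_n_tokens(1.2, 200)  # 1.2 + 50/200 = 1.45 s
fast_start = time_to_n_tokens(0.3, 25)   # 0.3 + 50/25  = 2.30 s
```

With these assumed figures, the model that starts later has the first 50 tokens on screen almost a second sooner, which is what the reader actually perceives.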
Compare models across speed, quality, pricing, and more. Find the right model for your specific use case.