AI models ranked by response latency and throughput. 50 LLMs compared by time-to-first-token (ms) and output speed (tokens/sec). Data sourced from provider benchmarks and updated regularly.
Fastest model: Qwen3 VL 30B A3B Thinking (304ms)
Average latency: 949ms across 50 LLMs
Peak output speed: 315 tok/s
Median latency: 904ms (50th percentile)
Coding and language models sorted by time-to-first-token latency (lowest first).
Generation time for media models (typically measured as full generation, not time-to-first-token).
AI model latency varies dramatically based on model size, architecture, and whether reasoning/chain-of-thought is enabled.
< 100ms
The fastest models available, optimized for real-time interactions. Small parameter counts and aggressive quantization deliver sub-100ms first-token latency. Includes models like Claude Haiku, Gemini Flash, and compact Qwen variants.
Best for: Autocomplete, code completion, chat UIs, real-time assistants
100ms - 500ms
Strong balance of speed and capability. These mid-sized models handle complex tasks while maintaining snappy response times. Includes models like Claude Sonnet, Codestral, and Llama 4 Scout.
Best for: Production chatbots, coding assistants, search augmentation
500ms - 2s
Full-capability frontier models that prioritize output quality over raw speed. Most flagship models from major providers fall here. Includes GPT-5, Gemini Pro, and Grok.
Best for: Complex analysis, long-form generation, document processing
> 2s
Reasoning-heavy models that take extra time to think through problems. Chain-of-thought and extended reasoning add latency but significantly improve accuracy on hard tasks. Includes o3, DeepSeek R1, and Claude Opus.
Best for: Math, research, multi-step reasoning, agentic workflows
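The tier boundaries above can be sketched as a small lookup. The short tier names in the return values are informal shorthand for the descriptions on this page, not official labels:

```python
def latency_tier(ttft_ms: float) -> str:
    """Map a measured time-to-first-token (ms) to the four tiers above."""
    if ttft_ms < 100:
        return "real-time (< 100ms)"
    if ttft_ms < 500:
        return "balanced (100ms - 500ms)"
    if ttft_ms <= 2000:
        return "frontier (500ms - 2s)"
    return "reasoning (> 2s)"

# Example: a 304ms time-to-first-token lands in the balanced tier
print(latency_tier(304))
```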
The fastest models (sub-100ms) like Claude Haiku and Gemini Flash are optimized for speed with smaller parameter counts. They score well on simple tasks but trail frontier models on complex reasoning, multi-step coding, and nuanced analysis. For most production chatbot and autocomplete workloads, this trade-off is worthwhile.
Models in the 200-500ms range -- like Claude Sonnet, Codestral, and Llama 4 Scout -- deliver 85-95% of frontier quality at 3-5x the speed. These are the workhorses for production applications where both quality and responsiveness matter.
Models like o3 and DeepSeek R1 use chain-of-thought reasoning that adds 2-10x latency. This is by design: the extra "thinking time" produces significantly better results on math, logic, and multi-step problems. Use these when accuracy matters more than speed.
If you are processing thousands of requests in parallel, tokens per second (throughput) is a better metric than time-to-first-token. A model with high throughput but moderate latency will finish large batch jobs faster than an ultra-low-latency model with lower throughput.
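To see why throughput dominates batch work, model per-request time as TTFT + output_tokens / throughput and run requests in waves of parallel calls. All numbers below are illustrative, not benchmarks from this page:

```python
def batch_seconds(requests: int, concurrency: int, ttft_s: float,
                  out_tokens: int, tok_per_s: float) -> float:
    """Rough batch completion time: waves of `concurrency` parallel requests,
    each taking TTFT plus generation time."""
    per_request = ttft_s + out_tokens / tok_per_s
    waves = -(-requests // concurrency)  # ceiling division
    return waves * per_request

# 10,000 requests, 500 output tokens each, 100-way parallelism (hypothetical figures)
low_latency = batch_seconds(10_000, 100, ttft_s=0.1, out_tokens=500, tok_per_s=40)
high_throughput = batch_seconds(10_000, 100, ttft_s=0.9, out_tokens=500, tok_per_s=200)
print(f"low-latency model:     {low_latency:.0f}s")      # 100 waves x 12.6s = 1260s
print(f"high-throughput model: {high_throughput:.0f}s")  # 100 waves x 3.4s  = 340s
```

Despite a 9x worse time-to-first-token, the high-throughput model finishes the batch in roughly a quarter of the time.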
The same model can have different latency depending on which API provider you use and current load. Peak hours (US business hours) typically see 20-50% higher latency. Consider using multiple providers or routing services for consistent performance.
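One way to route around provider load is to track recent latency per provider and send each request to the currently fastest one. A minimal sketch using only the standard library; the provider names and window size are purely illustrative:

```python
import random
from collections import deque

class LatencyRouter:
    """Pick the provider with the lowest recent average latency."""

    def __init__(self, providers, window=50):
        # Keep a sliding window of recent latency samples per provider.
        self.history = {p: deque(maxlen=window) for p in providers}

    def pick(self) -> str:
        # Try providers with no data yet first, then lowest recent mean.
        untried = [p for p, h in self.history.items() if not h]
        if untried:
            return random.choice(untried)
        return min(self.history,
                   key=lambda p: sum(self.history[p]) / len(self.history[p]))

    def record(self, provider: str, latency_ms: float) -> None:
        self.history[provider].append(latency_ms)

router = LatencyRouter(["provider-a", "provider-b"])
router.record("provider-a", 320)
router.record("provider-b", 540)
router.record("provider-a", 310)
print(router.pick())  # provider-a, whose recent average (315ms) is lower
```

The sliding window means a provider that degrades during peak hours is demoted automatically as slow samples displace fast ones.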
Even high-latency models feel responsive when streaming tokens. A model with 1.2s first-token latency but high throughput can deliver its full answer sooner than a 300ms model with low throughput, and often feels faster overall. Enable streaming for all user-facing applications.
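Under a simple linear streaming model, the time until the n-th token appears is TTFT + n / throughput. The numbers below are illustrative only, but they show how a slow-start, high-throughput model overtakes a fast-start, low-throughput one on a longer answer:

```python
def time_to_token(n: int, ttft_s: float, tok_per_s: float) -> float:
    """Seconds until the n-th streamed token appears (linear model)."""
    return ttft_s + n / tok_per_s

# Finish times for a 400-token answer (hypothetical figures):
slow_start = time_to_token(400, ttft_s=1.2, tok_per_s=300)  # ~2.5s total
fast_start = time_to_token(400, ttft_s=0.3, tok_per_s=40)   # ~10.3s total
print(f"1.2s TTFT @ 300 tok/s: {slow_start:.1f}s")
print(f"0.3s TTFT @ 40 tok/s:  {fast_start:.1f}s")
```

The fast-start model wins on the first few tokens, but the high-throughput model completes the answer about four times sooner.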
Compare models across speed, quality, pricing, and more. Find the right model for your specific use case.
The fastest AI models by time-to-first-token include Groq-hosted Llama (under 100ms TTFT), Gemini Flash, and GPT-4o Mini. Speed depends on both the model architecture and the inference provider.
Not necessarily. Some models like Gemini Flash offer both speed and quality. However, reasoning models (o3, DeepSeek R1) are intentionally slower because they spend more compute time thinking, which improves accuracy on hard problems.
Three main factors: model size (smaller = faster), inference hardware (GPUs, custom chips), and provider optimization (batching, quantization, speculative decoding). The same model can be 5-10x faster on different providers.