AI models ranked by response latency and throughput. 50 LLMs compared by time-to-first-token (ms) and output speed (tokens/sec). Data sourced from provider benchmarks and updated regularly.
Fastest model: Qwen3 VL 30B A3B Thinking (304ms)
Average latency: 949ms across 50 LLMs
Peak output speed: 315 tok/s
Median latency: 904ms (50th percentile)
Coding and language models sorted by time-to-first-token latency (lowest first).
Generation time for media models (typically measured as full generation, not time-to-first-token).
AI model latency varies dramatically based on model size, architecture, and whether reasoning/chain-of-thought is enabled.
< 100ms
The fastest models available, optimized for real-time interactions. Small parameter counts and aggressive quantization deliver sub-100ms first-token latency. Includes models like Claude Haiku, Gemini Flash, and compact Qwen variants.
Best for: Autocomplete, code completion, chat UIs, real-time assistants
100ms - 500ms
Strong balance of speed and capability. These mid-sized models handle complex tasks while maintaining snappy response times. Includes models like Claude Sonnet, Codestral, and Llama 4 Scout.
Best for: Production chatbots, coding assistants, search augmentation
500ms - 2s
Full-capability frontier models that prioritize output quality over raw speed. Most flagship models from major providers fall here. Includes GPT-5, Gemini Pro, and Grok.
Best for: Complex analysis, long-form generation, document processing
> 2s
Reasoning-heavy models that take extra time to think through problems. Chain-of-thought and extended reasoning add latency but significantly improve accuracy on hard tasks. Includes o3, DeepSeek R1, and Claude Opus.
Best for: Math, research, multi-step reasoning, agentic workflows
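The tier boundaries above can be sketched as a small lookup. The short tier names in the return values are informal shorthand for the descriptions on this page, not official labels:

```python
def latency_tier(ttft_ms: float) -> str:
    """Map a measured time-to-first-token (ms) to the four tiers above."""
    if ttft_ms < 100:
        return "real-time (< 100ms)"
    if ttft_ms < 500:
        return "balanced (100ms - 500ms)"
    if ttft_ms <= 2000:
        return "frontier (500ms - 2s)"
    return "reasoning (> 2s)"

# Example: a 304ms time-to-first-token lands in the balanced tier
print(latency_tier(304))
```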
The fastest models (sub-100ms) like Claude Haiku and Gemini Flash are optimized for speed with smaller parameter counts. They score well on simple tasks but trail frontier models on complex reasoning, multi-step coding, and nuanced analysis. For most production chatbot and autocomplete workloads, this trade-off is worthwhile.
Models in the 200-500ms range -- like Claude Sonnet, Codestral, and Llama 4 Scout -- deliver 85-95% of frontier quality at 3-5x the speed. These are the workhorses for production applications where both quality and responsiveness matter.
Models like o3 and DeepSeek R1 use chain-of-thought reasoning that adds 2-10x latency. This is by design: the extra "thinking time" produces significantly better results on math, logic, and multi-step problems. Use these when accuracy matters more than speed.
If you are processing thousands of requests in parallel, tokens per second (throughput) is a better metric than time-to-first-token. A model with high throughput but moderate latency will finish large batch jobs faster than an ultra-low-latency model with lower throughput.
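To see why throughput dominates batch work, model per-request time as TTFT + output_tokens / throughput and run requests in waves of parallel calls. All numbers below are illustrative, not benchmarks from this page:

```python
def batch_seconds(requests: int, concurrency: int, ttft_s: float,
                  out_tokens: int, tok_per_s: float) -> float:
    """Rough batch completion time: waves of `concurrency` parallel requests,
    each taking TTFT plus generation time."""
    per_request = ttft_s + out_tokens / tok_per_s
    waves = -(-requests // concurrency)  # ceiling division
    return waves * per_request

# 10,000 requests, 500 output tokens each, 100-way parallelism (hypothetical figures)
low_latency = batch_seconds(10_000, 100, ttft_s=0.1, out_tokens=500, tok_per_s=40)
high_throughput = batch_seconds(10_000, 100, ttft_s=0.9, out_tokens=500, tok_per_s=200)
print(f"low-latency model:     {low_latency:.0f}s")      # 100 waves x 12.6s = 1260s
print(f"high-throughput model: {high_throughput:.0f}s")  # 100 waves x 3.4s  = 340s
```

Despite a 9x worse time-to-first-token, the high-throughput model finishes the batch in roughly a quarter of the time.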
The same model can have different latency depending on which API provider you use and current load. Peak hours (US business hours) typically see 20-50% higher latency. Consider using multiple providers or routing services for consistent performance.
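One way to route around provider load is to track recent latency per provider and send each request to the currently fastest one. A minimal sketch using only the standard library; the provider names and window size are purely illustrative:

```python
import random
from collections import deque

class LatencyRouter:
    """Pick the provider with the lowest recent average latency."""

    def __init__(self, providers, window=50):
        # Keep a sliding window of recent latency samples per provider.
        self.history = {p: deque(maxlen=window) for p in providers}

    def pick(self) -> str:
        # Try providers with no data yet first, then lowest recent mean.
        untried = [p for p, h in self.history.items() if not h]
        if untried:
            return random.choice(untried)
        return min(self.history,
                   key=lambda p: sum(self.history[p]) / len(self.history[p]))

    def record(self, provider: str, latency_ms: float) -> None:
        self.history[provider].append(latency_ms)

router = LatencyRouter(["provider-a", "provider-b"])
router.record("provider-a", 320)
router.record("provider-b", 540)
router.record("provider-a", 310)
print(router.pick())  # provider-a, whose recent average (315ms) is lower
```

The sliding window means a provider that degrades during peak hours is demoted automatically as slow samples displace fast ones.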
Even high-latency models feel responsive when streaming tokens. A model with 1.2s first-token latency but high throughput can deliver its full answer sooner than a 300ms model with low throughput, and often feels faster overall. Enable streaming for all user-facing applications.
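Under a simple linear streaming model, the time until the n-th token appears is TTFT + n / throughput. The numbers below are illustrative only, but they show how a slow-start, high-throughput model overtakes a fast-start, low-throughput one on a longer answer:

```python
def time_to_token(n: int, ttft_s: float, tok_per_s: float) -> float:
    """Seconds until the n-th streamed token appears (linear model)."""
    return ttft_s + n / tok_per_s

# Finish times for a 400-token answer (hypothetical figures):
slow_start = time_to_token(400, ttft_s=1.2, tok_per_s=300)  # ~2.5s total
fast_start = time_to_token(400, ttft_s=0.3, tok_per_s=40)   # ~10.3s total
print(f"1.2s TTFT @ 300 tok/s: {slow_start:.1f}s")
print(f"0.3s TTFT @ 40 tok/s:  {fast_start:.1f}s")
```

The fast-start model wins on the first few tokens, but the high-throughput model completes the answer about four times sooner.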
Compare models across speed, quality, pricing, and more. Find the right model for your specific use case.
The fastest AI models by time-to-first-token include Groq-hosted Llama (under 100ms TTFT), Gemini Flash, and GPT-4o Mini. Speed depends on both the model architecture and the inference provider.
Not necessarily. Some models like Gemini Flash offer both speed and quality. However, reasoning models (o3, DeepSeek R1) are intentionally slower because they spend more compute time thinking, which improves accuracy on hard problems.
Three main factors: model size (smaller = faster), inference hardware (GPUs, custom chips), and provider optimization (batching, quantization, speculative decoding). The same model can be 5-10x faster on different providers.