The top AI models for Retrieval-Augmented Generation, ranked by a RAG-weighted composite score. Models are scored with bonuses for large context windows (fitting more retrieved chunks), structured JSON output (parsing extracted data), function calling (tool-based retrieval), and streaming (real-time answers). Updated hourly from 298+ models.
- **293** total models
- **225** with 128K+ context
- **226** with JSON mode
- **217** with function calling
- **24** free models
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | GPT-5.4 Pro | OpenAI | 114 |
| 2 | GPT-5.2 Pro | OpenAI | 113 |
| 3 | GPT-5 Pro | OpenAI | 113 |
| 4 | o3 Pro | OpenAI | 105 |
| 5 | Claude Opus 4.1 | Anthropic | 104 |
| 6 | o3 Deep Research | OpenAI | 97 |
| 7 | o1-pro | OpenAI | 95 |
| 8 | Claude Opus 4.6 | Anthropic | 94 |
| 9 | Claude Opus 4 | Anthropic | 94 |
| 10 | Claude Opus 4.5 | Anthropic | 93 |
| 11 | GPT-5.4 | OpenAI | 93 |
| 12 | Claude Sonnet 4.5 | Anthropic | 92 |
| 13 | Qwen3 VL 30B A3B Thinking | Alibaba | 92 |
| 14 | Qwen3 VL 235B A22B Thinking | Alibaba | 92 |
| 15 | GPT-5.2 | OpenAI | 91 |
| 16 | Gemini 3.1 Pro Preview Custom Tools | Google | 91 |
| 17 | Gemini 3.1 Pro Preview | Google | 91 |
| 18 | Gemini 3 Pro Preview | Google | 91 |
| 19 | Claude Sonnet 4.6 | Anthropic | 91 |
| 20 | GPT-5.1 | OpenAI | 90 |
| 21 | GPT-5.3-Codex | OpenAI | 90 |
| 22 | GPT-5.2-Codex | OpenAI | 90 |
| 23 | GPT-5 | OpenAI | 90 |
| 24 | Gemini 3 Flash Preview | Google | 89 |
| 25 | o4 Mini Deep Research | OpenAI | 89 |
| 26 | GPT-5.1-Codex-Max | OpenAI | 89 |
| 27 | Gemini 3.1 Flash Lite Preview | Google | 89 |
| 28 | Gemini 2.5 Pro | Google | 89 |
| 29 | Gemini 2.5 Flash Lite Preview 09-2025 | Google | 88 |
| 30 | o1 | OpenAI | 88 |
RAG pipelines retrieve relevant chunks from a knowledge base and inject them into the prompt. Models with 128K+ token context windows can fit more retrieved passages alongside the user query, reducing information loss and improving answer quality. Larger context also enables multi-document synthesis across dozens of retrieved chunks simultaneously.
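The context-budget tradeoff can be sketched as a greedy packing step: add retrieved chunks, highest-ranked first, until the window is full. This is a minimal illustration, not a production retriever — the 4-characters-per-token heuristic is a rough assumption, and a real pipeline would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real pipeline should count with the model's actual tokenizer.
    return max(1, len(text) // 4)

def pack_chunks(chunks: list[str], budget_tokens: int) -> list[str]:
    """Greedily keep retrieved chunks (highest-ranked first) until the budget is spent."""
    packed, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break  # later (lower-ranked) chunks are dropped
        packed.append(chunk)
        used += cost
    return packed

# Three ~100-token chunks: a small window keeps two, a large one keeps all three.
chunks = ["a" * 400, "b" * 400, "c" * 400]
print(len(pack_chunks(chunks, budget_tokens=250)))   # 2
print(len(pack_chunks(chunks, budget_tokens=1000)))  # 3
```

A 128K+ window simply moves the `break` much further down the ranked list, which is why large-context models can synthesize across dozens of chunks at once.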
JSON mode ensures the model returns well-formed structured data instead of free-text prose. For RAG applications, this is critical when extracting entities, citations, or metadata from retrieved documents. Structured output makes it easy to parse responses, populate UIs, and feed results into downstream systems reliably.
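A minimal sketch of the downstream parsing step: validate a JSON-mode response before feeding it into a UI or database. The field names (`answer`, `citations`) and the sample response string are invented for illustration — your extraction schema will differ.

```python
import json

REQUIRED_FIELDS = {"answer", "citations"}  # hypothetical extraction schema

def parse_rag_response(raw: str) -> dict:
    """Parse a JSON-mode response and check it contains the fields we asked for."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"response missing fields: {missing}")
    return data

# Simulated JSON-mode output from a model (illustrative only).
raw = '{"answer": "The warranty lasts 2 years.", "citations": ["doc_17#p3"]}'
parsed = parse_rag_response(raw)
print(parsed["citations"])  # ['doc_17#p3']
```

Because JSON mode guarantees well-formed output, the `json.loads` call is the only failure mode you need to guard against at the schema level rather than with brittle regex extraction from prose.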
Function calling lets the model invoke retrieval tools dynamically — querying vector databases, searching knowledge bases, or fetching documents mid-conversation. This enables agentic RAG architectures where the model decides what to retrieve, how many chunks to pull, and when to do follow-up searches for better answers.
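The dispatch side of an agentic retrieval loop can be sketched as below. The tool name `search_kb`, the stub knowledge base, and the `{"name": ..., "arguments": ...}` call shape are assumptions for illustration — real provider SDKs each have their own tool-call format.

```python
# Stub knowledge base standing in for a vector store or search index.
KB = {
    "returns": "Items may be returned within 30 days.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def search_kb(query: str) -> str:
    """Hypothetical retrieval tool the model can invoke mid-conversation."""
    return KB.get(query, "no match")

TOOLS = {"search_kb": search_kb}

def run_tool_call(call: dict) -> str:
    """Dispatch a model-emitted tool call of the form {'name': ..., 'arguments': {...}}."""
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A model deciding what to retrieve would emit a call like this;
# the result is appended to the conversation for the next generation step.
result = run_tool_call({"name": "search_kb", "arguments": {"query": "returns"}})
print(result)  # Items may be returned within 30 days.
```

In a full agentic loop this dispatch runs inside a while-loop: the model emits calls, the app executes them, and generation continues until the model answers without requesting another tool.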
RAG applications process large volumes of tokens per query — retrieved chunks plus the question plus the generated answer. At scale, input and output token costs add up fast. Models with competitive per-million-token pricing let you run RAG pipelines in production without excessive API bills, especially for high-traffic document Q&A systems.
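The cost arithmetic is worth making concrete. A quick sketch, using made-up per-million-token prices (the $2.50 / $10.00 figures below are illustrative, not any vendor's real pricing):

```python
def query_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost of one RAG query at per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# Illustrative query: ~6K input tokens (retrieved chunks + question),
# 500-token answer, at hypothetical $2.50 in / $10.00 out per million tokens.
per_query = query_cost(6_000, 500, 2.50, 10.00)
print(round(per_query, 4))            # 0.02
print(round(per_query * 100_000, 2))  # 2000.0 -- daily bill at 100K queries/day
```

Note how input tokens dominate: retrieved context is typically 10x or more the size of the answer, so input pricing usually matters more than output pricing for RAG workloads.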
Discover models by specific RAG capabilities, or compare top models head-to-head on the full leaderboard.