Open-source coding models you can run on your own hardware. No API keys, no per-token costs, no data leaving your machine. Ranked by our composite scoring system across 177 locally-runnable coding models, updated hourly.
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview Custom Tools | Google | 89 |
| 2 | Qwen3.5-Flash | Alibaba | 89 |
| 3 | Nemotron 3 Super (free) | NVIDIA | 88 |
| 4 | Seed-2.0-Lite | ByteDance | 88 |
| 5 | Qwen3.5-35B-A3B | Alibaba | 87 |
| 6 | Qwen3.5-27B | Alibaba | 87 |
| 7 | Qwen3.5-122B-A10B | Alibaba | 87 |
| 8 | Qwen3.5 397B A17B | Alibaba | 87 |
| 9 | Kimi K2.5 | Moonshot AI | 87 |
| 10 | Qwen3 VL 8B Thinking | Alibaba | 85 |
| 11 | Qwen3 VL 30B A3B Thinking | Alibaba | 85 |
| 12 | Qwen3 VL 235B A22B Thinking | Alibaba | 85 |
| 13 | MiniMax M2.5 (free) | MiniMax | 80 |
| 14 | MiniMax M2.5 | MiniMax | 80 |
| 15 | MiniMax M2 | MiniMax | 80 |
| 16 | MiMo-V2-Flash | Xiaomi | 79 |
| 17 | Trinity Mini | arcee-ai | 79 |
| 18 | Nemotron Nano 12B 2 VL (free) | NVIDIA | 79 |
| 19 | Tongyi DeepResearch 30B A3B | Alibaba | 79 |
| 20 | Qwen3 235B A22B Thinking 2507 | Alibaba | 79 |
Local LLMs keep your code on your machine — no data sent to external servers. You get zero per-token cost after setup, full offline capability, and no rate limits. Ideal for proprietary codebases, air-gapped environments, or developers who want complete control over their AI toolchain.
A 7B model at Q4 quantization runs comfortably with 6 GB VRAM (RTX 3060). For 13B-34B models, aim for 12-24 GB VRAM (RTX 4090, A5000). Larger 70B+ models need 48+ GB across multiple GPUs or specialized hardware. CPU-only inference works but is 5-10x slower.
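The sizing guidance above can be sketched as a back-of-the-envelope calculation. This is a rough rule of thumb, not an exact formula: quantized weights take roughly `params × bits-per-weight / 8` bytes, plus overhead for the KV cache, activations, and runtime buffers (the ~20% overhead figure here is an assumption for illustration).

```python
# Rough VRAM estimate for running a quantized model locally.
# Assumption: weights need params * bits_per_weight / 8 bytes, plus
# ~20% extra for KV cache, activations, and runtime buffers.

def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 0.20) -> float:
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    total_bytes = weight_bytes * (1 + overhead)
    return total_bytes / 1e9  # decimal gigabytes

# A 7B model at ~4.5 effective bits/weight (typical of GGUF Q4 variants)
print(f"7B  @ Q4: ~{estimate_vram_gb(7, 4.5):.1f} GB")
# A 34B model at the same quantization
print(f"34B @ Q4: ~{estimate_vram_gb(34, 4.5):.1f} GB")
```

The 7B estimate lands around 4-5 GB, consistent with the 6 GB recommendation once you leave headroom for a longer context window.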
Quantization shrinks models to fit consumer hardware. GGUF (llama.cpp's format) is the most popular for local use, offering Q4, Q5, and Q8 variants. GPTQ and AWQ are GPU-optimized alternatives. Lower-bit quantization (Q4) trades a small amount of quality for a much smaller memory footprint — often the sweet spot for coding tasks.
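The memory-footprint trade-off is easy to see by comparing approximate file sizes across quant levels. The effective bits-per-weight figures below are ballpark values for GGUF K-quants (exact numbers vary by tensor mix), used here purely for illustration.

```python
# Approximate effective bits per weight for common GGUF quant types.
# These are ballpark figures (assumptions); exact values depend on
# which tensors get which sub-quantization.
GGUF_BITS = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def file_size_gb(params_billions: float, quant: str) -> float:
    """Estimated on-disk size of a quantized model in decimal GB."""
    return params_billions * 1e9 * GGUF_BITS[quant] / 8 / 1e9

for quant in GGUF_BITS:
    print(f"7B {quant}: ~{file_size_gb(7, quant):.1f} GB")
```

A 7B model drops from ~14 GB at F16 to roughly 4 GB at Q4, which is why Q4 variants dominate consumer-hardware use.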
Ollama is the fastest way to get started — install it, pull a model, and go. llama.cpp gives maximum control and the best CPU performance. vLLM is ideal for GPU serving with high throughput. For IDE integration, Continue.dev connects to any local endpoint. All of these tools are free and open source.
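Wiring Continue.dev to a local endpoint is a short config change. A minimal sketch, using Continue's `config.json` model-entry format and assuming Ollama is serving on its default port (the model tag here is just an example):

```json
{
  "models": [
    {
      "title": "Local Coder (Ollama)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```

With this entry in place, the model appears in Continue's model picker inside your IDE and all completions stay on your machine.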
Based on our composite scoring that evaluates benchmarks, code quality, and real-world performance, Gemini 3.1 Pro Preview Custom Tools currently leads our local coding LLM rankings with a score of 89. Other top local models include Qwen3.5-Flash, Nemotron 3 Super (free), and Seed-2.0-Lite. All of these can be downloaded and run on your own hardware using tools like Ollama, llama.cpp, or vLLM.
It depends on the model size and quantization. A 7B parameter model at Q4 quantization needs roughly 4-6 GB of VRAM, making it runnable on most modern GPUs. A 13B model needs 8-10 GB, and 34B+ models typically require 16-24 GB or more. CPU-only inference is possible with llama.cpp but significantly slower. For the best experience, an NVIDIA RTX 3060 (12 GB) or RTX 4090 (24 GB) is recommended.
Ollama is the easiest way to get started — it handles model downloading, quantization, and serving with a single command. Just install Ollama, run `ollama pull codellama` (or any supported model), and start chatting. For IDE integration, extensions like Continue.dev can connect to your local Ollama instance. More advanced users can use llama.cpp for maximum performance tuning or vLLM for high-throughput serving.
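Beyond the CLI, Ollama also exposes a local REST API (`POST /api/generate` on port 11434 by default), which is how tools like Continue.dev talk to it. A minimal sketch, assuming Ollama is running and the model has already been pulled:

```python
# Query a local Ollama server over its REST API.
# Assumes `ollama serve` is running and the model was pulled beforehand.
import json
import urllib.request

def build_payload(prompt: str, model: str = "codellama") -> dict:
    # stream=False returns one JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def ask_ollama(prompt: str, model: str = "codellama",
               base_url: str = "http://localhost:11434") -> str:
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama instance):
# print(ask_ollama("Write a Python function that reverses a string."))
```

The same endpoint works for any pulled model — swap the `model` field for whatever you installed.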
The gap has narrowed significantly. Top open-source coding models like DeepSeek Coder V2, CodeLlama 70B, and Qwen2.5-Coder perform competitively on benchmarks like HumanEval and SWE-bench. For many everyday coding tasks — autocompletion, refactoring, writing tests, explaining code — local models are excellent. Cloud models still tend to have an edge on very complex multi-step reasoning and large-codebase understanding, but local models offer unbeatable privacy and zero ongoing cost.
Explore more model rankings, compare specific models head-to-head, or filter by capabilities on the full leaderboard.