Can your GPU run it? Find out which open-source models fit your hardware — VRAM requirements for every quantization level, with CPU offloading estimates.
Optional. Set to 0 for GPU-only.
Affects KV cache VRAM overhead.
+ 32 more models that don't fit
For models with HuggingFace GGUF data, estimates are based on actual file sizes and are very accurate. For other models, we calculate from parameter counts using standard quant ratios. KV cache uses architecture details when available. Real-world VRAM may vary by ~5-10% depending on runtime, batch size, and framework.
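The parameter-count fallback described above can be sketched roughly as follows. The bits-per-weight table, the example architecture numbers, and the 10% overhead factor are illustrative assumptions, not the calculator's exact constants:

```python
# Approximate bits per weight for common GGUF quantizations (assumed ballpark values)
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q2_K": 2.6,
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Weight memory estimated from parameter count and quant bit-rate."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1024**3

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache: keys + values for every layer at full context (FP16 cache)."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1024**3

# Example: an 8B model with a Llama-3-like architecture at Q4_K_M, 8k context
weights = model_size_gb(8.0, "Q4_K_M")
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
total = (weights + kv) * 1.1  # ~10% runtime overhead (assumed)
print(f"weights {weights:.1f} GB + KV cache {kv:.1f} GB -> ~{total:.1f} GB total")
```

With GGUF file data available, the `model_size_gb` step is replaced by the actual file size; only the KV cache and overhead still need to be estimated.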
The sweet spot is the quantization level with the best quality-per-GB ratio. For most models, this is Q4_K_M — it preserves ~92% of full-precision quality while being ~4x smaller than FP16. Quality degrades noticeably below Q4; above Q5, the extra size buys diminishing returns.
When a model doesn't fit entirely in GPU VRAM, the remaining layers can be offloaded to system RAM and computed by the CPU. The GPU processes its layers at full speed, then waits while the CPU works through the offloaded ones. The slowdown scales with the offloaded fraction — ~20% offloaded means roughly 2x slower generation, ~40% means 3-5x slower.
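Those slowdown figures can be approximated with a simple bandwidth model: token generation is dominated by reading each weight once per token, so each device's share of the weights divided by its memory bandwidth gives its time per token. The bandwidth numbers below (360 GB/s for a mid-range GPU, 60 GB/s for dual-channel DDR5) are illustrative assumptions, not measurements:

```python
def offload_slowdown(offload_frac: float,
                     gpu_bw_gbs: float = 360.0,  # assumed mid-range GPU VRAM bandwidth
                     cpu_bw_gbs: float = 60.0) -> float:
    """Slowdown vs. GPU-only, assuming per-token time is weights-read time
    on each device: time ∝ fraction_of_weights / memory_bandwidth."""
    t_gpu_only = 1.0 / gpu_bw_gbs
    t_mixed = (1 - offload_frac) / gpu_bw_gbs + offload_frac / cpu_bw_gbs
    return t_mixed / t_gpu_only

for frac in (0.0, 0.2, 0.4):
    print(f"{frac:.0%} offloaded -> ~{offload_slowdown(frac):.1f}x slower")
```

With these assumed bandwidths, 20% offload comes out to ~2x and 40% to ~3x — in line with the rough figures quoted above; a slower GPU-to-RAM bandwidth ratio pushes the 40% case toward the 5x end.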
Yes. Apple Silicon uses unified memory, so the "VRAM" is your total RAM. Select your Mac from the GPU dropdown. Note that Macs use Metal/MLX instead of CUDA, so performance differs from NVIDIA cards.
This calculator focuses on the GGUF quantizations used by llama.cpp (Q2_K through FP16). GGUF is the most common format for local LLM inference. EXL2, AWQ, and GPTQ formats are not yet covered.