How Much GPU Memory Does Your Model Need?
Calculate VRAM requirements for any LLM based on parameter count and precision.
About this tool
Running large language models locally starts with understanding GPU memory. A 7B parameter model in FP16 needs ~14 GB of VRAM just for the weights, before accounting for KV cache and activations. This calculator helps you plan hardware before committing to expensive cloud instances or GPU purchases.
Quick Fact
Meta's LLaMA 3 405B model requires approximately 810 GB of VRAM in FP16: at least eleven NVIDIA A100 80GB GPUs just to load the weights, since ten cards provide only 800 GB.
Common Use Cases
→ Local LLM Setup
Find out if your RTX 4090 (24GB) can run LLaMA 3 70B with 4-bit quantization before downloading 40GB of weights.
→ Cloud Cost Planning
Determine whether you need an A100 40GB or 80GB instance, potentially saving $2–5 per hour in cloud GPU costs.
→ Quantization Tradeoffs
Compare FP16 vs INT4 precision to balance model quality against memory constraints.
→ Multi-GPU Setups
Plan how many GPUs you need to run large models like LLaMA 3 405B or GPT-3 scale models.
Frequently Asked Questions
How much GPU memory does LLaMA 3 8B need?
LLaMA 3 8B requires approximately 16 GB of GPU memory in FP16 precision (including overhead). With 4-bit quantization (INT4 or GGUF Q4), memory drops to around 5–6 GB, making it runnable on consumer GPUs like the RTX 3060.
How do I calculate GPU memory for an LLM?
Multiply the number of parameters by bytes per parameter: FP32 uses 4 bytes, FP16/BF16 uses 2 bytes, INT8 uses 1 byte, INT4 uses 0.5 bytes. Then add ~20% for KV cache and activations. Example: 7B parameters × 2 bytes (FP16) = 14 GB + 20% overhead = ~17 GB total.
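The rule of thumb above can be sketched as a small helper (the function name and the 20% overhead default are illustrative, not part of any library):

```python
def llm_vram_gb(params_billion: float, bytes_per_param: float, overhead: float = 0.20) -> float:
    """Rough VRAM estimate in GB for loading an LLM.

    bytes_per_param: 4.0 (FP32), 2.0 (FP16/BF16), 1.0 (INT8), 0.5 (INT4).
    overhead: allowance for KV cache and activations (~20%).
    """
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte each ~ 1 GB
    return weights_gb * (1 + overhead)

# 7B in FP16: 7 * 2 = 14 GB of weights, ~16.8 GB with 20% overhead
print(round(llm_vram_gb(7, 2.0), 1))  # 16.8
```

Actual overhead varies with batch size and context length, so treat the result as a lower bound rather than an exact figure.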
Can I run a 70B model on a single GPU?
A 70B model in FP16 requires ~140 GB of VRAM — more than any single consumer GPU. However, with 4-bit quantization, memory drops to ~35–40 GB, which fits on a single NVIDIA A100 80GB or two RTX 4090s combined.
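The same arithmetic shows why quantization is what makes 70B models fit. A minimal sketch (helper names are illustrative; figures are weights only, excluding KV cache):

```python
# Bytes per parameter for common precisions
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weights_gb(params_billion: float, precision: str) -> float:
    """Weight memory in GB, ignoring KV cache and activations."""
    return params_billion * BYTES_PER_PARAM[precision]

def fits(params_billion: float, precision: str, vram_gb: float) -> bool:
    """Check whether the weights alone fit in the given VRAM."""
    return weights_gb(params_billion, precision) <= vram_gb

print(weights_gb(70, "fp16"))  # 140.0 GB: no single GPU holds this
print(fits(70, "fp16", 80))    # False: exceeds one A100 80GB
print(fits(70, "int4", 80))    # True: 35 GB fits comfortably
```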
What is model quantization?
Quantization reduces the numerical precision of model weights, shrinking memory requirements at a small quality cost. Common formats include INT8 (2× smaller than FP16), INT4 (4× smaller), and GGUF Q4_K_M (a popular format for llama.cpp). Most 4-bit quantized models retain 95%+ of the original quality.
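The size ratios above follow directly from bytes per parameter; for example, an 8B model under the common formats:

```python
# Weight memory for an 8B-parameter model under common precisions (GB, weights only)
bytes_per_param = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}
sizes = {fmt: 8 * b for fmt, b in bytes_per_param.items()}

print(sizes)                           # {'FP16': 16.0, 'INT8': 8.0, 'INT4': 4.0}
print(sizes["FP16"] / sizes["INT4"])   # 4.0 -> INT4 is 4x smaller than FP16
```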
What GPU do I need to run LLMs locally?
For 7B models in 4-bit quantization, an 8–12 GB VRAM GPU like the RTX 3060 or 4060 is sufficient. For 13B models, 16–24 GB (RTX 3090, 4090). For 70B models, either multiple consumer GPUs or a professional A100/H100 is required.
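Picking the smallest sufficient card from a shortlist is a one-liner once you have a VRAM estimate. A sketch with a hypothetical GPU list (swap in whatever hardware you are actually considering):

```python
# Hypothetical shortlist of (name, VRAM in GB); adjust to your options
GPUS = [("RTX 3060", 12), ("RTX 4090", 24), ("A100 80GB", 80)]

def smallest_gpu(required_gb: float):
    """Return the name of the smallest GPU with enough VRAM, or None."""
    for name, vram in sorted(GPUS, key=lambda g: g[1]):
        if vram >= required_gb:
            return name
    return None

print(smallest_gpu(5))    # RTX 3060: a 7B model in INT4 fits a 12 GB card
print(smallest_gpu(40))   # A100 80GB: a 70B model in INT4 needs ~35-40 GB
print(smallest_gpu(200))  # None: a 405B-class model needs multiple GPUs
```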
Other AI tools
Token Calculator
Estimate token count from text length for any LLM model.
API Cost Estimator
Estimate LLM API costs based on token usage across major providers.
Context Window Calculator
See how much text fits inside an LLM's context window.
Compute Units Converter
Convert between FLOPS, TFLOPS, PFLOPS and GPU-hours.