February 23, 2026

GPU VRAM vs System RAM: What You Can Actually Run on a 16GB Budget

AI Local Hardware

Your RTX 3070 Ti has 8GB VRAM. That's a fact. But here's what the marketing doesn't tell you: you can actually access up to 16GB of memory for LLM inference through shared system RAM — if you know how to configure it.

The 16GB Memory Architecture Reality

Modern GPUs don't have to work with VRAM alone. With Resizable BAR (Base Address Register) enabled, the CPU can address the GPU's full VRAM in a single mapping, making PCIe transfers more efficient, and inference loaders can split a model between VRAM and system RAM. Your 3070 Ti's 8GB doesn't exist in isolation: combined with a slice of system RAM, it forms a larger working pool.

Here's the breakdown:

  • 8GB VRAM — Fast, on-GPU memory (600+ GB/s bandwidth)
  • Up to 8GB shared system RAM — Slower but available (up to ~32 GB/s over PCIe 4.0 x16)
  • Total: ~16GB — The actual memory budget you're working with

Note: This requires enabling Resizable BAR in your BIOS and using a loader that supports CPU offloading (llama.cpp, LM Studio, oobabooga).
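The budget arithmetic above can be sketched as a quick check. A minimal sketch: the 8GB figures come from this article, and the 90% "usable" factor is my assumption to leave headroom for the OS, driver, and display buffers.

```python
def memory_budget_gb(vram_gb: float, shared_ram_gb: float, usable: float = 0.9) -> float:
    """Total memory (GB) a loader can realistically use across VRAM + shared RAM."""
    # usable < 1.0 leaves headroom for the OS, driver, and display buffers (assumed)
    return (vram_gb + shared_ram_gb) * usable

# 8GB VRAM + 8GB shareable system RAM, as described above
print(f"~{memory_budget_gb(8.0, 8.0):.1f} GB usable")  # ~14.4 GB usable
```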

What Fits in 8GB GPU-Only VRAM

Running purely in VRAM gives you the best inference speeds — 30-60+ tokens/second depending on your GPU. These models fit comfortably in 8GB:

Model Quantization VRAM Needed Tokens/sec (3070 Ti)
Qwen 3 4B Q4_K_M ~4.5 GB 45-55
Mistral 7B Q4_0 ~4.2 GB 35-42
Phi-4 Mini Q4_K_M ~3.8 GB 50-60
Qwen 2.5 3B Q4_K_M ~2.1 GB 60-70
Llama 3.2 3B Q4_K_M ~2.3 GB 55-65
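A rough way to sanity-check figures like these yourself: weights dominate memory use, at roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and compute buffers. The bits-per-weight and overhead values below are my assumptions for illustration, not exact llama.cpp numbers.

```python
# Approximate bits per weight for common GGUF quantizations (assumed values)
BITS_PER_WEIGHT = {"Q4_0": 4.55, "Q4_K_M": 4.85, "Q5_K_S": 5.55, "Q8_0": 8.5}

def est_memory_gb(params_b: float, quant: str, overhead: float = 1.15) -> float:
    """Rough memory estimate: weights plus ~15% for KV cache/buffers (assumed)."""
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8
    return weights_gb * overhead

print(f"7B @ Q4_0: ~{est_memory_gb(7, 'Q4_0'):.1f} GB")  # ~4.6 GB
```

The estimate lands in the same ballpark as the table; real numbers depend on context length and the loader's buffer sizes.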

What Runs with CPU Offload (16GB Total)

With the full 16GB memory budget (VRAM + shared system RAM), you can run larger models at longer contexts. Note that the figures below cover model weights only; the KV cache and compute buffers come on top, which is what pushes these setups past 8GB of VRAM in practice. The tradeoff? Once layers spill out of VRAM, inference can drop to 8-20 tokens/second because the PCIe link and system RAM become the bottleneck.

Model Quantization Memory Needed (weights only) Tokens/sec (3070 Ti)
Llama 3.1 8B Q4_0 ~5.2 GB 30-40
Qwen 2.5 7B Q4_K_M ~4.8 GB 35-45
Mistral 7B Q5_K_S ~5.5 GB 28-35
DeepSeek Coder 7B Q4_K_M ~4.9 GB 32-40
Qwen 3 8B Q4_K_M ~5.8 GB 25-35
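Why offload slows things down so much: per generated token, every weight in a CPU-resident layer has to be touched once, and that traffic moves at system-RAM or PCIe speed rather than VRAM speed. A back-of-envelope throughput ceiling, using a model size from the table and my own assumed 30% offload fraction and 25 GB/s effective slow-path bandwidth:

```python
def offload_tps_ceiling(model_gb: float, frac_offloaded: float,
                        slow_path_gbps: float = 25.0) -> float:
    """Upper bound on tokens/sec set by reading offloaded weights once per token."""
    gb_touched_per_token = model_gb * frac_offloaded
    return slow_path_gbps / gb_touched_per_token

# 5.2 GB model with 30% of its weights off-GPU (assumed split)
print(f"~{offload_tps_ceiling(5.2, 0.3):.0f} tok/s ceiling")  # ~16 tok/s ceiling
```

That lands right in the 8-20 tokens/second range described above; offloading more layers pushes the ceiling lower.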

Integrated Graphics: You're Not Left Out

Using an AMD 780M (laptop APU) or Intel Xe integrated GPU? Don't sleep on local LLMs. You can still run smaller models — they're just using system RAM exclusively.

  • AMD 780M — Runs Qwen 3 4B at 12-18 tokens/sec (system RAM)
  • Intel Xe (Gen12+) — Runs Phi-4 Mini at 10-15 tokens/sec
  • No dedicated GPU — CPU-only with llama.cpp still works for 3B models

Hardware Recommendations

8GB VRAM (3070 Ti, 4060 Ti)

Stick to GPU-only for speed. Enable Resizable BAR.

  • ✅ Qwen 3 4B — smooth
  • ✅ Mistral 7B Q4 — solid
  • ⚠️ 8B models — slow with offload

12GB VRAM (3080, 4070)

Sweet spot for hobbyists. 12GB comfortably covers 7B-8B models at high-quality quants.

  • ✅ Qwen 3 8B — runs in VRAM
  • ✅ Llama 3.1 8B Q5
  • ✅ Mistral Nemo 12B Q4

16GB VRAM (4060 Ti 16GB, 4080)

Serious local LLM power. 14B models fit comfortably at Q4/Q5; 30B-class models squeeze in at low quants.

  • ✅ Qwen 2.5 14B
  • ✅ Phi-4 14B
  • ✅ DeepSeek Coder 33B (Q3, a tight fit)

24GB+ VRAM (4090, 5090)

Desktop replacement territory.

  • ✅ 32B models at Q4 (e.g. Qwen 2.5 32B)
  • ✅ 70B models at Q2, or Q4 with heavy CPU offload
  • ✅ Mixtral 8x7B at Q3, or Q4 with light offload

LM Studio & CPU Offload: What Actually Works

Yes, all GGUF models support CPU offload; it's a feature of the llama.cpp backend that LM Studio, Ollama, and oobabooga all build on. But there's a catch: you generally still need enough free system RAM to hold the model file while it loads (llama.cpp memory-maps the file), even when most layers end up on the GPU.

Here's how it works:

  • GPU layers — Layers whose weights live in VRAM and run on the GPU for fast inference
  • CPU layers — Remaining layers stay in system RAM and are computed on the CPU
  • The catch — You still want system RAM roughly equal to the model file size for loading and caching, even when most layers are offloaded to the GPU
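The split itself can be sketched like this. It's a simplification: it assumes uniform layer sizes and ignores the KV cache, which also needs VRAM, so treat it as a back-of-envelope split rather than what a real loader computes.

```python
def split_layers(n_layers: int, model_gb: float, vram_budget_gb: float) -> tuple[int, int]:
    """Return (gpu_layers, cpu_layers) for a given VRAM budget."""
    per_layer_gb = model_gb / n_layers          # assume layers are equal-sized
    gpu_layers = min(n_layers, int(vram_budget_gb / per_layer_gb))
    return gpu_layers, n_layers - gpu_layers

# 32-layer, 5.2 GB model with 4 GB of VRAM left over for weights (assumed)
gpu, cpu = split_layers(32, 5.2, 4.0)
print(f"{gpu} layers on GPU, {cpu} on CPU")  # 24 layers on GPU, 8 on CPU
```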

LM Studio defaults are conservative — it won't max out your GPU by default. You can adjust GPU layer count manually:

  • --gpu 0.5 — Offload 50% of layers to GPU
  • --gpu max — Offload everything possible
  • --gpu off — CPU only (useful for debugging)
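One plausible reading of how a fractional setting like the above maps to layer counts; this mapping is my assumption about the semantics, not LM Studio's actual implementation.

```python
def gpu_layers_for_flag(flag: str, n_layers: int = 32) -> int:
    """Map a --gpu style value ('off', 'max', or a 0-1 fraction) to a layer count."""
    if flag == "off":
        return 0
    if flag == "max":
        return n_layers
    return round(float(flag) * n_layers)  # e.g. "0.5" -> half the layers on GPU

print(gpu_layers_for_flag("0.5"))  # 16
```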

Pro tip: Even with partial GPU offload, you'll see 2-5x speedup over CPU-only. The PCIe bus is slower than VRAM, but GPU compute is still way faster than CPU.

The Bottom Line

Your 8GB 3070 Ti isn't as limited as you think. With proper configuration, you've got 16GB of usable memory. The key is matching your model to your hardware:

  • Want speed? Stay within 8GB VRAM — Qwen 3 4B and Mistral 7B Q4 deliver 40+ tokens/sec
  • Need bigger models? Enable CPU offload and accept 10-20 tokens/sec
  • Integrated graphics? Qwen 3 4B or Phi-4 Mini off system RAM still works
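That matching rule condenses into a tiny decision helper. The thresholds mirror this article's 8GB/16GB budgets; the function name is made up for illustration.

```python
def plan(model_gb: float, vram_gb: float = 8.0, shared_gb: float = 8.0) -> str:
    """Pick a run mode for a model of a given size under this article's budgets."""
    if model_gb <= vram_gb:
        return "gpu-only (fast)"
    if model_gb <= vram_gb + shared_gb:
        return "cpu-offload (slower)"
    return "does not fit"

print(plan(4.5))   # gpu-only (fast)
print(plan(12.0))  # cpu-offload (slower)
```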

The best model isn't the biggest one that "runs" — it's the biggest one that runs fast enough for your use case.
