February 24, 2026

RTX GPU LLM Benchmarks: What Can Your Card Actually Run?

Complete benchmark table showing which local LLM models run on each RTX GPU from RTX 3060 to 5090. Token speeds, VRAM requirements, and recommendations for every budget.

You've got an RTX GPU and want to run local LLMs. But which models actually fit in your VRAM, and what kind of speed can you expect?

This guide cuts through the noise with real benchmark data from Hardware Corner testing llama.cpp on Ubuntu with CUDA 12.8. All speeds are token generation at 16K context using Q4_K_XL quantization.
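A note on why context is pinned at 16K in these numbers: the KV cache grows linearly with context length and competes with the model weights for VRAM. A back-of-envelope sketch (the layer and head counts below are assumptions for a Llama-3-8B-shaped model, not part of the benchmark data):

```python
# Rough KV-cache size estimate for a Llama-style transformer.
# Architecture numbers in the example are illustrative assumptions
# (roughly an 8B model with grouped-query attention), not measured values.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    """Bytes needed to hold K and V tensors for every layer at full context."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem

# Example: 32 layers, 8 KV heads, head dim 128, 16K context, fp16 cache
size = kv_cache_bytes(32, 8, 128, 16384)
print(f"{size / 2**30:.1f} GiB")  # -> 2.0 GiB
```

At 16K context that is roughly 2 GiB on top of the weights, which is why the same card can run a model at 8K context that it can't at 32K.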

The Quick Reference Table

| GPU | VRAM | 7B-8B | 14B | 30B | 70B | Best For |
|---|---|---|---|---|---|---|
| RTX 4060 Ti | 8GB | 34 t/s | 22 t/s | — | — | 7B models, coding assistants |
| RTX 4060 Ti | 16GB | 34 t/s | 22 t/s | — | — | 8B-14B, longer context |
| RTX 3060 | 12GB | 42 t/s | 23 t/s | — | — | Budget 7B-14B |
| RTX 4070 | 12GB | 52 t/s | 33 t/s | — | — | Fast 7B-14B |
| RTX 4070 SUPER | 12GB | 56 t/s | 37 t/s | — | — | Solid 14B performer |
| RTX 4070 Ti | 12GB | 58 t/s | 38 t/s | — | — | 14B at higher speed |
| RTX 4070 Ti SUPER | 16GB | 72 t/s | 47 t/s | — | — | 16GB sweet spot |
| RTX 3080 | 10GB | 74 t/s | — | — | — | Fast 7B-8B |
| RTX 3080 Ti | 12GB | 88 t/s | 52 t/s | — | — | Solid 14B workhorse |
| RTX 3090 | 24GB | 87 t/s | 52 t/s | 114 t/s | —* | 24GB sweet spot |
| RTX 3090 Ti | 24GB | 94 t/s | 57 t/s | 122 t/s | —* | Fastest 30B single GPU |
| RTX 4080 | 16GB | 78 t/s | 51 t/s | — | — | Premium 14B |
| RTX 4080 SUPER | 16GB | 79 t/s | 53 t/s | — | — | Premium 14B |
| RTX 4090 | 24GB | 104 t/s | 69 t/s | 140 t/s | —* | Consumer king |
| RTX 5090 | 32GB | 145 t/s | 103 t/s | 142 t/s | —* | Best consumer LLM GPU |

*70B models require offloading or dual-GPU setups on consumer cards. See below for details.

VRAM Requirements by Model Size

Here's roughly what you need in VRAM for different quantization levels:

| Model Size | FP16 | Q8 | Q6 | Q4_K_XL | Q4_K_M |
|---|---|---|---|---|---|
| 7B | 14GB | 7GB | 5.5GB | 4.5GB | 3.5GB |
| 8B | 16GB | 8GB | 6GB | 5GB | 4GB |
| 14B | 28GB | 14GB | 10GB | 8GB | 6GB |
| 30B | 60GB | 30GB | 22GB | 18GB | 14GB |
| 70B | 140GB | 70GB | 50GB | 40GB | 32GB |
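These figures can be sanity-checked with a simple rule of thumb: weight memory ≈ parameters × bits per weight ÷ 8. The bits-per-weight values below are my approximate effective figures for common llama.cpp quant formats, so expect the estimate to land within a gigabyte or two of the table rather than match it exactly:

```python
# Rule-of-thumb weight-memory estimate: params * bits_per_weight / 8.
# Effective bits-per-weight values are approximations (assumption),
# and real GGUF files add a small amount of metadata overhead.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8": 8.5,
    "Q6": 6.6,
    "Q4_K_M": 4.8,
}

def weight_gb(params_billions, quant):
    """Approximate model-weight size in decimal GB for a given quant."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"{weight_gb(7, 'FP16'):.1f} GB")  # -> 14.0 GB
print(f"{weight_gb(30, 'Q4_K_M'):.1f} GB")
```

Remember this only covers the weights; KV cache and activation buffers come on top, which is why a 14GB model is not comfortable on a 16GB card at long context.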

What Actually Fits in Each GPU

8GB VRAM (RTX 4060 Ti 8GB)

7B-8B models at Q4-Q6 fit with room for context. 14B is possible at Q4_K_M (~6GB), but only with a reduced context window.

10-12GB VRAM (RTX 3080 10GB, RTX 3060, 4070 series)

14B at Q4 fits comfortably on the 12GB cards, 16K context included. 30B is out of reach without heavy CPU offloading. The 3080's 10GB makes it best treated as a fast 7B-8B card.

16GB VRAM (RTX 4060 Ti 16GB, 4070 Ti SUPER, 4080 series)

14B runs well at Q6 or even Q8. 30B is reachable with aggressive quantization (Q4_K_M is ~14GB), though context headroom gets tight.

24GB VRAM (RTX 3090, 3090 Ti, RTX 4090)

30B at Q4 fits with full 16K context. 70B still requires partial CPU offload or a second card.

Recommendations by Use Case

Budget Build (Under $400)

RTX 3060 12GB — Used market is ~$200-250. Handles 7B-14B at Q4-Q5. Not blazing fast but totally usable for coding assistants and chat.

Best Value (~$600-800)

RTX 3090 24GB (used) — The king of value. $600-800 gets you 24GB VRAM, enough for 30B models. Still competitive with newer cards on token speed.

16GB Sweet Spot (~$750-1200)

RTX 4070 Ti SUPER 16GB — Best balance of speed and capacity. 72 t/s on 8B, 47 t/s on 14B. Can push into 30B territory with quantization.

No Compromise (~$2000+)

RTX 4090 24GB — Still the consumer king. 140 t/s on 30B models. Can run 70B with partial offload.

RTX 5090 32GB — New champion. 145 t/s on 8B, 142 t/s on 30B. Long context handling is in a league of its own.

Pro Tips

  1. Use LM Studio — It handles GPU layer allocation automatically and offers one-click downloads of quantized models.
  2. Lower context when possible — 4K-8K context uses significantly less VRAM than 16K+ and runs faster.
  3. Q4_K_M is the sweet spot — Good quality, fits in less VRAM, barely distinguishable from FP16 in chat use.
  4. Enable CPU offload — If a model is slightly too big, offloading layers to CPU RAM is slower but works.
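Tip 4 can be made concrete. llama.cpp exposes the GPU/CPU split as a number of GPU layers (the -ngl flag), and a rough way to pick it is to divide your free VRAM by the per-layer size. The model dimensions below are hypothetical examples, not benchmark figures:

```python
# Sketch of tip 4: estimate how many transformer layers fit on the GPU
# (llama.cpp's -ngl setting), leaving the rest in CPU RAM.
# Layer count and model size are made-up example numbers (assumption).

def n_gpu_layers(total_layers, model_gb, vram_gb, reserve_gb=1.5):
    """Layers that fit on the GPU, keeping reserve_gb free for KV cache etc."""
    per_layer_gb = model_gb / total_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(total_layers, fit))

# Hypothetical 30B-class model (~18 GB at Q4) on a 12 GB card:
print(n_gpu_layers(48, 18.0, 12.0))  # -> 28 (of 48 layers on GPU)
```

Every layer kept on the GPU helps, so even a partial split is usually much faster than pure CPU inference.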

Bottom Line

If you have a 12GB card, stick to 7B-14B models. If you want 30B capability, you need 16GB+ or, better yet, a used 24GB RTX 3090 — it's still the best price-to-VRAM ratio in 2026.