February 24, 2026
RTX GPU LLM Benchmarks: What Can Your Card Actually Run?
Complete benchmark table showing which local LLM models run on each RTX GPU from RTX 3060 to 5090. Token speeds, VRAM requirements, and recommendations for every budget.
You've got an RTX GPU and want to run local LLMs. But which models actually fit in your VRAM, and what kind of speed can you expect?
This guide cuts through the noise with real benchmark data from Hardware Corner testing llama.cpp on Ubuntu with CUDA 12.8. All speeds are token generation at 16K context using Q4_K_XL quantization.
The Quick Reference Table
| GPU | VRAM | 7B-8B | 14B | 30B | 70B | Best For |
|---|---|---|---|---|---|---|
| RTX 4060 Ti | 8GB | 34 t/s ✓ | 22 t/s ✓ | — | — | 7B models, coding assistants |
| RTX 4060 Ti | 16GB | 34 t/s ✓ | 22 t/s ✓ | — | — | 8B-14B, longer context |
| RTX 3060 | 12GB | 42 t/s ✓ | 23 t/s ✓ | — | — | Budget 7B-14B |
| RTX 4070 | 12GB | 52 t/s ✓ | 33 t/s ✓ | — | — | Fast 7B-14B |
| RTX 4070 SUPER | 12GB | 56 t/s ✓ | 37 t/s ✓ | — | — | Solid 14B performer |
| RTX 4070 Ti | 12GB | 58 t/s ✓ | 38 t/s ✓ | — | — | 14B at higher speed |
| RTX 4070 Ti SUPER | 16GB | 72 t/s ✓ | 47 t/s ✓ | — | — | 16GB sweet spot |
| RTX 3080 | 10GB | 74 t/s ✓ | — | — | — | Fast 7B-8B |
| RTX 3080 Ti | 12GB | 88 t/s ✓ | 52 t/s ✓ | — | — | Solid 14B workhorse |
| RTX 3090 | 24GB | 87 t/s ✓ | 52 t/s ✓ | 114 t/s ✓ | —* | 24GB sweet spot |
| RTX 3090 Ti | 24GB | 94 t/s ✓ | 57 t/s ✓ | 122 t/s ✓ | —* | Fastest 30B single GPU |
| RTX 4080 | 16GB | 78 t/s ✓ | 51 t/s ✓ | — | — | Premium 14B |
| RTX 4080 SUPER | 16GB | 79 t/s ✓ | 53 t/s ✓ | — | — | Premium 14B |
| RTX 4090 | 24GB | 104 t/s ✓ | 69 t/s ✓ | 140 t/s ✓ | —* | Consumer king |
| RTX 5090 | 32GB | 145 t/s ✓ | 103 t/s ✓ | 142 t/s ✓ | —* | Best consumer LLM GPU |
*70B models require offloading or dual-GPU setups on consumer cards. See below for details.
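Why do the speeds scale roughly the way they do? Single-stream decoding of a dense model that fits entirely in VRAM is memory-bandwidth bound: every generated token has to read the full set of weights once. Here's a back-of-the-envelope sketch — the bandwidth figures are spec-sheet values, and the 0.5 efficiency factor is an assumption, not a measurement:

```python
# Rough upper bound for single-stream decode speed on a dense model that
# fits fully in VRAM: each token reads every weight once, so
# t/s <= memory_bandwidth / model_bytes. The 0.5 efficiency factor is an
# illustrative assumption for real-world overheads.

def decode_tps_upper_bound(bandwidth_gbps: float, model_gb: float,
                           efficiency: float = 0.5) -> float:
    """Estimate sustainable tokens/sec for a fully GPU-resident model."""
    return bandwidth_gbps / model_gb * efficiency

# Spec-sheet memory bandwidth (GB/s) for a few cards from the table above.
CARDS = {"RTX 3060": 360, "RTX 3090": 936, "RTX 4090": 1008, "RTX 5090": 1792}

for name, bw in CARDS.items():
    est = decode_tps_upper_bound(bw, model_gb=5.0)  # 8B at Q4 is ~5 GB
    print(f"{name}: ~{est:.0f} t/s on an 8B Q4 model")
```

These estimates land in the same ballpark as the measured 7B-8B numbers above, which is why VRAM bandwidth, not compute, is usually the spec to compare.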
VRAM Requirements by Model Size
Here's roughly what you need in VRAM for different quantization levels:
| Model Size | FP16 | Q8 | Q6 | Q4_K_XL | Q4_K_M |
|---|---|---|---|---|---|
| 7B | 14GB | 7GB | 5.5GB | 4.5GB | 3.5GB |
| 8B | 16GB | 8GB | 6GB | 5GB | 4GB |
| 14B | 28GB | 14GB | 10GB | 8GB | 6GB |
| 30B | 60GB | 30GB | 22GB | 18GB | 14GB |
| 70B | 140GB | 70GB | 50GB | 40GB | 32GB |
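The table above follows from simple arithmetic: weight memory is roughly parameters × bits per weight ÷ 8, plus some runtime overhead. A minimal sketch — the effective bits-per-weight values are approximations, and real GGUF files vary by a few percent:

```python
# Approximate effective bits per weight for common quantization levels.
# These are illustrative values chosen to roughly reproduce the table
# above; actual GGUF file sizes differ slightly between models.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8": 8.0, "Q6": 6.0,
                   "Q4_K_XL": 4.8, "Q4_K_M": 4.0}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """Estimate weight memory in GB: params * bits-per-weight / 8."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"14B at Q4_K_XL: ~{weight_vram_gb(14, 'Q4_K_XL'):.1f} GB")
```

For example, 14B at Q4_K_XL comes out to about 8.4 GB, matching the ~8GB row above. Add roughly 1-3 GB on top for the KV cache and runtime buffers.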
What Actually Fits in Each GPU
8GB VRAM (RTX 4060 Ti 8GB)
- 7B models: Q4 and below — runs great
- 8B models: Q3 or lower
- 14B: Q4_K_M (~6GB) fits on paper, but leaves almost no headroom for context — drop to Q3 or keep context short
10-12GB VRAM (RTX 3080 10GB, RTX 3060, RTX 3080 Ti, RTX 4070 series)
- 7B-8B models: Any quantization up to Q8 — flies
- 14B models: Q4-Q6 works well
- 30B: Won't fit fully at any useful quant — needs CPU offload and runs much slower
16GB VRAM (RTX 4060 Ti 16GB, 4080 series)
- 14B models: Q6-Q8 comfortably
- 30B models: Q4_K_M (~14GB) fits; Q4_K_XL and above need offload
- 34B MoE models: Q4 works on some
24GB VRAM (RTX 3090, 3090 Ti, RTX 4090)
- 30B models: Q6-Q8 at full speed
- 70B models: At Q4, roughly half the layers fit in VRAM, with the rest offloaded to system RAM — usable but slow (3-6 t/s)
- Long context: Can handle 32K+ contexts on 14B-30B models
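That "roughly half" figure for 70B can be sketched with a simple layer-split calculation. This assumes uniform layer sizes and reserves ~1.5 GB for cache and buffers — both simplifications:

```python
# Sketch of the VRAM/RAM split when a model is too big for the card,
# as with 70B on a 24 GB GPU. Assumes all transformer layers are the
# same size and reserves a flat 1.5 GB for KV cache and runtime buffers
# (both simplifying assumptions).

def gpu_layer_split(model_gb: float, n_layers: int, vram_gb: float,
                    reserved_gb: float = 1.5) -> int:
    """Return how many layers fit on the GPU; the rest go to system RAM."""
    per_layer = model_gb / n_layers
    fit = int((vram_gb - reserved_gb) / per_layer)
    return max(0, min(n_layers, fit))

# 70B at Q4 (~40 GB from the table above), assuming an 80-layer model:
layers = gpu_layer_split(40, 80, 24)
print(f"{layers}/80 layers on GPU ({layers / 80:.0%})")
```

With those assumptions, about 45 of 80 layers (~56%) land on a 24GB card — consistent with the "~50% to VRAM" rule of thumb above. This is the same arithmetic llama.cpp's layer-offload setting exposes manually.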
Recommendations by Use Case
Budget Build (Under $400)
RTX 3060 12GB — Used market is ~$200-250. Handles 7B-14B at Q4-Q5. Not blazing fast but totally usable for coding assistants and chat.
Best Value (~$600-800)
RTX 3090 24GB (used) — The king of value. $600-800 gets you 24GB VRAM, enough for 30B models. Still competitive with newer cards on token speed.
16GB Sweet Spot (~$750-1200)
RTX 4070 Ti SUPER 16GB — Best balance of speed and capacity. 72 t/s on 8B, 47 t/s on 14B. Can push into 30B territory with quantization.
No Compromise (~$2000+)
RTX 4090 24GB — Still the consumer king. 140 t/s on 30B models. Can run 70B with partial offload.
RTX 5090 32GB — New champion. 145 t/s on 8B, 142 t/s on 30B. Long context handling is in a league of its own.
Pro Tips
- Use LM Studio — It handles GPU layer allocation automatically and offers one-click downloads of quantized models.
- Lower context when possible — 4K-8K context uses significantly less VRAM than 16K+ and runs faster.
- Q4_K_M is the sweet spot — Good quality, fits in less VRAM, barely distinguishable from FP16 in chat use.
- Enable CPU offload — If a model is slightly too big, offloading layers to CPU RAM is slower but works.
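The context tip comes down to KV-cache growth: cache memory scales linearly with context length. A sketch using illustrative architecture numbers for a 14B-class model with grouped-query attention — these are assumptions, not tied to any specific checkpoint:

```python
# Why lowering context saves VRAM: KV-cache memory grows linearly with
# context length. Architecture numbers are illustrative assumptions for
# a 14B-class model with grouped-query attention.

def kv_cache_gb(ctx: int, n_layers: int = 48, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GB: 2 (K and V) * layers * kv_heads * head_dim
    * context * bytes per element (2 for an fp16 cache)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

for ctx in (4096, 8192, 16384, 32768):
    print(f"{ctx:>5} tokens: {kv_cache_gb(ctx):.2f} GB KV cache")
```

Under these assumptions, dropping from 16K to 4K context frees a couple of gigabytes — often the difference between a model fitting fully on the GPU or spilling into slow CPU offload.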
Bottom Line
If you have a 12GB card, stick to 7B-14B models. If you want 30B capability, you need 16GB+ or better yet, grab a used 24GB RTX 3090 — it's still the best price-to-VRAM ratio in 2026.