February 24, 2026

RTX GPU LLM Benchmarks: What Can Your Card Actually Run?

Complete benchmark table showing which local LLM models run on each RTX GPU from RTX 3060 to 5090. Token speeds, VRAM requirements, and recommendations for every budget.

You've got an RTX GPU and want to run local LLMs. But which models actually fit in your VRAM, and what kind of speed can you expect?

This guide cuts through the noise with real benchmark data from Hardware Corner testing llama.cpp on Ubuntu with CUDA 12.8. All speeds are token generation at 16K context using Q4_K_XL quantization.
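A note on why context is pinned at 16K in these numbers: the KV cache grows linearly with context length and competes with the model weights for VRAM. A back-of-envelope sketch (the layer and head counts below are assumptions for a Llama-3-8B-shaped model, not part of the benchmark data):

```python
# Rough KV-cache size estimate for a Llama-style transformer.
# Architecture numbers in the example are illustrative assumptions
# (roughly an 8B model with grouped-query attention), not measured values.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, bytes_per_elem=2):
    """Bytes needed to hold K and V tensors for every layer at full context."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_elem

# Example: 32 layers, 8 KV heads, head dim 128, 16K context, fp16 cache
size = kv_cache_bytes(32, 8, 128, 16384)
print(f"{size / 2**30:.1f} GiB")  # -> 2.0 GiB
```

At 16K context that is roughly 2 GiB on top of the weights, which is why the same card can run a model at 8K context that it can't at 32K.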

The Quick Reference Table

| GPU | VRAM | 7B-8B | 14B | 30B | 70B | Best For |
|---|---|---|---|---|---|---|
| RTX 4060 Ti | 8GB | 34 t/s | 22 t/s | — | — | 7B models, coding assistants |
| RTX 4060 Ti | 16GB | 34 t/s | 22 t/s | — | — | 8B-14B, longer context |
| RTX 3060 | 12GB | 42 t/s | 23 t/s | — | — | Budget 7B-14B |
| RTX 4070 | 12GB | 52 t/s | 33 t/s | — | — | Fast 7B-14B |
| RTX 4070 SUPER | 12GB | 56 t/s | 37 t/s | — | — | Solid 14B performer |
| RTX 4070 Ti | 12GB | 58 t/s | 38 t/s | — | — | 14B at higher speed |
| RTX 4070 Ti SUPER | 16GB | 72 t/s | 47 t/s | — | — | 16GB sweet spot |
| RTX 3080 | 10GB | 74 t/s | — | — | — | Fast 7B-8B |
| RTX 3080 Ti | 12GB | 88 t/s | 52 t/s | — | — | Solid 14B workhorse |
| RTX 3090 | 24GB | 87 t/s | 52 t/s | 114 t/s | —* | 24GB sweet spot |
| RTX 3090 Ti | 24GB | 94 t/s | 57 t/s | 122 t/s | —* | Fastest 30B single GPU |
| RTX 4080 | 16GB | 78 t/s | 51 t/s | — | — | Premium 14B |
| RTX 4080 SUPER | 16GB | 79 t/s | 53 t/s | — | — | Premium 14B |
| RTX 4090 | 24GB | 104 t/s | 69 t/s | 140 t/s | —* | Consumer king |
| RTX 5090 | 32GB | 145 t/s | 103 t/s | 142 t/s | —* | Best consumer LLM GPU |

*70B models require offloading or dual-GPU setups on consumer cards. See below for details.

VRAM Requirements by Model Size

Here's roughly what you need in VRAM for different quantization levels:

| Model Size | FP16 | Q8 | Q6 | Q4_K_XL | Q4_K_M |
|---|---|---|---|---|---|
| 7B | 14GB | 7GB | 5.5GB | 4.5GB | 3.5GB |
| 8B | 16GB | 8GB | 6GB | 5GB | 4GB |
| 14B | 28GB | 14GB | 10GB | 8GB | 6GB |
| 30B | 60GB | 30GB | 22GB | 18GB | 14GB |
| 70B | 140GB | 70GB | 50GB | 40GB | 32GB |
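These figures can be sanity-checked with a simple rule of thumb: weight memory ≈ parameters × bits per weight ÷ 8. The bits-per-weight values below are my approximate effective figures for common llama.cpp quant formats, so expect the estimate to land within a gigabyte or two of the table rather than match it exactly:

```python
# Rule-of-thumb weight-memory estimate: params * bits_per_weight / 8.
# Effective bits-per-weight values are approximations (assumption),
# and real GGUF files add a small amount of metadata overhead.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8": 8.5,
    "Q6": 6.6,
    "Q4_K_M": 4.8,
}

def weight_gb(params_billions, quant):
    """Approximate model-weight size in decimal GB for a given quant."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

print(f"{weight_gb(7, 'FP16'):.1f} GB")  # -> 14.0 GB
print(f"{weight_gb(30, 'Q4_K_M'):.1f} GB")
```

Remember this only covers the weights; KV cache and activation buffers come on top, which is why a 14GB model is not comfortable on a 16GB card at long context.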

What Actually Fits in Each GPU

8GB VRAM (RTX 4060 Ti 8GB)

7B-8B models at Q4-Q6 fit with room for context. 14B is possible at Q4_K_M (~6GB), but only with a reduced context window.

10-12GB VRAM (RTX 3080 10GB, RTX 3060, 4070 series)

14B at Q4 fits comfortably on the 12GB cards, 16K context included. 30B is out of reach without heavy CPU offloading. The 3080's 10GB makes it best treated as a fast 7B-8B card.

16GB VRAM (RTX 4060 Ti 16GB, 4070 Ti SUPER, 4080 series)

14B runs well at Q6 or even Q8. 30B is reachable with aggressive quantization (Q4_K_M is ~14GB), though context headroom gets tight.

24GB VRAM (RTX 3090, 3090 Ti, RTX 4090)

30B at Q4 fits with full 16K context. 70B still requires partial CPU offload or a second card.

Recommendations by Use Case

Budget Build (Under $400)

RTX 3060 12GB — Used market is ~$200-250. Handles 7B-14B at Q4-Q5. Not blazing fast but totally usable for coding assistants and chat.

Best Value (~$600-800)

RTX 3090 24GB (used) — The king of value. $600-800 gets you 24GB VRAM, enough for 30B models. Still competitive with newer cards on token speed.

16GB Sweet Spot (~$750-1200)

RTX 4070 Ti SUPER 16GB — Best balance of speed and capacity. 72 t/s on 8B, 47 t/s on 14B. Can push into 30B territory with quantization.

No Compromise (~$2000+)

RTX 4090 24GB — Still the consumer king. 140 t/s on 30B models. Can run 70B with partial offload.

RTX 5090 32GB — New champion. 145 t/s on 8B, 142 t/s on 30B. Long context handling is in a league of its own.

Pro Tips

  1. Use LM Studio — It handles GPU layer allocation automatically and offers one-click downloads of quantized models.
  2. Lower context when possible — 4K-8K context uses significantly less VRAM than 16K+ and runs faster.
  3. Q4_K_M is the sweet spot — Good quality, fits in less VRAM, barely distinguishable from FP16 in chat use.
  4. Enable CPU offload — If a model is slightly too big, offloading layers to CPU RAM is slower but works.
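Tip 4 can be made concrete. llama.cpp exposes the GPU/CPU split as a number of GPU layers (the -ngl flag), and a rough way to pick it is to divide your free VRAM by the per-layer size. The model dimensions below are hypothetical examples, not benchmark figures:

```python
# Sketch of tip 4: estimate how many transformer layers fit on the GPU
# (llama.cpp's -ngl setting), leaving the rest in CPU RAM.
# Layer count and model size are made-up example numbers (assumption).

def n_gpu_layers(total_layers, model_gb, vram_gb, reserve_gb=1.5):
    """Layers that fit on the GPU, keeping reserve_gb free for KV cache etc."""
    per_layer_gb = model_gb / total_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(total_layers, fit))

# Hypothetical 30B-class model (~18 GB at Q4) on a 12 GB card:
print(n_gpu_layers(48, 18.0, 12.0))  # -> 28 (of 48 layers on GPU)
```

Every layer kept on the GPU helps, so even a partial split is usually much faster than pure CPU inference.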

Bottom Line

If you have a 12GB card, stick to 7B-14B models. If you want 30B capability, you need 16GB+ or, better yet, a used 24GB RTX 3090 — it's still the best price-to-VRAM ratio in 2026.