Llama.cpp's Multi-Token Prediction: The Speed Boost Your Local AI Has Been Waiting For

by TopClanker

Multi-Token Prediction lets your local model generate 2-4 tokens in a single forward pass instead of one. Llama.cpp just added MTP support — here are the real benchmarks and what it means for your hardware.

Llama.cpp quietly added Multi-Token Prediction support a few weeks back. If you’re running local AI on anything from an RTX 3090 to an M-series Mac, this is the update that matters.

MTP isn’t a quantization trick or a kernel hack. It’s a fundamental change to how the model generates text — predicting multiple tokens in parallel instead of one at a time. The results are real. The speedups are significant. And the tradeoff is predictable.

What Is Multi-Token Prediction?

Standard language models work sequentially. Given a prompt, they predict one token, append it, then predict the next. One forward pass per token. That works, but it’s slow — especially on larger models where every pass through the network requires moving gigabytes of weights from memory to compute.

MTP changes this by training the model to predict several tokens at once. During inference, a single forward pass produces multiple token predictions simultaneously, say tokens +1, +2, +3, and +4. These aren't all accepted outright; they're verified in order against the model's own predictions, and the block is cut off at the first token that fails the check. The key is that the expensive full forward pass happens once per block of tokens, not once per token.
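To make the control flow concrete, here is a toy Python sketch of both loops. Nothing below is llama.cpp code; next_token, draft_block, and accepts are hypothetical stand-ins for the real forward pass and verification step, just to show where the expensive work happens.

# Toy contrast between one-token-at-a-time decoding and MTP-style block decoding.
# next_token(), draft_block(), and accepts() are stand-ins, not llama.cpp APIs.

VOCAB = ["the", "cat", "sat", "on", "the", "mat", "."]

def next_token(ctx):
    # Stand-in for a full forward pass: in reality this is the expensive step.
    return VOCAB[len(ctx) % len(VOCAB)]

def draft_block(ctx, k):
    # Stand-in for one MTP forward pass that proposes k tokens at once.
    return [VOCAB[(len(ctx) + i) % len(VOCAB)] for i in range(k)]

def accepts(ctx, tok):
    # Stand-in for the verification step; real acceptance rates are reported around 90%.
    return tok == next_token(ctx)

def generate_standard(prompt, n):
    out = list(prompt)
    for _ in range(n):                       # one expensive pass per token
        out.append(next_token(out))
    return out

def generate_mtp(prompt, n, k=4):
    out, target = list(prompt), len(prompt) + n
    while len(out) < target:
        for tok in draft_block(out, k):      # one expensive pass per block of k tokens
            if len(out) >= target:
                break
            if accepts(out, tok):            # keep verified tokens...
                out.append(tok)
            else:                            # ...fall back to a normal single-token pass
                out.append(next_token(out))
                break
    return out

print(generate_standard(["hello"], 6))
print(generate_mtp(["hello"], 6))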

The analogy that works: traditional decoding is typing one letter at a time. MTP is more like having the model write out a whole phrase, then checking each word as it goes. The checking is still sequential, but you're not reloading the dictionary between every keystroke.

This is closely related to speculative decoding — but where speculative decoding typically uses a separate draft model, MTP bakes the multi-token capability into the main model architecture via auxiliary prediction heads. Fewer moving parts, tighter integration, lower overhead.

The Real Benchmarks

Here’s what this actually looks like on real hardware:

Model          Hardware                Baseline (tok/s)   With MTP (tok/s)   Improvement
Qwen3.5 27B    RTX 3090                15.3               23.3               +52%
Qwen3.6 27B    RTX 3090                38                 65                 +71%
Gemma 4 (9B)   M1 Max Mac              12.4               17.7               +43%
Gemma 4        Various consumer GPUs   (varies)           (varies)           ~+40%

Data sourced from r/LocalLLaMA community benchmarks, DataCamp MTP tutorial testing, and startup fortune reporting on the llama.cpp beta implementation.
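If you want to sanity-check the Improvement column, it is just the percentage change in throughput. A quick check in Python, with the numbers copied from the table above:

# Improvement = percentage change from baseline throughput to MTP throughput.
results = {
    "Qwen3.5 27B (RTX 3090)": (15.3, 23.3),
    "Qwen3.6 27B (RTX 3090)": (38.0, 65.0),
    "Gemma 4 9B (M1 Max)":    (12.4, 17.7),
}
for name, (baseline, mtp) in results.items():
    print(f"{name}: +{(mtp / baseline - 1) * 100:.0f}%")
# Prints +52%, +71%, and +43%, matching the table.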

The Qwen3.6 results (38 → 65 tok/s on the same 24GB RTX 3090) are the most striking. That’s a 71% throughput gain on a 27B model — the kind of improvement that moves local AI from “functional” to “genuinely interactive” for longer context windows.

Gemma 4 shows consistent ~40% gains across hardware, which suggests the improvement generalizes well across different GPU architectures and model sizes.

How MTP Actually Works in Llama.cpp

Llama.cpp implements MTP via additional prediction heads trained on the same training data as the base model. At each decoding step, these heads predict future tokens in parallel. The system then verifies predictions sequentially using the transformer’s attention mechanism.

Key implementation details from the llama.cpp PR:

  • MTP is enabled via a --mtp flag in the CLI (or corresponding parameter in the API)
  • Prediction depth (number of tokens predicted per forward pass) is configurable — higher values mean more parallelism but also more VRAM overhead
  • The acceptance rate for MTP predictions runs around 90% in practice, meaning most predicted tokens are accepted without correction; that is why the speedup is real and not just theoretical (see the back-of-the-envelope sketch after this list)
  • Models need to have been trained with MTP support to benefit. Standard GGUF models without MTP training won’t see gains.
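As a back-of-the-envelope model of why a ~90% acceptance rate matters: with prediction depth k and per-token acceptance probability p, each expensive pass yields on average roughly 1 + p + p^2 + ... + p^k tokens. This is a simplification that assumes independent acceptances and ignores the cost of drafting and verification:

# Simplified expected-throughput model for block drafting.
# Assumes each drafted token is accepted independently with probability p
# and ignores the extra compute spent drafting and verifying.

def expected_tokens_per_pass(p, k):
    # 1 guaranteed token, plus each of the k drafted tokens, which only counts
    # if it and every token before it in the block were accepted.
    return sum(p ** i for i in range(k + 1))

for depth in (2, 4):
    print(f"depth {depth}: ~{expected_tokens_per_pass(0.9, depth):.2f} tokens per pass")
# depth 2: ~2.71, depth 4: ~4.10. Measured wall-clock gains (+40-70%) are lower
# because drafting and verification are not free, but the direction matches.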

Hardware Requirements: VRAM Is the Trade

Here’s the catch: MTP isn’t free. Those extra prediction heads need to live somewhere, and that somewhere is VRAM.

MTP increases VRAM usage by roughly 15-25% depending on the model size and prediction depth. You’re trading memory for speed — exactly the kind of tradeoff that defines local AI optimization.

Practical VRAM impact:

Model   Standard VRAM (Q4_K_M)   With MTP (Q4_K_M)
7B      ~4-5 GB                  ~5-6 GB
14B     ~8-10 GB                 ~10-12 GB
27B     ~14-16 GB                ~17-20 GB
32B     ~18 GB                   ~22 GB

This means a 16GB GPU that just barely fits a 27B model at Q4 won't fit it with MTP enabled. You'll need to drop to a smaller quant (Q3_K_M, for example), shrink the context window, or step down to a smaller model. Plan accordingly.
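For planning, the overhead is easy to estimate yourself. Here is a small helper using the 15-25% range from above; the base footprints are the same ballpark Q4_K_M figures as the table, not measurements of any particular GGUF file:

# Rough VRAM planner: ballpark Q4_K_M footprint plus the quoted 15-25% MTP overhead.
# These are estimates for weights only; KV cache and context length add more on top.

BASE_Q4_VRAM_GB = {"7B": 4.5, "14B": 9.0, "27B": 15.0, "32B": 18.0}

def with_mtp(base_gb, low=0.15, high=0.25):
    return base_gb * (1 + low), base_gb * (1 + high)

for size, base in BASE_Q4_VRAM_GB.items():
    lo, hi = with_mtp(base)
    print(f"{size}: ~{base:.0f} GB baseline -> ~{lo:.1f}-{hi:.1f} GB with MTP")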

Which Models Support MTP?

Not every model works with MTP out of the box. The model needs to have been trained with multi-token prediction objectives — most standard models don’t have this by default.

Models confirmed to work with MTP in llama.cpp:

  • Qwen3.5 (various sizes, 27B shows the biggest gains)
  • Qwen3.6 (27B variant shows strong results)
  • Gemma 4 (9B and larger variants)
  • Mistral (specific fine-tuned variants — check model card for MTP support)

Check the model's Hugging Face card or the llama.cpp GitHub discussion for explicit MTP compatibility before assuming --mtp will work. Running MTP on a model that wasn't trained for it will either error out or produce garbage output.
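If you'd rather inspect the file itself than trust a model card, the gguf Python package (the same gguf-py that ships with llama.cpp) can dump a model's metadata keys. Which key, if any, advertises MTP heads depends on how the model was exported, so this only shows you what is there to grep through:

# List every metadata key in a GGUF file so you can look for MTP-related fields.
# Requires: pip install gguf
import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])          # path to your .gguf file
for key in reader.fields:                 # e.g. general.architecture, *.block_count, ...
    print(key)
print(f"{len(reader.tensors)} tensors")   # extra prediction heads would appear as additional tensors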

How to Enable MTP in Llama.cpp

Assuming your model supports MTP, enabling it is straightforward.

CLI:

./llama-cli -m your-model-q4_k_m.gguf \
  --ctx-size 4096 \
  --mtp 4 \
  -p "You are a helpful assistant."

The --mtp 4 flag tells llama.cpp to predict 4 tokens per forward pass. You can try 2, 4, or higher — experiment based on your VRAM budget.
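If you run llama.cpp as a local server (llama-server) rather than through llama-cli, the MTP setting stays on the server's command line and nothing changes for clients, since the feature is transparent to callers. A minimal request against the server's OpenAI-compatible endpoint, assuming the default port 8080:

# Minimal client for a local llama-server instance; MTP (if enabled) is a
# server-side launch option, so the request itself is unchanged.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Explain multi-token prediction in one paragraph."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])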

LM Studio: MTP support is progressively rolling out in LM Studio’s nightly builds. Check the release notes for your version. Not all UI versions expose MTP configuration yet, but the underlying llama.cpp support is there.

Ollama: Ollama has not yet merged full MTP support as of this writing. If MTP is critical for your use case, use llama.cpp directly or LM Studio.

The Tradeoff Summary

Factor                 Standard Decoding   MTP
Tokens/sec             Baseline            +40-70%
VRAM usage             Baseline            +15-25%
Model compatibility    Any GGUF            MTP-trained only
Quality impact         None                None (at 90%+ acceptance)
Setup complexity       Minimal             Flag + compatible model

MTP is a clear win if you have the VRAM headroom. If you’re running a 27B model on a 24GB RTX 3090 and you’re already near your VRAM limit, you’ll need to drop quantization or reduce context length to enable MTP. Whether that tradeoff makes sense depends on whether you value throughput more than context window.

The Bigger Picture

MTP represents a meaningful architectural shift in how local inference works. For the past two years, most gains came from quantization, kernel optimization (CUDA/Metal), and KV cache improvements — all tweaks to the runtime, not the model. MTP changes what the model itself does at generation time.

That matters for the trajectory of local AI. As more models get trained with MTP support, and as llama.cpp continues to refine the implementation, we’ll see the gap between local and cloud inference continue to narrow on throughput even if it persists on model size.

The RTX 3090 running Qwen3.6 at 65 tok/s with MTP is not the same as a cloud API — but it’s getting closer on responsiveness. For use cases where first-token latency and streaming throughput matter (coding copilots, interactive agents, long-context Q&A), that difference is noticeable.

Sources