Gemma 4 on Ollama: The Complete Local Deployment Guide That Actually Works

Gemma 4 dropped on Ollama this week, and if you’ve been waiting for a capable, permissively-licensed model that actually runs well on consumer hardware — this might be your answer.

Google’s Gemma 4 under Apache 2.0 means no usage restrictions, no licensing fees, and the full weight of the open-source community behind the tooling ecosystem. Combined with Ollama’s simplified deployment story, running a frontier-adjacent model on your own machine has never required less friction.

But “runs well” is doing a lot of work in that sentence. Here’s what actually works, what doesn’t, and what the benchmarks gloss over.

Why Ollama Changes the Game

Ollama has become the de facto standard for local model deployment not because it’s the most powerful — it’s because it removes the configuration overhead that kills local AI adoption. ollama run gemma:4-27b-it-q4_k_m and you’re running inference. No GPU layer calculations, no quantization scripts, no model card archaeology.

The Gemma 4 release on Ollama covers the full lineup:

E2B (2B params) — CPU viable, 4GB RAM
E4B (4B params) — GPU recommended but optional, 8GB VRAM
26B A4B — MoE architecture, activates 3.8B params/token, ~8GB VRAM
31B — Full model, requires 16GB+ VRAM

Each variant is a quantization variant. The Q4_K_M quantizations offered through Ollama strike the best balance of quality and VRAM for most hardware configurations.

The Real-World Performance Picture

The benchmark headlines for Gemma 4 are impressive. “Frontier-level performance at each size” is Google’s claim, and independent testing broadly supports it on standard benchmarks.

But here’s what the benchmarks don’t tell you:

On CPU: The E2B and E4B variants run at 15-25 tokens/second on a modern 12th-gen Intel CPU. That’s interactive — not fast, but usable for document review, coding assistance, and reasoning tasks. The moment you need 70 tokens/second, you’re going to GPU.

On GPU: The 26B A4B MoE variant at Q4_K_M hits 40-60 tokens/second on an RTX 4090. That’s genuine productivity speed — fast enough for real code generation, not just code review.

The sub-score problem: The r/LocalLLaMA comparison between Gemma 4 E4B and Qwen3.5-4B illustrates the gap. Qwen wins on raw benchmark scores — sometimes by 28 points on OlmOCR. But Gemma 4’s sub-scores on document task quality show better coherence and instruction following in practice. Benchmarks measure what they measure; what matters is what the model actually produces on your task.

What to Actually Run

The sweet spot for most practitioners:

For M-series Mac users: The 26B A4B at Q4_K_M fits in unified memory with headroom. Run it at 30-40 tokens/second depending on context length.

For RTX 4090 / equivalent: The 31B full model at Q4_K_M. 16GB VRAM used, 60+ tokens/second, full Gemma 4 capability.

For CPU-only or constrained hardware: E4B at Q4_K_M. Slower (15-25 tokens/second) but still genuinely useful for reading comprehension, summarization, and light coding.

The ollama run gemma:4-27b-it-q4_k_m command works as a starting point, but the community has produced optimized configs for each hardware tier worth exploring.

The Setup That Actually Works

# Pull the model
ollama pull gemma:4-27b-it-q4_k_m

# Run with appropriate context for coding
ollama run gemma:4-27b-it-q4_k_m --ctx 8192

# Or for document tasks
ollama run gemma:4-27b-it-q4_k_m --ctx 16384

The context window on Gemma 4 is competitive — 32K on the larger variants. For document processing, that means you can feed it a full technical spec without chunking. The MoE architecture on the 26B A4B means it’s not activating all parameters on every token — that’s where the efficiency win comes from.

Where It Falls Short

Gemma 4’s instruction following is strong but not Claude-level. On ambiguous requirements, it tends to pick a direction and commit rather than asking clarifying questions. For production code generation, that means more review cycles.

The other limitation: Google hasn’t released training data specifics. That’s standard for them, but it means you can’t audit for problematic content the way you can with fully transparent models. For internal tooling, that’s fine. For regulated environments, it may be a factor.

For most developers looking for a capable daily driver that they can run anywhere — their MacBook, their workstation, a $500 GPU build — Gemma 4 on Ollama is currently one of the best cost-to-capability ratios available. April 2026 is a good time to be running local.

Sources: