February 25, 2026

Qwen3.5-35B-A3B: First Local LLM That Passes Real Coding Tests

AI Local Coding

We're done playing with toy benchmarks. A new Qwen3.5 variant just became the first local LLM to pass a real-world coding recruitment test: the same kind of test companies have used for years to evaluate mid-level mobile developers. It finished in 10 minutes a task budgeted at 5 hours for human candidates.

The Numbers That Matter

Let's get straight to it:

Hardware: Single RTX 3090 (24GB VRAM)

Model: Qwen3.5-35B-A3B-MXFP4_MOE.gguf

Context: 131,072 tokens

Speed: 100+ tokens/second

That's not a toy benchmark. That's not a trivial "write a function" test. That's a full recruitment assessment—database design, API implementation, error handling—the works.

What Makes This Different

Local LLMs have been "good at coding" for a while now. But there's a massive gap between:

  • Writing code snippets — "Write me a quicksort"
  • Agentic coding — "Here's a codebase, fix this bug, run the tests, handle the errors"

Most local models fail at agentic coding because they're too slow or too dumb to handle the context window needed for real projects. Qwen3.5-35B-A3B solves both:

| Metric                  | Qwen3.5-35B-A3B | Previous Best (Local) | GPT-4o (Cloud) |
|-------------------------|-----------------|-----------------------|----------------|
| Token Speed (RTX 3090)  | 100+ t/s        | 40-60 t/s             | ~80 t/s        |
| SWE-Bench Verified      | 68.2%           | ~55%                  | 71.3%          |
| VRAM Required (FP4)     | ~22 GB          | 28-32 GB              | N/A            |
| Context Window          | 131K            | 32K-128K              | 128K           |

The MoE Secret Sauce

This isn't a standard dense model. Qwen3.5-35B-A3B is a Mixture-of-Experts (MoE) model with:

  • 35B total parameters
  • Only 3B active per token — hence "A3B"
  • MXFP4 quantization — 4-bit mixed precision for massive VRAM savings

This is the same architecture trick that made DeepSeek-V3 famous, except Alibaba tuned it better for local hardware. You get 35B-parameter quality in a ~22 GB VRAM footprint, small enough to fit on a single RTX 3090 or 4090.
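The VRAM claim survives a back-of-the-envelope check. A quick sketch, assuming MXFP4 costs roughly 4.25 effective bits per weight (4-bit elements plus shared block scales); the exact figure varies by implementation:

```shell
# Rough weight-memory estimate for a 35B-parameter model at MXFP4.
# 4.25 bits/weight is an approximation covering block-scale overhead.
awk 'BEGIN {
  params  = 35e9
  bits    = 4.25
  weights = params * bits / 8 / 1e9
  printf "weights: %.1f GB\n", weights
}'
```

That lands around 18.6 GB for the weights alone; a few more GB of KV cache and runtime buffers gets you to the ~22 GB footprint quoted above.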

What You Can Actually Run

🎮 RTX 3090 / 4090 (24GB)

Qwen3.5-35B-A3B at FP4: 100+ t/s. This is the sweet spot. Full agentic coding capability on hardware you can buy used for ~$800.

💻 RTX 4080 Super (16GB)

Qwen3.5-32B at Q4: ~50-60 t/s. Slower but still usable. Consider Qwen3-Coder-Next for better coding performance at this tier.

🔧 RTX 4070 Ti (12GB)

Qwen3 14B at Q5: ~45 t/s. Not agentic-capable, but excellent for code completion and small refactoring tasks.
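If you want to script the tier choice, the recommendations above boil down to a simple VRAM lookup. A toy sketch; the model/quant pairings are just this post's suggestions, not an official compatibility matrix:

```shell
# Suggest a model/quant from available VRAM (in GB), per the tiers above.
suggest_model() {
  vram=$1
  if   [ "$vram" -ge 24 ]; then echo "Qwen3.5-35B-A3B @ MXFP4"
  elif [ "$vram" -ge 16 ]; then echo "Qwen3.5-32B @ Q4 (or Qwen3-Coder-Next)"
  elif [ "$vram" -ge 12 ]; then echo "Qwen3 14B @ Q5"
  else                           echo "Try a 7B-8B model at Q4"
  fi
}

suggest_model 24
```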

Setup Instructions

Want to try it? Here's the llama.cpp command that achieved the 100+ t/s result:

./llama.cpp/llama-server \
  -m /models/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
  -a "DrQwen" \
  -c 131072 \
  -ngl all \
  -ctk q8_0 \
  -ctv q8_0 \
  -sm none \
  -mg 0 \
  -np 1 \
  -fa on

Key flags: -ngl all offloads everything to GPU, -c 131072 sets 128K context, -fa on enables Flash Attention.
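Once it's running, llama-server speaks an OpenAI-compatible API, so you can smoke-test it with curl. A usage sketch, assuming the default port 8080 (change it if you passed --port) and the "DrQwen" alias set with -a above:

```shell
# Smoke test against llama-server's OpenAI-compatible chat endpoint.
# Requires the server from the command above to be running.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DrQwen",
        "messages": [
          {"role": "user", "content": "Write a Python function that reverses a linked list."}
        ],
        "max_tokens": 512
      }'
```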

The Bottom Line

If you've been waiting for local LLMs to be "good enough" for real coding work, the wait is over. Qwen3.5-35B-A3B on a single RTX 3090:

  • Passes actual coding recruitment tests
  • Runs at 100+ tokens/second
  • Fits in 22GB VRAM
  • Costs ~$0 in ongoing API fees

The only reasons to use cloud APIs for coding now are if you need the absolute highest quality (GPT-5, Claude 4 Opus) or don't have the hardware. Teams running 8x A100s in production? That's a different conversation. For developers wanting a local coding assistant, you're covered.