February 25, 2026
Qwen3.5-35B-A3B: First Local LLM That Passes Real Coding Tests
We're done playing with toy benchmarks. A new Qwen3.5 variant just became the first local LLM to pass a real-world coding recruitment test, the same kind of assessment companies have used for years to evaluate mid-level mobile developers. It completed in 10 minutes what took human candidates 5 hours.
The Numbers That Matter
Let's get straight to it:
Hardware: Single RTX 3090 (24GB VRAM)
Model: Qwen3.5-35B-A3B-MXFP4_MOE.gguf
Context: 131,072 tokens
Speed: 100+ tokens/second
That's not a toy benchmark. That's not a trivial "write a function" test. That's a full recruitment assessment—database design, API implementation, error handling—the works.
What Makes This Different
Local LLMs have been "good at coding" for a while now. But there's a massive gap between:
- Writing code snippets — "Write me a quicksort"
- Agentic coding — "Here's a codebase, fix this bug, run the tests, handle the errors"
Most local models fail at agentic coding because they're either too slow or can't handle the long context that real projects require. Qwen3.5-35B-A3B solves both:
| Metric | Qwen3.5-35B-A3B | Previous Best (Local) | GPT-4o (Cloud) |
|---|---|---|---|
| Token Speed (RTX 3090) | 100+ t/s | 40-60 t/s | ~80 t/s |
| SWE-Bench Verified | 68.2% | ~55% | 71.3% |
| VRAM Required (FP4) | ~22 GB | 28-32 GB | N/A |
| Context Window | 131K | 32K-128K | 128K |
The MoE Secret Sauce
This isn't a standard dense model. Qwen3.5-35B-A3B is a Mixture-of-Experts (MoE) model with:
- 35B total parameters
- Only 3B active per token — hence "A3B"
- MXFP4 quantization — 4-bit mixed precision for massive VRAM savings
This is the same architecture trick that made DeepSeek-V3 famous, except Alibaba executed it better for local hardware. You get 35B parameter quality at ~22GB VRAM footprint—fitting on a single RTX 3090 or 4090.
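As a sanity check on that ~22GB figure, here's back-of-the-envelope weight-memory arithmetic. This is my own estimate, not an official spec: the 4.5 effective bits/weight for the quantized case is an assumption covering quantization scales and metadata, and the result excludes KV cache and activation memory.

```python
# Rough weight-memory estimate for a 35B-parameter model. Note that MoE
# saves *compute* (only 3B params active per token), not weight memory:
# all 35B parameters must still sit in VRAM. The VRAM savings come from
# the 4-bit MXFP4 quantization.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Gigabytes needed to hold `params_b` billion weights."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_gb(35, 16)   # dense half precision
fp4  = weight_gb(35, 4.5)  # assumed ~4.5 effective bits incl. scales

print(f"FP16 weights: ~{fp16:.0f} GB")   # far beyond any 24GB card
print(f"MXFP4 weights: ~{fp4:.1f} GB")   # fits an RTX 3090
```

That lands just under 20GB for the weights, which is consistent with the ~22GB figure once the KV cache and runtime overhead are added on top.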
What You Can Actually Run
🎮 RTX 3090 / 4090 (24GB)
Qwen3.5-35B-A3B at FP4: 100+ t/s. This is the sweet spot. Full agentic coding capability on hardware you can buy used for ~$800.
💻 RTX 4080 Super (16GB)
Qwen3.5-32B at Q4: ~50-60 t/s. A dense 32B at Q4 doesn't fully fit in 16GB, so expect to offload some layers to CPU. Slower but still usable. Consider Qwen3-Coder-Next for better coding performance at this tier.
🔧 RTX 4070 Ti (12GB)
Qwen3 14B at Q5: ~45 t/s. Not agentic-capable, but excellent for code completion and small refactoring tasks.
Setup Instructions
Want to try it? Here's the llama.cpp command that achieved the 100+ t/s result:
```shell
./llama.cpp/llama-server \
    -m /models/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
    -a "DrQwen" \
    -c 131072 \
    -ngl all \
    -ctk q8_0 \
    -ctv q8_0 \
    -sm none \
    -mg 0 \
    -np 1 \
    -fa on
```
Key flags: `-ngl all` offloads every layer to the GPU, `-c 131072` sets the 128K context, `-ctk q8_0`/`-ctv q8_0` quantize the KV cache to 8-bit so that context fits alongside the weights, and `-fa on` enables Flash Attention.
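Once the server is up, it exposes an OpenAI-compatible API (llama-server listens on port 8080 by default; adjust the URL if you changed it). Here's a minimal stdlib-only client sketch; the model name matches the `-a "DrQwen"` alias from the command above, and the prompt is just an example:

```python
import json
import urllib.request

# Minimal client for llama-server's OpenAI-compatible
# /v1/chat/completions endpoint. Assumes the server from the command
# above is running locally on the default port 8080.

def build_request(prompt: str,
                  base_url: str = "http://localhost:8080"):
    """Build the POST request for /v1/chat/completions."""
    payload = {
        "model": "DrQwen",  # matches the -a alias
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,  # keep it low for coding tasks
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Write a Python function that reverses a linked list."))
```

Anything that speaks the OpenAI API (editor plugins, agent frameworks) can point at the same endpoint, which is what makes the agentic workflows discussed earlier practical.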
The Bottom Line
If you've been waiting for local LLMs to be "good enough" for real coding work, the wait is over. Qwen3.5-35B-A3B on a single RTX 3090:
- Passes actual coding recruitment tests
- Runs at 100+ tokens/second
- Fits in 22GB VRAM
- Costs ~$0 in ongoing API fees
The only reasons to use cloud APIs for coding now are if you need the absolute highest quality (GPT-5, Claude 4 Opus) or don't have the hardware. If you're running 8x A100s in production, that's a different conversation. For developers who want a local coding assistant? You're covered.