Local coding
Run a coding LLM on your own machine. Hardware, models, and the OSS agents that replace (or extend) Claude Code.
Why local?
Privacy by default, no rate limits, predictable cost after the hardware, and it works offline. What you give up is the top of the curve: cloud frontier models still pull ahead on long-horizon agent work and on the most complex repo-wide reasoning. For completions, refactors, code review, and small multi-file edits, local in 2026 is competitive.
Step 1 - Pick your hardware tier
| Tier | Top model size at usable quant | Tokens/sec ballpark | Watts |
|---|---|---|---|
| Apple Silicon, 16GB unified | 7-9B Q4 | 15-30 | ~30W |
| Apple Silicon, 32-64GB unified | up to ~30B dense, or ~80B MoE | 20-40 | 40-80W |
| Apple Silicon, 256GB (M3 Ultra Mac Studio) | up to ~120B MoE, or 70B dense at low quant | 15-25 | ~120W |
| GPU, 16-24GB VRAM | 14-27B Q4 | 60-120 | 250-450W |
| GPU, 48GB+ VRAM | 70B-class at Q4 | 80-150 | 450W+ |
| CPU + 32GB RAM | 7-9B Q4 | 3-8 | ~65W |
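Two quick sanity checks on the table. First, weight memory at Q4 is roughly half a byte per parameter - a rule of thumb, not a spec - so you can estimate fit with back-of-envelope arithmetic (the 0.55 bytes/param figure and the KV-cache headroom below are assumptions):
# Q4 weights ~= 0.55 bytes/param, plus a few GB for KV cache and runtime overhead.
# MoE models are sized by TOTAL parameters, not active ones.
echo "27 * 0.55 + 3" | bc   # 27B dense at Q4 -> ~17.9 GB: needs 24GB VRAM or 32GB unified
echo "80 * 0.55 + 4" | bc   # 80B MoE at Q4 -> ~48 GB: fits a 64GB Mac
Second, the tokens/sec column is a ballpark. llama.cpp ships a llama-bench tool that measures your actual prompt-processing and generation speed for a given GGUF (model path is a placeholder):
llama-bench -m ./your-model-q4.gguf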
Step 2 - Pick your model
Match to your tier. Each links to its tracked page on this site.
- 8GB GPU / 16GB Mac. Qwen3.5-9B at Q4. Best fit we track for this tier; small open coders are still the thinnest part of the 2026 landscape, so calibrate expectations - good for inline completions, less good for autonomous edits.
- 16-24GB GPU / 32GB Mac. Mistral Small 3.2 24B or Qwen3.6-27B. The volume sweet spot. gpt-oss-20b is a permissive-license alternative that runs on a wider range of hardware.
- 48GB GPU / 64GB+ Mac. Qwen3-Coder-Next. 3B active / 80B total MoE, ~70.6% SWE-Bench Verified, 256K context, Apache 2.0. One of the strongest open-weights coders that fits on a single workstation today - and because only 3B parameters are active per token, it is fast for its memory footprint.
What we're not recommending and why: DeepSeek V4-Pro (~80.6% SWE-Bench Verified, 1T params, 1M context) and GLM-5.1 (754B MoE, 58.4% SWE-Bench Pro - the leading open-weights score) are both real options, but GLM-5.1 wants ~8x H100 minimum and DeepSeek V4 is in the same data-center bracket. They're outside what one developer fits under a desk.
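Whatever you pick, verify the fit before building a workflow on it. A sketch using Ollama (the model tag is illustrative - substitute whatever the registry actually names your pick):
ollama show qwen3-coder-next   # prints parameter count, quantization, context length
ollama run qwen3-coder-next "write fizzbuzz in Python"   # quick smoke test
ollama ps   # shows loaded size and whether layers spilled from GPU to CPU
If ollama ps reports a CPU/GPU split, drop a quant level or pick a smaller model; partial offload is usually where tokens/sec collapses.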
Step 3 - Pick your runtime
- Ollama - easiest install, wraps llama.cpp, has a model registry. Default for new users.
- LM Studio - GUI on top of llama.cpp / MLX. Pleasant on a Mac.
- llama.cpp - GGUF directly, no abstraction, every flag exposed (serving sketch after this list).
- MLX - Apple Silicon native. Faster than llama.cpp on M-series for many models.
- vLLM - production GPU serving. Overkill for one developer; right answer if you're sharing the machine.
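Whichever runtime you land on, the practical common denominator is an OpenAI-compatible HTTP endpoint for the agents below to target. A minimal llama.cpp sketch; model path, context size, and port are placeholders, and -ngl 99 assumes the whole model fits in VRAM:
# llama-server exposes /v1/chat/completions and friends.
llama-server -m ./qwen3-coder-next-q4.gguf -c 32768 -ngl 99 --port 8080
curl http://localhost:8080/v1/models   # smoke test
Ollama and LM Studio expose equivalent endpoints out of the box (Ollama on port 11434 by default).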
Step 4 - Pick the agent
Coming from Claude Code, you have two paths.
Keep Claude Code, point it at a local backend. Set ANTHROPIC_BASE_URL to the address of a local proxy that translates to and from the Anthropic Messages API. LiteLLM in front of Ollama is the common setup. You keep the workflow you know; you lose the parts of Claude Code's scaffold that were tuned to Claude's behavior. Sensible if local is your sometimes-mode, not your default.
Switch tools. Pick by where you live:
- Aider - terminal, git-aware, native BYOM, repo map, auto-commits. The closest analogue to Claude Code's feel and the cleanest swap if you're going local-first.
- Continue.dev - VSCode / JetBrains. Inline completions plus a chat sidebar, BYOM.
- Cline - VSCode visual agent with step-by-step approval and browser automation.
- Roo Code, OpenCode - newer entrants, worth tracking but less stable.
A note on tool-use: agents tuned for frontier cloud models (Claude Code, Cline) tend to over-call tools or burn context on local models with weaker function-calling. Aider's diff-based approach degrades more gracefully on small models.
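Concretely, if you pair Aider with a small local model, two flags help it degrade gracefully - both are standard Aider options, though the best values are model-dependent:
# udiff edits are more forgiving than tool calls for models with weak
# function-calling; a smaller repo map leaves context for the actual task.
aider --model ollama/qwen3-coder-next --edit-format udiff --map-tokens 1024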
Step 5 - Wire it up
Two minimal recipes. Substitute your model and quant.
Aider with Ollama:
ollama pull qwen3-coder-next:q4
aider --model ollama/qwen3-coder-next
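One gotcha: Ollama's default context window is small (4K in recent releases), which silently truncates Aider's repo map and diffs. Raise it server-side before starting the agent - the variable below exists as of Ollama 0.5.x, so check your version:
OLLAMA_CONTEXT_LENGTH=32768 ollama serve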
Claude Code at a local endpoint:
litellm --model ollama/qwen3-coder-next --port 4000
ANTHROPIC_BASE_URL=http://localhost:4000 claude
The proxy handles Anthropic-to-OpenAI translation. Expect rough edges around tool-use streaming.
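If the one-liner misbehaves, a config file gives LiteLLM more to work with. A minimal sketch, assuming LiteLLM's standard proxy config format and Ollama on its default port; the model name "local-coder" is arbitrary:
# Map a local Ollama model under a stable name, then start the proxy.
cat > litellm-local.yaml <<'EOF'
model_list:
  - model_name: local-coder
    litellm_params:
      model: ollama/qwen3-coder-next
      api_base: http://localhost:11434
EOF
litellm --config litellm-local.yaml --port 4000
curl http://localhost:4000/v1/models   # confirm the proxy is up before launching claude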
Honest limits
The gap to cloud frontier has narrowed but not closed. On SWE-Bench Verified (April 2026): Qwen3-Coder-Next sits around 70%; the strongest open coders that fit a workstation (Qwen3.6-27B at 77.2%, MiniMax M2.5 at 80.2%) reach the high 70s to low 80s; the largest open models (DeepSeek V4-Pro at 80.6%) sit just above them - against Claude Opus 4.7 at 87.6% on the same benchmark. The headline gap is real but narrower than a year ago. On the kinds of tasks that fit in a few thousand tokens of context - completions, single-file edits, code review, small refactors - a 24GB+ local setup is hard to distinguish from frontier. Where cloud still pulls ahead: long-horizon agent loops, refactors that touch hundreds of files, and tasks that depend on the model holding a plan together for tens of minutes. Treat single-digit benchmark deltas as noise; calibrate against your own repo.
Last updated May 2026.