Best small LLMs for coding
Code completion, generation, and review. Small specialists now beat generalist 70Bs.
Coding is one of the few domains where small models have genuinely caught up. A well-trained 7-14B coder beats a generalist 70B on most public coding benchmarks for popular languages, and it runs on a consumer laptop.
What we look for
- Pass@1 on practical evals (HumanEval, MBPP, LiveCodeBench), discounted for benchmark overfitting.
- Multi-language coverage - TypeScript, Go, Rust, SQL, not just Python.
- Fill-in-the-middle (FIM) support, which editor integrations require (see the sketch after this list).
- Long context - repo-scale reasoning needs 32K+.
Ranked for someone building developer tooling.
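FIM is worth seeing concretely, since it is what separates a chat model from one an editor can actually use. A minimal sketch, assuming the Qwen2.5-Coder sentinel scheme (other families like StarCoder and CodeLlama use different tokens) and a local OpenAI-compatible completions server; the URL and serving name are placeholders:

```python
# Minimal fill-in-the-middle (FIM) request, as an editor integration would
# send it. Sentinel tokens follow the Qwen2.5-Coder scheme; other model
# families use different ones. URL and model name are assumptions.
import requests

def fim_prompt(prefix: str, suffix: str) -> str:
    # The model is asked to generate only the span between prefix and suffix.
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prefix = "def median(xs: list[float]) -> float:\n    xs = sorted(xs)\n"
suffix = "    return xs[mid]\n"

resp = requests.post(
    "http://localhost:8000/v1/completions",  # any OpenAI-compatible server
    json={
        "model": "qwen2.5-coder-7b",  # placeholder serving name
        "prompt": fim_prompt(prefix, suffix),
        "max_tokens": 64,
        "temperature": 0.2,
        "stop": ["<|endoftext|>"],
    },
)
print(resp.json()["choices"][0]["text"])  # the completed middle span
```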
Picks
- 31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces ELO 2150 - leads open dense models in its size class for math and competitive programming. Bridges the gap between 'serious work' and 'fits on a 24-48GB GPU'.
- MoE coder built for agentic workflows: 3B active / 80B total. >70% on SWE-Bench Verified with the SWE-Agent scaffold. 256K native context. Apache 2.0.
- Mistral Small 3.1 (24B) - Apache 2.0 mid-size all-rounder. ~81% MMLU at 150 t/s, 3x faster than Llama 3.3 70B at similar quality. 128K context. Vision support added in the 3.x line.
- Phi-4-reasoning (14B) - punches above its weight on reasoning. Beats DeepSeek-R1-Distill-Llama-70B on AIME and GPQA at 5x smaller. Comparable to full DeepSeek-R1 (671B) on AIME 2025. MIT license.
- Qwen3-8B - strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, 131K with YaRN. Hybrid 'thinking' mode you can toggle per request (see the sketch below).
- Phi-4-mini (3.8B) - MIT license, 67% MMLU. Inherits the Phi reasoning lineage in a small footprint. 128K context, 200K-token vocabulary for multilingual support. Function calling supported (example below).
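The per-request thinking toggle deserves a concrete look. A minimal Transformers sketch, assuming the Qwen3-8B identification above; the toggle is exposed through the model's chat template:

```python
# Toggling Qwen3's hybrid "thinking" mode per request via the chat template.
# With enable_thinking=False the template pre-fills an empty <think> block,
# so the model skips chain-of-thought and answers directly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Write a Go function that reverses a slice in place."}
]

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # flip to True for hard problems; costs latency
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Note the 131K figure is not out of the box: the native window is 32K, and the extended length requires enabling YaRN rope scaling in the model or serving config.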
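Function calling on a small local model is easiest to exercise through an OpenAI-compatible server (vLLM, llama.cpp's server, and similar all expose one). A hedged sketch: the base URL, serving name, and the run_tests tool schema are illustrative assumptions, not part of any model's API:

```python
# Function calling against a small local model behind an OpenAI-compatible
# endpoint. URL, serving name, and the run_tests tool are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool for illustration
        "description": "Run the project test suite and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="phi-4-mini-instruct",  # placeholder serving name
    messages=[{"role": "user", "content": "Run the tests under ./pkg and summarize."}],
    tools=tools,
)
# If the model decides to call the tool, tool_calls holds name + JSON args;
# otherwise it is None and the message content is a plain reply.
print(resp.choices[0].message.tool_calls)
```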