AI Aggregator

Best small LLMs for coding

Code completion, generation, and review. Small specialists now beat generalist 70Bs.

Coding is one of the few domains where small models have genuinely caught up. A well-trained 7-14B coder beats a generalist 70B on most public benchmarks for popular languages, and it runs on a consumer laptop.

What we look for

  • Pass@1 on practical evals (HumanEval, MBPP, LiveCodeBench), discounted where a model looks over-fit to the benchmark.
  • Multi-language coverage: TypeScript, Go, Rust, SQL, not just Python.
  • Fill-in-the-middle support, required for editor integrations.
  • Long context: repo-scale reasoning needs 32K+ tokens.
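
As a rough illustration of the Pass@1 criterion above: HumanEval-style evals usually sample n completions per problem, count the c that pass the unit tests, and report the unbiased pass@k estimate 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes that estimator; the function names and sample counts are illustrative and not taken from any particular harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical run: three problems, ten samples each.
    runs = [(10, 3), (10, 10), (10, 0)]
    print(f"pass@1 = {benchmark_pass_at_k(runs, k=1):.3f}")
```

For k = 1 the estimate reduces to c / n per problem, so Pass@1 is simply the average fraction of samples that pass.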

Ranked for someone building developer tooling.
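
For tooling builders, the fill-in-the-middle requirement above comes down to prompt layout: the editor sends the text before and after the cursor, and the model generates the gap. The sketch below assumes a prefix-suffix-middle layout with Qwen2.5-Coder-style sentinel strings; other model families use different tokens, so treat `FimTokens` and both helper functions as hypothetical names and check each model's card before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FimTokens:
    # Assumed sentinel strings; they vary by model family.
    prefix: str = "<|fim_prefix|>"
    suffix: str = "<|fim_suffix|>"
    middle: str = "<|fim_middle|>"


def build_fim_prompt(source: str, cursor: int, tokens: FimTokens = FimTokens()) -> str:
    """Lay out the text around the cursor in prefix-suffix-middle order."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie within the source text")
    before, after = source[:cursor], source[cursor:]
    # The model is expected to continue generating after the middle sentinel.
    return f"{tokens.prefix}{before}{tokens.suffix}{after}{tokens.middle}"


def splice_completion(source: str, cursor: int, completion: str) -> str:
    """Insert the model's completion back into the buffer at the cursor."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor must lie within the source text")
    return source[:cursor] + completion + source[cursor:]


if __name__ == "__main__":
    buffer = "def add(a, b):\n    return \n"
    cursor = buffer.index("return ") + len("return ")
    print(build_fim_prompt(buffer, cursor))
    print(splice_completion(buffer, cursor, "a + b"))
```

Keeping the sentinels in a small config object makes it cheap to swap models without touching the editor-side logic.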

Picks

  1. Qwen3.6-27B · 27.0B · Apache 2.0

    Flagship-level coding in a 27B dense footprint. SWE-Bench Verified 77.2%, Terminal-Bench 2.0 59.3% (matches Claude Opus 4.5). 262K native context, multimodal, Apache 2.0.

  2. Gemma 4 31B · 31.0B · Apache 2.0

    31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces Elo 2150; it leads open dense models in its size class for math and competitive programming. Bridges 'serious work' and 'fits on a 24-48GB GPU'.

  3. Qwen3.5-9B · 9.0B · Apache 2.0

    Native multimodal at the 9B mark. 262K context (1M with YaRN). Apache 2.0. Early-fusion training rolls vision into the base model rather than bolting on a separate encoder.

  4. Qwen3-Coder-Next · 3.0B active · Apache 2.0

    MoE coder built for agentic workflows. 3B active / 80B total. >70% on SWE-Bench Verified with the SWE-Agent scaffold. 256K native context. Apache 2.0.

  5. gpt-oss-20b · 3.6B active · Apache 2.0

    OpenAI's small open-weight model. 21B total / 3.6B active MoE, runs in 16GB at MXFP4. Configurable reasoning effort (low/medium/high). Matches o3-mini on common reasoning evals.

  6. Mistral Small 3.2 24B · 24.0B · Apache 2.0

    Apache 2.0 mid-size all-rounder. ~81% MMLU and 3x faster than Llama 3.3 70B at similar quality. 128K context. Vision support was added in the 3.x line.

  7. Phi-4 Reasoning 14B · 14.0B · MIT

    Punches above its weight on reasoning. Beats DeepSeek-R1-Distill-Llama-70B on AIME and GPQA despite being 5x smaller. Comparable to full DeepSeek-R1 (671B) on AIME 2025. MIT license.

  8. Qwen3-8B Instruct · 8.2B · Apache 2.0

    Strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, 131K with YaRN. Hybrid 'thinking' mode you can toggle per request.

  9. Phi-4-mini 3.8B · 3.8B · MIT

    MIT license, 67% MMLU at 3.8B. Inherits the Phi reasoning lineage in a small footprint. 128K context, 200K-token vocabulary for multilingual support. Function-calling support.
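
The hardware notes in the picks above ("runs in 16GB at MXFP4", "fits on a 24-48GB GPU") can be sanity-checked with back-of-envelope arithmetic: weight memory is roughly total parameters times bits per weight, divided by 8. The sketch below is a lower bound under stated assumptions: it counts weights only (no KV cache, activations, or runtime overhead), assumes every weight is stored in the named format, and treats MXFP4 as 4-bit values plus one shared 8-bit scale per block of 32. For MoE models, memory follows total parameters, not active ones. The function and table names are illustrative.

```python
# Approximate storage cost per weight, in bits. The MXFP4 entry assumes
# 4-bit elements plus one shared 8-bit scale per block of 32 weights.
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "int8": 8.0,
    "int4": 4.0,
    "mxfp4": 4.0 + 8.0 / 32.0,
}


def weight_memory_gb(total_params_billions: float, fmt: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes); a lower bound."""
    if total_params_billions <= 0:
        raise ValueError("parameter count must be positive")
    bits = BITS_PER_WEIGHT[fmt]  # raises KeyError for unknown formats
    return total_params_billions * bits / 8.0


if __name__ == "__main__":
    # Total parameter counts as quoted in the picks above.
    examples = [
        ("gpt-oss-20b", 21.0, "mxfp4"),
        ("Gemma 4 31B", 31.0, "int4"),
        ("Gemma 4 31B", 31.0, "int8"),
        ("Qwen3-Coder-Next", 80.0, "int4"),
    ]
    for name, total_b, fmt in examples:
        print(f"{name:<18} {fmt:<6} ~{weight_memory_gb(total_b, fmt):.1f} GB")
```

Under these assumptions, 21B total parameters at MXFP4 come to about 11 GB of weights, which leaves headroom inside the quoted 16GB, while an 80B-total MoE needs about 40 GB at 4 bits even though only 3B parameters are active per token.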