Best small LLMs for coding

Code completion, generation, and review. Small specialists now beat generalist 70Bs.

Coding is one of the few domains where small models genuinely caught up. A well-trained 7-14B coder beats a generalist 70B on most public benchmarks for popular languages and runs on a consumer laptop.

What we look for

  • Pass@1 on practical evals (HumanEval, MBPP, LiveCodeBench), discounted for benchmark overfitting.
  • Multi-language coverage - TypeScript, Go, Rust, SQL, not just Python.
  • Fill-in-the-middle (FIM) support - required for editor integrations.
  • Long context - repo-scale reasoning needs 32K+ tokens.
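When comparing pass@1 claims across models, it helps to score them the same way. A minimal sketch of the standard unbiased pass@k estimator (the formula popularized alongside HumanEval); the per-task sample counts below are made up for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.
    n = samples generated per task, c = samples that passed, k = attempt budget."""
    if n - c < k:
        return 1.0  # too few failures left for k draws to all miss
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative run: 3 tasks, 10 samples each, (n, c) per task.
results = [(10, 7), (10, 0), (10, 2)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(score)  # average pass@1 across tasks
```

Averaging the estimator over tasks, rather than counting raw first-sample hits, keeps the number stable across different sampling temperatures.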

Ranked for someone building developer tooling.
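On the FIM point: an editor integration sends the code before and after the cursor and asks the model to generate the middle. A sketch of the prefix-suffix-middle prompt shape, using Qwen-style sentinel tokens - other model families use different token names, so check the tokenizer config before relying on these:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prefix-suffix-middle (PSM) fill-in-the-middle prompt.
    Sentinel tokens here follow the Qwen coder family; e.g. StarCoder-style
    models use <fim_prefix>/<fim_suffix>/<fim_middle> instead."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(2, 3))\n",
)
# The model is expected to emit only the middle span (here: "a + b").
```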

Picks

  1. Gemma 4 31B · 31.0B · Apache 2.0

    31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces Elo 2150 - leads open dense models in its size class for math and competitive programming. Bridges 'serious work' and 'fits on a 24-48GB GPU'.

  2. Qwen3-Coder-Next · 3.0B active / 80B total · Apache 2.0

    MoE coder built for agentic workflows. 3B active / 80B total. >70% on SWE-Bench Verified with the SWE-Agent scaffold. 256K native context. Apache 2.0.
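A rough sketch of what 'agentic' means in practice: the model drives a tool loop instead of emitting one completion. The model call is stubbed and the `run_tests` tool is hypothetical - this shows the shape of the scaffold, not SWE-Agent's actual internals:

```python
import json

def run_tests(path: str) -> str:
    """Hypothetical tool: pretend to run a test file and report the result."""
    return json.dumps({"path": path, "passed": True})

TOOLS = {"run_tests": run_tests}

def fake_model(transcript: list) -> dict:
    """Stub standing in for a real chat endpoint. A real agent would POST the
    transcript to the model server and parse a tool call out of the reply."""
    if not any(m["role"] == "tool" for m in transcript):
        return {"tool": "run_tests", "args": {"path": "tests/test_patch.py"}}
    return {"answer": "Patch verified; tests pass."}

def agent_loop(task: str, max_steps: int = 4) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = fake_model(transcript)
        if "answer" in action:            # model decided it is done
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])  # execute the tool
        transcript.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(agent_loop("Fix the failing test"))
```

The SWE-Bench numbers quoted above depend heavily on this scaffold; the same model scores very differently under different loops.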

  3. Mistral Small 3.2 24B · 24.0B · Apache 2.0

    Apache 2.0 mid-size all-rounder. ~81% MMLU at 150 t/s - roughly 3x faster than Llama 3.3 70B at similar quality. 128K context. Vision support added in the 3.x line.

  4. Phi-4 Reasoning 14B · 14.0B · MIT

    Punches above its weight on reasoning. Beats DeepSeek-R1-Distill-Llama-70B on AIME and GPQA while 5x smaller. Comparable to full DeepSeek-R1 (671B) on AIME 2025. MIT license.

  5. Qwen3-8B Instruct · 8.2B · Apache 2.0

    Strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, 131K with YaRN. Hybrid 'thinking' mode you can toggle per request.
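The per-request toggle is a soft switch in the prompt itself. A sketch, assuming Qwen3's documented `/think` and `/no_think` markers (server-side you can instead pass `enable_thinking` to the chat template):

```python
def build_messages(prompt: str, thinking: bool) -> list:
    """Qwen3 soft switch: appending /no_think to the user turn suppresses the
    thinking trace for that request; /think (or no marker) re-enables it."""
    suffix = "" if thinking else " /no_think"
    return [{"role": "user", "content": prompt + suffix}]

fast = build_messages("Write a binary search in Go.", thinking=False)
slow = build_messages("Prove this invariant holds.", thinking=True)
```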

  6. Phi-4-mini · 3.8B · MIT

    MIT license, 67% MMLU at 3.8B. Inherits the Phi reasoning lineage in a small footprint. 128K context, 200K-token vocabulary for multilingual support. Function-calling support.
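For function calling, the common pattern is to hand the model an OpenAI-style tool schema and dispatch the JSON call it emits yourself. A sketch with a hypothetical `get_weather` tool - the exact template varies by model, so check the model card:

```python
import json

# OpenAI-style tool schema; function name and parameters are illustrative.
tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch(call_json: str, registry: dict) -> str:
    """Parse a model-emitted tool call and run the matching local function."""
    call = json.loads(call_json)
    return registry[call["name"]](**call["arguments"])

registry = {"get_weather": lambda city: f"Sunny in {city}"}
print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}', registry))
```

The model never executes anything itself; your dispatcher is the trust boundary, which is why validating the emitted JSON matters.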