AI Aggregator

Categories

Best small LLMs for reasoning

Models that think before answering. Small specialists nearly match frontier-scale on math and logic.

The category that shifted hardest in 2025-2026. A well-trained 14B reasoner now matches or beats 70B-class general models on AIME, GPQA Diamond, and competition-style code, while running on a 16GB GPU.

The catch: reasoning specialists trade conversational warmth for depth. Most production stacks use a fast generalist for casual prompts and route hard ones to a specialist.

What we look for

  • Eval honesty - discount benchmarks where contamination is plausible.
  • Trace quality - does the chain-of-thought help, or just look thoughtful?
  • Latency budget - reasoning runs spend thousands of tokens.
  • Toggle support - switchable thinking modes (Qwen3) let you spend tokens only when it matters.

Ranked for a reasoning specialist behind a router.

Picks

  1. #1 Qwen3.6-27B 27.0B · Apache 2.0

    Flagship-level coding in a 27B dense footprint. SWE-Bench Verified 77.2%, Terminal-Bench 2.0 59.3% (matches Claude 4.5 Opus). 262K native context, multimodal, Apache 2.0.

  2. #2 Gemma 4 31B 31.0B · Apache 2.0

    31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces ELO 2150 - leads open dense models in its size class for math and competitive programming. Bridges 'serious work' and 'fits on a 24-48GB GPU'.

  3. #3 Qwen3.5-9B 9.0B · Apache 2.0

    Native multimodal at the 9B mark. 262K context (1M with YaRN). Apache 2.0. Early-fusion training rolls vision into the base model rather than bolting on a separate encoder.

  4. #4 Nemotron 3 Nano 30B-A3B 3.5B · NVIDIA Nemotron Open Model License

    Hybrid Mamba2-Transformer-MoE: 3.5B active out of 30B total, 256K default context (1M max). Trained from scratch on 25T tokens. Strong agentic and tool-calling post-training.

  5. #5 gpt-oss-20b 3.6B · Apache 2.0

    OpenAI's small open-weight model. 21B total / 3.6B active MoE, runs in 16GB at MXFP4. Configurable reasoning effort (low/medium/high). Matches o3-mini on common reasoning evals.

  6. #6 Phi-4 Reasoning 14B 14.0B · MIT

    Punches above its weight on reasoning. Beats DeepSeek-R1-Distill-Llama-70B on AIME and GPQA at 5x smaller. Comparable to full DeepSeek-R1 (671B) on AIME 2025. MIT license.

  7. #7 Qwen3-8B Instruct 8.2B · Apache 2.0

    Strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, 131K with YaRN. Hybrid 'thinking' mode you can toggle per request.

  8. #8 Phi-4-mini 3.8B 3.8B · MIT

    MIT license, 67% MMLU at 3.8B. Inherits the Phi reasoning lineage in a small footprint. 128K context, 200K-token vocabulary for multilingual support. Function-calling support.