Best small LLMs for reasoning
Models that think before answering. Small specialists nearly match frontier-scale models on math and logic.
The category that shifted hardest in 2025-2026. A well-trained 14B reasoner now matches or beats 70B-class general models on AIME, GPQA Diamond, and competition-level coding, while running on a 16GB GPU.
The catch: reasoning specialists trade conversational warmth for depth. Most production stacks use a fast generalist for casual prompts and route hard ones to a specialist.
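To make the routing pattern concrete, here is a minimal sketch. The model names and the keyword heuristic are illustrative placeholders, not any vendor's API; production routers usually replace the regex with a small classifier model.

```python
# Minimal sketch of a two-tier router: a cheap heuristic decides whether a
# prompt goes to the fast generalist or the reasoning specialist.
# GENERALIST, SPECIALIST, and the classify heuristic are assumptions.
import re

GENERALIST = "fast-generalist"       # placeholder: your conversational model
SPECIALIST = "reasoning-specialist"  # placeholder: your thinking model

HARD_HINTS = re.compile(
    r"\b(prove|derive|integrate|complexity|theorem|debug|optimi[sz]e)\b",
    re.IGNORECASE,
)

def route(prompt: str) -> str:
    """Send math/logic/code-shaped prompts to the specialist."""
    looks_hard = bool(HARD_HINTS.search(prompt)) or "```" in prompt
    return SPECIALIST if looks_hard else GENERALIST

assert route("hey, how's it going?") == GENERALIST
assert route("Prove that sqrt(2) is irrational.") == SPECIALIST
```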
What we look for
- Eval honesty - discount benchmarks where contamination is plausible.
- Trace quality - does the chain-of-thought help, or just look thoughtful?
- Latency budget - reasoning runs spend thousands of tokens; a 4,000-token trace at 40 tok/s adds roughly 100 seconds before the answer starts.
- Toggle support - switchable thinking modes (Qwen3) let you spend tokens only when it matters; see the sketch below.
Ranked for a reasoning specialist behind a router.
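Here is what the toggle looks like in practice. Qwen3's chat template accepts an enable_thinking flag; the checkpoint ID and token budget below are assumptions for illustration, not a fixed recipe.

```python
# Sketch of Qwen3's per-request thinking toggle via the chat template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"  # assumed: any Qwen3 checkpoint with hybrid thinking
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def generate(prompt: str, think: bool) -> str:
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=think,  # spend reasoning tokens only when it matters
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=2048)
    return tokenizer.decode(out[0][inputs.input_ids.shape[-1]:],
                            skip_special_tokens=True)

# Cheap path for casual prompts, expensive path for hard ones:
# generate("What's the capital of France?", think=False)
# generate("Find all integer solutions to x^2 - 2y^2 = 1.", think=True)
```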
Picks
- 31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces Elo 2150 - leads open dense models in its size class for math and competitive programming. Bridges the gap between 'serious work' and 'fits on a 24-48GB GPU'.
- Punches above its weight on reasoning. Beats DeepSeek-R1-Distill-Llama-70B on AIME and GPQA despite being 5x smaller. Comparable to full DeepSeek-R1 (671B) on AIME 2025. MIT license.
- Strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, extendable to 131K with YaRN (serving config sketched after this list). Hybrid thinking mode you can toggle per request.
- MIT license, 67% MMLU at 3.8B. Inherits the Phi reasoning lineage in a small footprint. 128K context and a 200K-token vocabulary for multilingual coverage. Supports function calling.
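Stretching that 32K native window to 131K is a serving-time config change, not a different model. A minimal sketch using vLLM's offline API, assuming a Qwen3-8B checkpoint and vLLM's rope_scaling engine argument; the 4.0 factor follows from 131072 / 32768, and the exact argument shape may differ across vLLM versions.

```python
# Sketch: static YaRN rope scaling in vLLM to extend a 32K-native model
# to 131K. Model ID and argument names are assumptions; check your version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-8B",  # assumed: the 7-8B pick described above
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,  # 131072 / 32768 = 4x extension
        "original_max_position_embeddings": 32768,
    },
    max_model_len=131072,
)

out = llm.generate(["Summarize this 100K-token transcript: ..."],
                   SamplingParams(max_tokens=512))
print(out[0].outputs[0].text)
```

One caveat worth knowing: static YaRN applies the same scaling to every request, including short ones, which can cost some short-context quality; a common hedge is to keep an unscaled replica for normal traffic and route only long inputs to the extended one.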