Phi-4 Reasoning 14B
Strengths
Punches above its weight on reasoning. Beats DeepSeek-R1-Distill-Llama-70B on AIME and GPQA while 5x smaller. Comparable to the full DeepSeek-R1 (671B) on AIME 2025. MIT license.
Weaknesses
English-only. 32K context. Not a generalist - refusal calibration and conversational warmth lag behind general-purpose fine-tunes.
Phi-4 Reasoning is the counterexample to "you need a big model to reason well." At 14B, it consistently matches or beats reasoning specialists 5x its size on math, logic, and code-reasoning benchmarks. Microsoft fine-tuned it specifically on reasoning traces, and it shows.
It is not a generalist. For chat or conversational products, pick something else. For reasoning-heavy backends - solver agents, math tutors, code-review assistants - it's hard to beat at this size.
When to pick it
- Product needs to reason through hard problems: math, logic, planning, debugging.
- MIT license; quantized, it fits on a 16GB GPU (see the sketch after this list).
- You're willing to layer a generalist for casual chat.
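A minimal sketch of what "fits on a 16GB GPU quantized" looks like in practice, assuming the Hugging Face checkpoint id `microsoft/Phi-4-reasoning` and a transformers + bitsandbytes stack; verify the exact id, VRAM headroom, and sampling settings for your setup.

```python
# Sketch: load Phi-4 Reasoning 4-bit quantized and run one reasoning prompt.
# The checkpoint id and memory figures below are assumptions, not guarantees.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "microsoft/Phi-4-reasoning"  # assumed checkpoint id

# 4-bit NF4 quantization puts the 14B weights at roughly 8-9 GB of VRAM,
# leaving room for the KV cache on a 16 GB GPU.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",
)

# Reasoning-tuned models emit a long chain of thought before the final answer,
# so budget generously for new tokens.
messages = [
    {"role": "user", "content": "How many positive integers n < 1000 are divisible by 7 but not by 5?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The same pattern drops into a solver-agent or tutor backend behind an API; casual chat traffic would be routed to a separate generalist model rather than through this one.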
When to skip it
- One model for both casual chat and reasoning: Qwen3-8B's hybrid thinking mode covers more ground.
- Context needs exceed 32K.