
Phi-4 Reasoning 14B

microsoft/Phi-4-reasoning

Tags: reasoning, coding, general-chat, gpu-16gb, gpu-24gb, gpu-48gb, apple-silicon-16gb, apple-silicon-32gb, cpu-32gb, datacenter
Parameters: 14.0B
Family: Phi
License: MIT
Context length: 32,768 tokens
Languages: English (en)
Modalities: text
Released: 2025-04-30
HF downloads (30d): 17,439

Strengths

Punches above its weight on reasoning. Beats DeepSeek-R1-Distill-Llama-70B on AIME and GPQA despite being 5x smaller. Comparable to the full DeepSeek-R1 (671B) on AIME 2025. MIT license.

Weaknesses

English-only. 32K context. Not a generalist: refusal calibration and conversational warmth lag behind general-purpose tunes.
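
The 32K window is easy to overrun once long reasoning traces are in play, so it is worth checking prompt length up front. Here is a minimal guard sketch using the Hugging Face transformers tokenizer; the 4,096-token reserve for the reply is an illustrative choice, not a figure from the model card:

    # Guard against the 32,768-token context window before sending a prompt.
    from transformers import AutoTokenizer

    MAX_CONTEXT = 32_768   # prompt + completion must fit in this window
    GEN_RESERVE = 4_096    # room left for the reply (illustrative choice)

    tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning")

    def fits_in_context(prompt: str) -> bool:
        """True if the prompt leaves GEN_RESERVE tokens for the answer."""
        return len(tokenizer.encode(prompt)) + GEN_RESERVE <= MAX_CONTEXT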

Phi-4 Reasoning is a proof by counterexample against "you need a big model to reason well." At 14B, it consistently matches or beats reasoning specialists 5x its size on math, logic, and code-reasoning benchmarks. Microsoft trained it specifically on reasoning traces, and it shows.

It is not a generalist. For chat-first or conversational products, pick something else. For reasoning-heavy backends (solver agents, math tutors, code-review assistants), it is hard to beat at this size.
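
Wired in as a backend, the raw output usually needs splitting into the reasoning trace and the final answer. A minimal sketch with transformers, assuming the model emits its chain of thought between <think> and </think> tags (verify the exact format against the model card):

    # Run one reasoning query and separate the trace from the answer.
    import re
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "microsoft/Phi-4-reasoning"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x?"}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(inputs, max_new_tokens=2048)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)

    # The <think> tag format is an assumption; adjust to the actual template.
    match = re.search(r"<think>(.*?)</think>(.*)", text, re.DOTALL)
    trace, answer = (match.group(1), match.group(2)) if match else ("", text)
    print(answer.strip())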

When to pick it

  • Product needs to reason through hard problems: math, logic, planning, debugging.
  • You want the MIT license, and it fits on a 16GB GPU quantized (see the loading sketch after this list).
  • You're willing to layer a generalist for casual chat.
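
For the 16GB-GPU bullet above, a minimal 4-bit loading sketch using transformers with bitsandbytes. The NF4 settings are a common community default, not an official recipe for this model:

    # Load the 14B model in 4-bit so it fits in roughly 16GB of VRAM.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",               # NormalFloat4 weights
        bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16
    )

    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-4-reasoning",
        quantization_config=quant_config,
        device_map="auto",  # place layers on the available GPU
    )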

When to skip it

  • You need one model for both casual chat and reasoning: Qwen3-8B's hybrid thinking mode covers more ground.
  • Your context needs exceed 32K tokens.