Best small LLMs for agents & function calling
Models that emit clean tool calls and recover from errors gracefully.
Most small models will call a function when asked. Far fewer call the right function with the right arguments under realistic schema sizes, ambiguous prompts, and mid-loop tool errors.
Agentic tuning is the second axis (after reasoning) where small models caught up materially in 2025-2026.
What we look for
- Function-call accuracy on Berkeley Function-Calling Leaderboard, weighted toward simple-tool subsets that match real APIs.
- Schema adherence - no invented fields, no truncated required ones.
- Multi-turn recovery when a tool call errors.
- Native vs. retrofitted - models trained from pretraining with tool tokens (Llama 3.1, Qwen3) outperform retrofits.
- JSON mode reliability - valid output, no truncation, no smart-quote contamination.
Ranked for production agents.
Picks
-
31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces ELO 2150 - leads open dense models in its size class for math and competitive programming. Bridges 'serious work' and 'fits on a 24-48GB GPU'.
-
MoE coder built for agentic workflows. 3B active / 80B total. >70% on SWE-Bench Verified with the SWE-Agent scaffold. 256K native context. Apache 2.0.
-
Apache 2.0 mid-size all-rounder. ~81% MMLU at 150 t/s, 3x faster than Llama 3.3 70B at similar quality. 128K context. Vision support added in 3.x line.
-
Strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, 131K with YaRN. Hybrid 'thinking' mode you can toggle per request.
-
The ecosystem baseline. Largest community of fine-tunes, quantizations, and inference-engine support of any open small model. Predictable in production.