Best small LLMs for agents & function calling

Models that emit clean tool calls and recover from errors gracefully.

Most small models will call a function when asked. Far fewer call the right function with the right arguments under realistic schema sizes, ambiguous prompts, and mid-loop tool errors.

Agentic tuning is the second axis (after reasoning) where small models caught up materially in 2025-2026.

What we look for

  • Function-call accuracy on the Berkeley Function-Calling Leaderboard (BFCL), weighted toward the simple-tool subsets that most resemble real APIs.
  • Schema adherence - no invented fields, no missing or truncated required fields.
  • Multi-turn recovery when a tool call errors.
  • Native vs. retrofitted - models trained with tool tokens from pretraining onward (Llama 3.1, Qwen3) outperform models that had tool calling retrofitted afterward.
  • JSON mode reliability - valid output, no truncation, no smart-quote contamination.
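
The schema-adherence, JSON-reliability, and recovery criteria above can be made concrete with a small checker. This is a minimal sketch, not the harness used for these rankings: the `get_weather` tool, the `name`/`arguments` call shape, and the `generate` callback (a stand-in for a model call) are all assumptions for illustration, and the "error" fed back here is a validation failure rather than a runtime tool error.

```python
import json

# Hypothetical tool definition, in the JSON-Schema style most
# function-calling APIs use. Not taken from any specific model or vendor.
GET_WEATHER = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string"},
        },
        "required": ["city"],
    },
}

SMART_QUOTES = "\u201c\u201d\u2018\u2019"


def check_tool_call(raw: str, tool: dict) -> list[str]:
    """Return a list of problems found in a raw, model-emitted tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Truncated output and smart-quote contamination both land here.
        if any(ch in raw for ch in SMART_QUOTES):
            return ["invalid JSON (smart quotes present)"]
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["tool call is not a JSON object"]

    problems = []
    if call.get("name") != tool["name"]:
        problems.append(f"wrong function: {call.get('name')}")
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return problems + ["arguments is not a JSON object"]

    params = tool["parameters"]
    for key in args:
        if key not in params["properties"]:
            problems.append(f"invented field: {key}")
    for key in params.get("required", []):
        if key not in args:
            problems.append(f"missing required field: {key}")
    return problems


def call_with_recovery(generate, tool: dict, max_turns: int = 3):
    """Feed validation problems back to the model until the call is clean.

    `generate` is a hypothetical callable: it receives the problems from the
    previous attempt (empty on the first turn) and returns a new raw call.
    Returns (parsed_call, turns_used), or (None, max_turns) on failure.
    """
    feedback: list[str] = []
    for turn in range(1, max_turns + 1):
        raw = generate(feedback)
        feedback = check_tool_call(raw, tool)
        if not feedback:
            return json.loads(raw), turn
    return None, max_turns
```

Counting how many turns `call_with_recovery` needs across a set of prompts gives a rough recovery score; a model that emits a clean call on the first turn scores 1, and one that never converges returns `None`.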

Ranked for production agents.

Picks

  1. Qwen3.6-27B · 27.0B · Apache 2.0

    Flagship-level coding in a 27B dense footprint. SWE-Bench Verified 77.2%, Terminal-Bench 2.0 59.3% (matches Claude Opus 4.5). 262K native context, multimodal, Apache 2.0.

  2. Gemma 4 31B · 31.0B · Apache 2.0

    31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces Elo 2150 - leads open dense models in its size class for math and competitive programming. Bridges 'serious work' and 'fits on a 24-48GB GPU'.

  3. Qwen3.5-9B · 9.0B · Apache 2.0

    Native multimodal at the 9B mark. 262K context (1M with YaRN). Apache 2.0. Early-fusion training rolls vision into the base model rather than bolting on a separate encoder.

  4. Qwen3-Coder-Next · 3.0B active · Apache 2.0

    MoE coder built for agentic workflows. 3B active / 80B total. >70% on SWE-Bench Verified with the SWE-Agent scaffold. 256K native context. Apache 2.0.

  5. Nemotron 3 Nano 30B-A3B · 3.5B active · NVIDIA Nemotron Open Model License

    Hybrid Mamba2-Transformer-MoE: 3.5B active out of 30B total, 256K default context (1M max). Trained from scratch on 25T tokens. Strong agentic and tool-calling post-training.

  6. gpt-oss-20b · 3.6B active · Apache 2.0

    OpenAI's small open-weight model. 21B total / 3.6B active MoE, runs in 16GB at MXFP4. Configurable reasoning effort (low/medium/high). Matches o3-mini on common reasoning evals.

  7. Mistral Small 3.2 24B · 24.0B · Apache 2.0

    Apache 2.0 mid-size all-rounder. ~81% MMLU and 3x faster than Llama 3.3 70B at similar quality. 128K context. Vision support added in the 3.x line.

  8. Qwen3-8B Instruct · 8.2B · Apache 2.0

    Strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, 131K with YaRN. Hybrid 'thinking' mode you can toggle per request.

  9. Llama 3.1 8B Instruct · 8.0B · Llama 3.1 Community License

    The ecosystem baseline. Largest community of fine-tunes, quantizations, and inference-engine support of any open small model. Predictable in production.
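
Several picks quote a hardware fit ("runs in 16GB at MXFP4", "fits on a 24-48GB GPU"). A rough way to sanity-check such claims is to estimate weight memory from total parameters and bits per weight. The sketch below is a back-of-envelope approximation under stated assumptions: it counts weights only (no KV cache, activations, or runtime overhead) and treats every tensor as quantized, whereas real checkpoints usually keep some tensors at higher precision. For MoE models it uses total parameters, since all experts must be resident even though only the active ones run per token.

```python
def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 10**9 bytes)."""
    return total_params_billion * bits_per_param / 8


# MXFP4 stores 4-bit values in blocks of 32 that share one 8-bit scale,
# which works out to 4.25 bits per weight.
MXFP4_BITS = (32 * 4 + 8) / 32

# Parameter counts are taken from the picks above; bit widths are
# illustrative choices, not recommendations.
examples = [
    ("Gemma 4 31B, 16-bit", 31.0, 16),
    ("Gemma 4 31B, 4-bit", 31.0, 4),
    ("gpt-oss-20b (21B total), MXFP4", 21.0, MXFP4_BITS),
    ("Qwen3-Coder-Next (80B total), 4-bit", 80.0, 4),
]

for name, total_b, bits in examples:
    print(f"{name}: ~{weight_memory_gb(total_b, bits):.1f} GB of weights")
```

Under these assumptions a 31B dense model needs about 15.5 GB of weights at 4 bits, and 21B total parameters at 4.25 bits come to roughly 11.2 GB, which leaves headroom for context inside a 16GB budget. The 3B-active MoE coder still needs about 40 GB at 4 bits because memory follows total parameters, not active ones.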