Best small LLMs for RAG / long context

Models that hold up when stuffed with retrieved context. Production's dominant small-LLM use case.

Most modern small models advertise 128K context. Many fall over past 32K. The interesting question isn't the marketed window, it's the effective one: how far in can the model actually retrieve before recall craters and answers become confident hallucinations.
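Measuring the effective window is straightforward to sketch: plant a "needle" fact at varying depths inside filler text and check whether the model can retrieve it. Below is a minimal, hedged probe builder; `FILLER`, `NEEDLE`, and the 4-chars-per-token estimate are illustrative assumptions, and the actual model call is left to your own inference stack.

```python
# Minimal needle-in-a-haystack probe builder. The needle, filler, and
# chars-per-token ratio are illustrative assumptions, not a benchmark spec.

FILLER = "The grass is green. The sky is blue. The sun is bright. "
NEEDLE = "The secret passphrase is 'blue-harvest-42'."
QUESTION = "What is the secret passphrase? Answer with the passphrase only."

def build_probe(context_tokens: int, depth: float, chars_per_token: int = 4) -> str:
    """Return a prompt of roughly `context_tokens` tokens with the needle
    inserted at fractional `depth` (0.0 = start of context, 1.0 = end)."""
    target_chars = context_tokens * chars_per_token
    haystack = (FILLER * (target_chars // len(FILLER) + 1))[:target_chars]
    cut = int(len(haystack) * depth)
    doc = haystack[:cut] + "\n" + NEEDLE + "\n" + haystack[cut:]
    return f"{doc}\n\nQuestion: {QUESTION}"

def score(response: str) -> bool:
    """Binary recall: did the model surface the needle verbatim?"""
    return "blue-harvest-42" in response

# Sweep insertion depths at a fixed context size; each prompt is one model call.
depths = [i / 10 for i in range(11)]
prompts = {d: build_probe(context_tokens=32_000, depth=d) for d in depths}
```

Sweeping both depth and context size produces the familiar recall heatmap; the context size where mid-depth recall collapses is the effective window, whatever the model card says.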

In RAG you also care about how the model handles conflicting passages, how willing it is to say "the documents don't say," and how cleanly it cites the source.
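All three behaviors are largely prompt-shaped, so it helps to make them explicit in the template. A hedged sketch, with illustrative wording rather than a benchmarked prompt:

```python
# Illustrative grounded-RAG prompt: numbered passages, a citation-only
# instruction, and an explicit "not in the documents" escape hatch.
# The exact wording here is an assumption, not a tested template.

def build_rag_prompt(question: str, passages: list[str]) -> str:
    docs = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the documents below. Cite sources as [n].\n"
        "If the documents conflict, say so and cite both sides.\n"
        "If the answer is not in the documents, reply exactly: "
        '"Not in the documents."\n\n'
        f"Documents:\n{docs}\n\nQuestion: {question}\nAnswer:"
    )
```

The fixed refusal string matters: it turns "is the model willing to abstain" into a substring check at eval time instead of a judgment call.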

What we look for

  • Effective context length via RULER and Needle-in-a-Haystack with distractors.
  • Citation discipline - honors "cite from the documents only" without smuggling in pretrained knowledge.
  • Refusal calibration on missing info - saying "not in the documents" is a feature.
  • Throughput at long context - GQA and efficient kernels matter at scale.
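The refusal-calibration bullet reduces to a small confusion matrix over a labeled eval set: answer when the gold answer is present, refuse when it is not. A toy scorer, assuming your harness produces the response strings and answerability labels:

```python
# Toy refusal-calibration scorer. Assumes responses and answerability
# labels come from your own eval run; the refusal string is illustrative.

REFUSAL = "not in the documents"

def refusal_stats(responses: list[str], has_answer: list[bool]) -> dict:
    """Tally correct answers/refusals vs. hallucinated answers on missing info."""
    tp = fn = fp = tn = 0  # fp = answered when it should have refused
    for resp, answerable in zip(responses, has_answer):
        refused = REFUSAL in resp.lower()
        if answerable:
            tp += not refused   # answered an answerable question
            fn += refused       # over-refused
        else:
            tn += refused       # correctly abstained
            fp += not refused   # confident hallucination
    n_missing = max(fp + tn, 1)
    return {"hallucination_rate": fp / n_missing, "correct_refusals": tn}
```

The `hallucination_rate` on unanswerable questions is the number that separates models which treat abstention as a feature from those that always produce something.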

Ranked for production RAG with 32K-256K context.

Picks

  1. #1 Gemma 4 31B 31.0B · Apache 2.0

    31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces Elo 2150 - leads open dense models in its size class for math and competitive programming. Bridges 'serious work' and 'fits on a 24-48GB GPU'.

  2. #2 Qwen3-Coder-Next 3.0B · Apache 2.0

    MoE coder built for agentic workflows. 3B active / 80B total. >70% on SWE-Bench Verified with the SWE-Agent scaffold. 256K native context. Apache 2.0.

  3. #3 Mistral Small 3.2 24B 24.0B · Apache 2.0

    Apache 2.0 mid-size all-rounder. ~81% MMLU at 150 t/s, 3x faster than Llama 3.3 70B at similar quality. 128K context. Vision support added in 3.x line.

  4. #4 Qwen3-8B Instruct 8.2B · Apache 2.0

    Strong all-rounder in the 7-8B class. Apache 2.0. 32K native context, 131K with YaRN. Hybrid 'thinking' mode you can toggle per request.

  5. #5 Qwen2.5-VL 7B Instruct 7.6B · Apache 2.0

    Vision-language specialist at 7B. Beats Llama 3.2-Vision 11B on MMMU (58.6), MathVista (68.2), DocVQA (95.7). Apache 2.0. Variable resolution and aspect ratio support, video frames.

  6. #6 Llama 3.2 3B Instruct 3.2B · Llama 3.2 Community

    Meta's mobile-targeted small model. Largest ecosystem in this size class. 128K context. Solid baseline for on-device assistants where ecosystem maturity matters.

  7. #7 Llama 3.1 8B Instruct 8.0B · Llama 3.1 Community

    The ecosystem baseline. Largest community of fine-tunes, quantizations, and inference-engine support of any open small model. Predictable in production.