Best small LLMs for on-device / mobile
Models small enough to run on a phone, laptop, or embedded device. The 1-4B effective-parameter tier is the sweet spot: capable enough to be useful, small enough to fit in consumer RAM.
On-device has different constraints than server inference: RAM matters more than throughput, battery life matters as much as latency, and running out of memory on a phone is a far worse failure mode than a server 503.
The 2026 wave of "effective parameter" models (Gemma 4 E2B/E4B, smaller Qwen3 variants) trades training complexity for footprints that fit consumer hardware. Native multimodal at this size is genuinely new.
What we look for
- Quantized quality at Q4_K_M / Q5_K_M, not bf16. If it collapses below int8, it's not on-device.
- Cold-start time on Apple Silicon and Snapdragon.
- Memory ceiling - total RAM at peak (weights plus KV cache plus runtime overhead), not just weight size; see the sketch after this list.
- License clarity for redistribution when shipping weights inside an app.
- Multimodal feasibility - does it usefully handle screenshots, photos, and short audio, or is it text-only?
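A quick sanity check on the memory ceiling before committing to a model: estimate the weights at the quantized bit-width, then add the KV cache at your real context length. A minimal sketch in Python, assuming Q4_K_M averages roughly 4.8 bits per weight; the layer, head, and overhead numbers below are illustrative assumptions, not any specific model's card:

```python
# Rough RAM-ceiling estimate: quantized weights + KV cache + overhead.
# All architecture numbers below are illustrative assumptions.

def weight_bytes(n_params: float, bits_per_weight: float = 4.8) -> float:
    # Q4_K_M averages ~4.8 bits/weight once block scales and the
    # higher-bit layers are included; pure 4-bit would be 4.0.
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values; fp16 cache entries by default.
    # n_kv_heads (not attention heads) is what matters under GQA.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

GiB = 1024 ** 3
weights = weight_bytes(4e9)                    # ~2.2 GiB at Q4_K_M
kv = kv_cache_bytes(32, 8, 128, ctx_len=8192)  # ~1.0 GiB at 8K context
overhead = 0.5 * GiB                           # activations, runtime, graph

print(f"weights   {weights / GiB:.2f} GiB")
print(f"kv cache  {kv / GiB:.2f} GiB")
print(f"ceiling  ~{(weights + kv + overhead) / GiB:.2f} GiB")
```

The takeaway: a roughly 2 GiB Q4 download can still need close to 4 GiB of RAM at an 8K context, which is why capping context length is the first knob to turn on a phone.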
Ranked for shipping inference into a mobile app or edge device.
Picks
- Gemma 4 E4B - Native multimodal (text, image, video, audio) at edge sizes. Apache 2.0. ~4B effective inference footprint built to preserve RAM and battery on consumer devices.
- Phi-4-mini - MIT license, 67% MMLU at 3.8B. Inherits the Phi reasoning lineage in a small footprint. 128K context and a 200K-token vocabulary for multilingual support. Function-calling support.
- Llama 3.2 3B - Meta's mobile-targeted small model. Largest ecosystem at this size class. 128K context. A solid baseline for on-device assistants where ecosystem maturity matters.
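To make the knobs concrete, here is a minimal sketch using llama-cpp-python, one common way to run GGUF quants on laptops and edge hardware. The model filename is a placeholder, not a real release; the pattern (mmap the weights, cap the context, match threads to performance cores) carries over to whichever pick you ship:

```python
from llama_cpp import Llama

# Hypothetical GGUF file; substitute the Q4_K_M quant of your pick.
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,       # cap the context to bound KV-cache RAM
    n_threads=4,      # match performance cores, not total core count
    use_mmap=True,    # page weights in lazily; helps cold start
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: battery 12%, 3 apps open."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Note that `n_ctx` directly trades capability for the memory ceiling estimated above, so set it from your actual prompt lengths rather than the model's advertised maximum.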