Best small LLMs for on-device / mobile
Models small enough to run on a phone, laptop, or embedded device. The 1-4B effective-parameter tier is the sweet spot: capable enough to be useful, small enough to fit in consumer RAM.
On-device has different constraints than server inference: RAM matters more than throughput, battery life matters as much as latency, and running out of memory on a phone is a far worse failure mode than a server 503.
The 2026 wave of "effective parameter" models (Gemma 4 E2B/E4B, smaller Qwen3 variants) trades training complexity for footprints that fit consumer hardware. Native multimodal at this size is genuinely new.
What we look for
- Quantized quality at Q4_K_M / Q5_K_M, not bf16. If it collapses below int8, it's not on-device.
- Cold-start time on Apple Silicon and Snapdragon.
- Memory ceiling - total RAM at peak (weights plus KV cache plus runtime overhead), not just weight size; see the sketch after this list.
- License clarity for redistribution when shipping weights inside an app.
- Multimodal feasibility - does it usefully handle screenshots, photos, and short audio, or is it text-only?
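A quick sanity check on the memory ceiling before committing to a model: estimate the weights at the quantized bit-width, then add the KV cache at your real context length. A minimal sketch in Python, assuming Q4_K_M averages roughly 4.8 bits per weight; the layer, head, and overhead numbers below are illustrative assumptions, not any specific model's card:

```python
# Rough RAM-ceiling estimate: quantized weights + KV cache + overhead.
# All architecture numbers below are illustrative assumptions.

def weight_bytes(n_params: float, bits_per_weight: float = 4.8) -> float:
    # Q4_K_M averages ~4.8 bits/weight once block scales and the
    # higher-bit layers are included; pure 4-bit would be 4.0.
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values; fp16 cache entries by default.
    # n_kv_heads (not attention heads) is what matters under GQA.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

GiB = 1024 ** 3
weights = weight_bytes(4e9)                    # ~2.2 GiB at Q4_K_M
kv = kv_cache_bytes(32, 8, 128, ctx_len=8192)  # ~1.0 GiB at 8K context
overhead = 0.5 * GiB                           # activations, runtime, graph

print(f"weights   {weights / GiB:.2f} GiB")
print(f"kv cache  {kv / GiB:.2f} GiB")
print(f"ceiling  ~{(weights + kv + overhead) / GiB:.2f} GiB")
```

The takeaway: a roughly 2 GiB Q4 download can still need close to 4 GiB of RAM at an 8K context, which is why capping context length is the first knob to turn on a phone.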
Ranked for shipping inference into a mobile app or edge device.
Picks
- Gemma 4 E4B - Native multimodal (text, image, video, audio) at edge sizes. Apache 2.0. ~4B effective inference footprint built to preserve RAM and battery on consumer devices.
- Phi-4-mini - MIT license, 67% MMLU at 3.8B. Inherits the Phi reasoning lineage in a small footprint. 128K context and a 200K-token vocabulary for multilingual support. Function-calling support.
- Llama 3.2 3B - Meta's mobile-targeted small model. Largest ecosystem at this size class. 128K context. A solid baseline for on-device assistants where ecosystem maturity matters.
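To make the knobs concrete, here is a minimal sketch using llama-cpp-python, one common way to run GGUF quants on laptops and edge hardware. The model filename is a placeholder, not a real release; the pattern (mmap the weights, cap the context, match threads to performance cores) carries over to whichever pick you ship:

```python
from llama_cpp import Llama

# Hypothetical GGUF file; substitute the Q4_K_M quant of your pick.
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=4096,       # cap the context to bound KV-cache RAM
    n_threads=4,      # match performance cores, not total core count
    use_mmap=True,    # page weights in lazily; helps cold start
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: battery 12%, 3 apps open."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Note that `n_ctx` directly trades capability for the memory ceiling estimated above, so set it from your actual prompt lengths rather than the model's advertised maximum.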