Best small LLMs for vision-language

Models that take images alongside text. Native multimodal pretraining is the 2026 default.

A year ago, "small open VLM" meant LLaVA-style: take a text LLM, bolt on a vision encoder, and hope the modalities align. The 2026 generation is mostly pretrained from scratch on multimodal tokens, which improves quality and changes how you fine-tune.

The use cases that actually want a small VLM are narrow but real: screenshots, document/form extraction, accessibility, simple visual QA. Anything heavy still wants a larger model.

What we look for

  • Native vs. adapter - native models generalize across image types more cleanly.
  • Resolution handling - fixed-input models break on wide screenshots and tall mobile UIs.
  • OCR-adjacent quality - receipts, forms, and screenshots are the bread and butter.
  • Modality breadth - text+image is table stakes; video and audio are still uneven.
  • Domain fine-tuning - medical, satellite, scientific imagery often needs adaptation.
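The resolution point deserves a concrete illustration. Below is a minimal sketch with hypothetical numbers (a 448x448 fixed input, a 28x28 patch size, a 1280-patch budget; real schemes vary by model) contrasting what a fixed-input model does to an ultrawide screenshot with what a variable-resolution tiler does:

```python
import math

def fixed_input_scale(w: int, h: int, side: int = 448):
    """Per-axis downscale factor a fixed-square-input model applies."""
    return side / w, side / h

def patch_grid(w: int, h: int, patch: int = 28, max_patches: int = 1280):
    """Patch grid a variable-resolution model uses: tile at native size,
    then shrink toward a patch budget while keeping the aspect ratio."""
    cols, rows = math.ceil(w / patch), math.ceil(h / patch)
    if cols * rows > max_patches:
        scale = (max_patches / (cols * rows)) ** 0.5
        cols, rows = max(1, int(cols * scale)), max(1, int(rows * scale))
    return cols, rows

# A 2560x1080 ultrawide screenshot: the fixed model squashes width ~5.7x
# but height only ~2.4x, distorting text; the patch grid preserves aspect.
sx, sy = fixed_input_scale(2560, 1080)   # (0.175, ~0.415)
cols, rows = patch_grid(2560, 1080)
```

The asymmetric squash is why fixed-input models garble dense UI text on wide captures: characters get compressed far more horizontally than vertically.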

Ranked for screenshot, document, and visual-QA workloads.

Picks

  1. #1 Gemma 4 31B 31.0B · Apache 2.0

    31B dense, Apache 2.0, 256K context, multimodal. AIME 2026 89.2%, Codeforces ELO 2150 - leads open dense models in its size class for math and competitive programming. Bridges 'serious work' and 'fits on a 24-48GB GPU'.

  2. #2 Gemma 4 E4B 4.0B · Apache 2.0

    Native multimodal (text, image, video, audio) at edge sizes. Apache 2.0. ~4B effective parameters at inference, built to preserve RAM and battery on consumer devices.

  3. #3 Mistral Small 3.2 24B 24.0B · Apache 2.0

    Apache 2.0 mid-size all-rounder. ~81% MMLU at 150 t/s, 3x faster than Llama 3.3 70B at similar quality. 128K context. Vision support was added in the 3.x line.

  4. #4 Qwen2.5-VL 7B Instruct 7.6B · Apache 2.0

    Vision-language specialist at 7B. Beats Llama 3.2-Vision 11B on MMMU (58.6), MathVista (68.2), DocVQA (95.7). Apache 2.0. Handles variable resolutions and aspect ratios natively, plus video frame input.
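For the document-extraction workloads this list is ranked for, a common deployment is one of these models behind an OpenAI-compatible server (e.g. vLLM). A minimal sketch of building the request payload, assuming Qwen2.5-VL 7B is being served; the model ID, endpoint, and prompt are illustrative:

```python
import base64

def vision_chat_payload(image_bytes: bytes, prompt: str,
                        model: str = "Qwen/Qwen2.5-VL-7B-Instruct") -> dict:
    """Build an OpenAI-compatible chat payload with an inline base64 image."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.0,  # extraction tasks want deterministic output
    }

# POST to the server's /v1/chat/completions endpoint, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions",
#                 json=vision_chat_payload(open("receipt.png", "rb").read(),
#                                          "Extract line items as JSON."))
```

The same payload shape works for any of the picks above served through an OpenAI-compatible route; only the `model` string changes.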