Models / Qwen
Qwen3.5-9B
Strengths
Native multimodal at the 9B mark. 262K-token native context (1M with YaRN). Apache 2.0 license. Early-fusion training rolls vision into the base model rather than bolting on a separate encoder.
Weaknesses
Hybrid Gated-Delta + sparse-MoE architecture is new enough that some inference stacks lag behind it. Tokenizer efficiency still favors CJK text over English.
Qwen3.5-9B is the small-model upgrade that quietly absorbed the VL line. Where Qwen3 needed a separate Qwen2.5-VL for vision, the 3.5 series trains text and image tokens jointly from scratch, so a single 9B checkpoint covers chat, code, RAG, multilingual use, and image/short-video understanding.
The numbers tracked elsewhere are eye-catching ("9B beats 120B on certain tasks"), but the practical takeaway is simpler: this is the new default in the 7-9B class, with a longer native context (262K) than anything else at this size and a permissive license.
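Because vision is trained into the base model, image and text turns share one chat payload. A minimal sketch, assuming the content-list message style used by recent Qwen-VL chat templates; the image path is a placeholder, and the exact field names for this checkpoint are an assumption:

```python
# Hypothetical multimodal chat payload: one user turn mixing an image and text.
# Field layout follows the Qwen-VL content-list convention; not confirmed for
# this specific checkpoint.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/chart.png"},  # placeholder path
            {"type": "text", "text": "Summarize the trend in this chart."},
        ],
    }
]
```

A processor's chat template would render this into interleaved image and text tokens before generation; the same structure extends to multiple images or short video frames per turn.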
When to pick it
- One Apache-2.0 model that handles chat, RAG, coding, and image understanding.
- Long-context workloads that want native 262K without YaRN scaling.
- Multilingual deployments - 200+ languages in the training mix.
When to skip it
- Your inference stack doesn't yet support the hybrid Gated-Delta + sparse-MoE architecture. Check llama.cpp / vLLM compatibility first.
- Vision quality is "useful," not "frontier" - reach for Gemma 4 31B or Qwen3.6-27B for serious VL workloads.