Models / Qwen
Qwen3-8B Instruct
Tags: general-chat, coding, reasoning, rag, agents, multilingual, extraction, gpu-8gb, gpu-16gb, gpu-24gb, gpu-48gb, apple-silicon-16gb, apple-silicon-32gb, cpu-16gb, cpu-32gb, datacenter
Strengths
Strong all-rounder in the 7-8B class. Apache 2.0 licensed. 32K native context, extensible to 131K with YaRN. Hybrid "thinking" mode you can toggle per request.
Weaknesses
Tokenizer is optimized for CJK, so English-only deployments pay more tokens per byte of text. Safety guardrails feel over-tuned in some domains.
Qwen3-8B is the small-model pick most teams reach for in 2026. It out-benches Llama 3.1 8B and Mistral 7B on essentially every public eval, with a clean Apache 2.0 license.
The hybrid "thinking" mode is the architectural shift worth knowing: at inference you can toggle deeper chain-of-thought per request, trading latency for accuracy. Genuinely useful for agentic flows that occasionally need to plan.
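A minimal sketch of the per-request toggle, assuming an OpenAI-compatible server (such as vLLM) that forwards `chat_template_kwargs` to the model's chat template; the model name is the published checkpoint, but the helper function and prompts are illustrative, not part of any official API.

```python
def build_request(prompt: str, think: bool) -> dict:
    """Build a chat-completion payload with thinking toggled per request."""
    return {
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": prompt}],
        # Qwen3's chat template reads enable_thinking to decide whether the
        # model emits a <think>...</think> block before its final answer.
        "chat_template_kwargs": {"enable_thinking": think},
    }

# Cheap, low-latency path for routine turns; deeper reasoning only when needed.
fast = build_request("Summarize this ticket.", think=False)
slow = build_request("Plan a three-step database migration.", think=True)
```

The practical pattern is to default to `think=False` and flip it on only for the agent steps that plan or decompose, which keeps median latency close to a plain instruct model.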
When to pick it
- One Apache-2.0 small model that handles chat, code, RAG, and light reasoning.
- Multilingual users (especially CJK).
- Long context (131K via YaRN) without Llama's licensing complexity.
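The 131K figure comes from stretching the 32K native window roughly 4x with YaRN. A sketch of the `rope_scaling` block you would merge into the model's `config.json`, following the recipe Qwen publishes for its models; treat the exact field names and factor as an assumption to verify against your serving stack's docs:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```

Note that static YaRN scaling applies even to short prompts, so it is usually best enabled only on deployments that actually serve long-context traffic.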
When to skip it
- Your inference stack doesn't support the hybrid thinking-mode toggle.
- You serve English only and tokens-per-dollar matters: Llama's tokenizer is more efficient on English text.