Fine-tuning Qwen3-8B
Base model on Hugging Face: Qwen/Qwen3-8B-Base
The strongest small base for new fine-tuning projects in 2026. Apache 2.0 removes legal friction, the base model's quality means you start from a higher ceiling than Llama 3.1, and the hybrid-thinking architecture is uniquely fine-tunable.
Recommended training stacks
- Axolotl - tested Qwen3 configs upstream; supports the hybrid thinking-mode toggle in training.
- Unsloth - Qwen3 LoRA support landed in late 2025; matches Llama LoRA throughput.
- HuggingFace TRL - the tokenizer and chat template load straight from the model card; no custom setup needed.
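A minimal Axolotl LoRA config for Qwen3-8B-Base might look like the sketch below. The dataset path and hyperparameters are illustrative placeholders, not tuned values; check the Axolotl repo's upstream Qwen3 examples for current defaults.

```yaml
# Sketch of an Axolotl LoRA config for Qwen3-8B-Base (values are placeholders)
base_model: Qwen/Qwen3-8B-Base

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: ./data/train.jsonl   # hypothetical dataset path
    type: chat_template

sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 2
learning_rate: 2.0e-4
optimizer: adamw_torch
lr_scheduler: cosine

bf16: true
flash_attention: true
output_dir: ./outputs/qwen3-8b-lora
```

Launch would follow Axolotl's usual `accelerate launch -m axolotl.cli.train config.yml` flow.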
Watch out for
- Tokenizer overhead for English - 151K vocab favors CJK; English-only data produces ~5-10% more tokens than Llama. Plan dataset budgets accordingly.
- Thinking-mode prompts - if training data lacks <think>...</think> traces, the fine-tune may collapse the thinking ability. Either include traces or disable thinking mode during training.
- Heavier safety tuning than Llama - it is harder to fine-tune away refusals when your domain has a legitimate need for sensitive content (medical, security research). Plan evals accordingly.
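One way to keep thinking traces in the data is to render each assistant turn with an explicit <think> block. A minimal sketch follows; the helper name is hypothetical, and while the ChatML-style <|im_start|>/<|im_end|> markers match Qwen3's chat format, in practice you should let `tokenizer.apply_chat_template` produce the exact template from the model card.

```python
def format_sample(user_msg: str, reasoning: str, answer: str) -> str:
    """Render one SFT sample with the reasoning kept in a <think> block.

    Hypothetical helper: the assistant turn carries the trace inside
    <think>...</think> followed by the final answer, so fine-tuning
    does not collapse the model's thinking ability.
    """
    assistant = f"<think>\n{reasoning}\n</think>\n\n{answer}"
    return (
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        f"<|im_start|>assistant\n{assistant}<|im_end|>\n"
    )

sample = format_sample(
    "What is 17 * 6?",
    "17 * 6 = (17 * 6) = 102.",
    "102",
)
print(sample)
```

If your data has no traces at all, the safer alternative from the list above is to train with thinking mode disabled rather than mixing trace-free answers into the thinking format.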