Fine-tuning Gemma 4 E4B
Base on HF: google/gemma-4-e4b-pt
Gemma 4 E4B is the right base when you need a small fine-tuned model that handles multimodal input. The ~4B effective footprint means LoRA adapters fit on consumer GPUs with room to spare, and Apache 2.0 removes the friction earlier Gemma releases had.
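A back-of-envelope estimate shows why the adapters stay small. Every architecture number below (hidden size, layer count, targeted projections) is an illustrative assumption for the sketch, not the published Gemma 4 E4B config:

```python
# Back-of-envelope LoRA adapter sizing. All architecture numbers here are
# illustrative assumptions, NOT the published Gemma 4 E4B configuration.
HIDDEN = 2048          # assumed hidden size
N_LAYERS = 32          # assumed transformer layer count
RANK = 16              # LoRA rank
TARGETS_PER_LAYER = 4  # e.g. q/k/v/o projections

def lora_params(hidden: int, layers: int, rank: int, targets: int) -> int:
    """Each targeted d x d matrix gains two low-rank factors:
    A (rank x d) and B (d x rank), i.e. 2 * rank * d parameters."""
    per_matrix = 2 * rank * hidden
    return layers * targets * per_matrix

params = lora_params(HIDDEN, N_LAYERS, RANK, TARGETS_PER_LAYER)
mb_bf16 = params * 2 / 1e6  # 2 bytes per bf16 weight
print(f"adapter params: {params:,} (~{mb_bf16:.0f} MB in bf16)")
# prints: adapter params: 8,388,608 (~17 MB in bf16)
```

Even at rank 16 across four projections per layer, the adapter is tens of megabytes, which is why it fits alongside the base weights on a consumer GPU.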
Recommended training stacks
- HuggingFace TRL with PEFT - canonical multimodal-aware path. Use gemma-4-e4b-pt as the base for vision/audio fine-tunes.
- Unsloth - text-only Gemma 4 LoRA tested upstream; vision pathways still maturing.
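For the TRL path, vision samples are usually expressed in the HF chat-message convention sketched below. The exact schema here is an assumption to verify against the processor that ships with gemma-4-e4b-pt; the point is that content is a list of typed parts and the processor, not you, expands them into model tokens:

```python
# Minimal sketch of the conversational format TRL-style SFT pipelines
# typically expect for vision fine-tunes. Field names follow the common
# HF chat-template convention; verify against the model's own processor.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                # Placeholder part; the pixel data usually rides in a
                # parallel "images" field of the dataset row.
                {"type": "image"},
                {"type": "text", "text": "What is shown in this photo?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A ginger cat on a windowsill."}],
        },
    ],
}

def interleaving_ok(messages) -> bool:
    """Cheap structural check: every content part carries a known type."""
    allowed = {"image", "audio", "text"}
    return all(
        part["type"] in allowed
        for msg in messages
        for part in msg["content"]
    )

print(interleaving_ok(example["messages"]))  # prints: True
```

A structural check like this is cheap insurance before a long training run, since malformed interleaving tends to fail silently rather than crash.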
Watch out for
- Multimodal token interleaving - image, audio, and text tokens follow a specific pattern. Deviating produces silent quality loss.
- Effective-parameter accounting - LoRA against the inference profile behaves like training a ~4B model, but full SFT touches the entire raw parameter set and is heavier than the name suggests.
- Always eval on the quantized format you'll ship. Quantization-induced regressions on small multimodal models can be sharper than on text-only models.