Fine-tuning Gemma 4 E4B
Base on HF: google/gemma-4-e4b-pt
Gemma 4 E4B is the right base when you need a small fine-tuned model that handles multimodal input. The ~4B effective footprint means LoRA adapters fit on consumer GPUs with room to spare, and Apache 2.0 removes the friction earlier Gemma releases had.
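A back-of-envelope estimate shows why the adapters stay small. Every architecture number below (hidden size, layer count, targeted projections) is an illustrative assumption for the sketch, not the published Gemma 4 E4B config:

```python
# Back-of-envelope LoRA adapter sizing. All architecture numbers here are
# illustrative assumptions, NOT the published Gemma 4 E4B configuration.
HIDDEN = 2048          # assumed hidden size
N_LAYERS = 32          # assumed transformer layer count
RANK = 16              # LoRA rank
TARGETS_PER_LAYER = 4  # e.g. q/k/v/o projections

def lora_params(hidden: int, layers: int, rank: int, targets: int) -> int:
    """Each targeted d x d matrix gains two low-rank factors:
    A (rank x d) and B (d x rank), i.e. 2 * rank * d parameters."""
    per_matrix = 2 * rank * hidden
    return layers * targets * per_matrix

params = lora_params(HIDDEN, N_LAYERS, RANK, TARGETS_PER_LAYER)
mb_bf16 = params * 2 / 1e6  # 2 bytes per bf16 weight
print(f"adapter params: {params:,} (~{mb_bf16:.0f} MB in bf16)")
# prints: adapter params: 8,388,608 (~17 MB in bf16)
```

Even at rank 16 across four projections per layer, the adapter is tens of megabytes, which is why it fits alongside the base weights on a consumer GPU.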
Recommended training stacks
- HuggingFace TRL with PEFT - canonical multimodal-aware path. Use gemma-4-e4b-pt as the base for vision/audio fine-tunes.
- Unsloth - text-only Gemma 4 LoRA tested upstream; vision pathways still maturing.
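For the TRL path, vision samples are usually expressed in the HF chat-message convention sketched below. The exact schema here is an assumption to verify against the processor that ships with gemma-4-e4b-pt; the point is that content is a list of typed parts and the processor, not you, expands them into model tokens:

```python
# Minimal sketch of the conversational format TRL-style SFT pipelines
# typically expect for vision fine-tunes. Field names follow the common
# HF chat-template convention; verify against the model's own processor.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                # Placeholder part; the pixel data usually rides in a
                # parallel "images" field of the dataset row.
                {"type": "image"},
                {"type": "text", "text": "What is shown in this photo?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A ginger cat on a windowsill."}],
        },
    ],
}

def interleaving_ok(messages) -> bool:
    """Cheap structural check: every content part carries a known type."""
    allowed = {"image", "audio", "text"}
    return all(
        part["type"] in allowed
        for msg in messages
        for part in msg["content"]
    )

print(interleaving_ok(example["messages"]))  # prints: True
```

A structural check like this is cheap insurance before a long training run, since malformed interleaving tends to fail silently rather than crash.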
Watch out for
- Multimodal token interleaving - image, audio, and text tokens follow a specific pattern. Deviating produces silent quality loss.
- Effective-parameter accounting - LoRA against the inference profile behaves like training a ~4B model, but full SFT touches the entire raw parameter set and is heavier than the name suggests.
- Always eval on the quantized format you'll ship. Quantization-induced regressions on small multimodal models can be sharper than on text-only models.