Nemotron 3 Nano 30B-A3B
Strengths
Hybrid Mamba-2 / Transformer / MoE: 3.5B active out of 30B total, 256K default context (1M max). Trained from scratch on 25T tokens. Strong agentic and tool-calling post-training.
Weaknesses
NVIDIA Nemotron Open Model License (permissive, but not Apache / MIT - downstream redistribution must carry the license forward). Hybrid-architecture support is uneven across inference stacks.
Nemotron 3 Nano is NVIDIA's small open-weight entry in the v3 family: 30B total, 3.5B active, hybrid Mamba-2 / Transformer / MoE architecture trained from scratch. The architecture is the editorial story - 23 Mamba-2 layers and 23 MoE layers interleaved with six attention layers, a mix that trades some quality-per-parameter for substantially better long-context throughput than a pure transformer.
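As a rough illustration of that layer mix, here is a minimal sketch that builds a 52-layer schedule matching the 23 / 23 / 6 split above. The attention-layer positions and the strict Mamba-2 / MoE alternation are assumptions made for illustration, not the published ordering.

```python
# Sketch of a 52-layer hybrid schedule: 23 Mamba-2, 23 MoE, 6 attention blocks.
# Only the per-type counts come from the description above; the attention
# positions and the strict alternation are assumptions for illustration.
from collections import Counter

ATTENTION_SLOTS = {3, 11, 19, 27, 35, 43}  # assumed positions of the 6 attention layers

def build_layer_schedule(total_layers: int = 52) -> list[str]:
    schedule = []
    use_mamba = True  # alternate Mamba-2 and MoE blocks between attention layers
    for i in range(total_layers):
        if i in ATTENTION_SLOTS:
            schedule.append("attention")
        else:
            schedule.append("mamba2" if use_mamba else "moe")
            use_mamba = not use_mamba
    return schedule

if __name__ == "__main__":
    print(Counter(build_layer_schedule()))
    # Counter({'mamba2': 23, 'moe': 23, 'attention': 6})
```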
Post-training emphasized agentic flows: tool calling, instruction following, code, and math, with the 3.5B active count keeping latency low even on consumer-grade GPUs.
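A quick way to exercise the tool-calling post-training is through any OpenAI-compatible endpoint. The base URL, model identifier, and tool schema below are placeholders, not values published with the model.

```python
# Hypothetical tool-calling smoke test against an OpenAI-compatible server.
# The base_url, model name, and tool schema are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

resp = client.chat.completions.create(
    model="nemotron-3-nano-30b-a3b",  # placeholder identifier
    messages=[{"role": "user", "content": "What's the weather in Oslo right now?"}],
    tools=[weather_tool],
)
print(resp.choices[0].message.tool_calls)
```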
When to pick it
- Long-context agent loops where Mamba-2's linear-time, constant-memory sequence handling pays for itself.
- 16-24GB hardware where MoE active-param speed matters more than total VRAM (Q4_K_XL fits 16GB; partial expert offload extends further); see the loading sketch after this list.
- You can accept the NVIDIA Nemotron Open Model License terms.
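For the 16-24GB case, a minimal loading sketch with llama-cpp-python, assuming a GGUF quant and a llama.cpp build that already supports the hybrid architecture; the file name is a placeholder and the context size is deliberately conservative.

```python
# Minimal sizing sketch for a 16-24GB GPU via llama-cpp-python. Assumes a
# llama.cpp build with Mamba-2 hybrid support; the GGUF file name is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="nemotron-3-nano-30b-a3b-Q4_K_XL.gguf",  # placeholder file name
    n_gpu_layers=-1,  # all layers on GPU; use a smaller positive count to spill some to CPU RAM
    n_ctx=32768,      # conservative; raise toward the 256K default if memory allows
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the trade-offs of MoE expert offload."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```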
When to skip it
- Strict Apache / MIT requirement - reach for Qwen3.5-9B or Gemma 4 instead.
- Vision or multimodal tasks - Nano is text-only; pick the Nemotron 3 Nano Omni variant or a different family.
- Your inference stack hasn't shipped Mamba-2 hybrid support yet.