NVIDIA dropped Nemotron 3 Super at GTC — a free open-weight 120B model that scored 60.47% on SWE-Bench Verified, the highest open-weight coding score ever recorded. We break down what it means for developers.
NVIDIA Nemotron 3 Super is a 120-billion-parameter hybrid model released at GTC on March 11, 2026. Despite its 120B parameter count, it activates only 12 billion parameters per token, which means you get the reasoning depth of a 120B model at the compute cost of something far smaller. The architecture combines three approaches: Mamba-2 state space layers, Transformer attention layers, and a new mixture-of-experts design NVIDIA calls LatentMoE. The result is extraordinary throughput: 2.2x higher inference speed than GPT-OSS-120B and up to 7.5x faster than Qwen3.5-122B on comparable hardware. The model is available for free download on Hugging Face under the NVIDIA Open Model License Agreement and runs on 64GB of RAM, making self-hosting genuinely practical for enterprise teams.
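The sparse-activation math (12B active out of 120B total per token) is the standard mixture-of-experts trick: a gating function picks a small number of experts per token, and only those experts execute. Here is a minimal top-k gating sketch in plain Python. The expert count and logits are made up, and NVIDIA has not published LatentMoE internals at this level of detail, so treat this as an illustration of the general technique, not NVIDIA's design:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token_logits, k=2):
    """Pick the top-k experts for one token; only those experts run."""
    probs = softmax(token_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize so the selected experts' weights sum to 1.0.
    return [(i, probs[i] / norm) for i in top]

# 8 hypothetical experts; only 2 activate per token, so per-token compute
# scales with the active parameter count rather than the total.
gates = route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9], k=2)
print(gates)  # two (expert_index, weight) pairs whose weights sum to 1.0
```

Because only k experts run per token, a model can carry 10x more total parameters than it spends compute on, which is how a 120B model achieves near-12B inference cost.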
On SWE-Bench Verified, the benchmark that measures real-world software engineering capability, Nemotron 3 Super scored 60.47%, the highest score ever recorded for an open-weight model. For context, Claude Opus 4.6 and GPT-5.3 Codex sit around 80% on the same benchmark, meaning there is a real 20-point gap on the most complex tasks. But the comparison that matters for most developers is not against the frontier models; it is against what you were previously paying for. Nemotron 3 Super significantly outperforms GPT-OSS-120B's 41.90%. The 1-million-token context window holds 91.75% accuracy at maximum length on the RULER benchmark versus GPT-OSS-120B's 22.30%, a dramatic difference for long-context agentic workflows. It is already deployed in production at Perplexity, CodeRabbit, Factory, Greptile, Palantir, Cadence, and Siemens.
Nemotron 3 Super is not a replacement for Claude Opus 4.6 or GPT-5.4 on the most complex tasks: the 20-point SWE-Bench gap is real and shows up consistently on multi-file refactors with intricate dependencies. Where Nemotron wins decisively is on cost and data privacy. For teams running high-volume inference where Claude API costs would be prohibitive, self-hosted Nemotron eliminates API costs entirely. For regulated industries or companies with strict data sovereignty requirements, self-hosting Nemotron means code never leaves your infrastructure. The benchmark sweet spot: Nemotron handles the 80% of coding tasks where its 60% SWE-Bench score is good enough, and you route only the most complex tasks to Claude or GPT-5.4.
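The routing strategy described above can be sketched in a few lines. The complexity heuristic and threshold here are illustrative assumptions (no published recipe exists for this split), but the shape is what teams typically build: score each task, keep the cheap majority local, escalate the rest:

```python
# Hypothetical task router: routine work goes to self-hosted Nemotron,
# only the hardest tasks escalate to a frontier API model.

def score_complexity(task):
    """Crude heuristic: more files touched and deeper dependency chains
    correlate with the multi-file refactors where frontier models pull ahead."""
    return task["files_touched"] * 2 + task["dependency_depth"]

def pick_model(task, threshold=10):
    # Threshold is an assumption; tune it against your own failure rate.
    if score_complexity(task) > threshold:
        return "frontier-api"        # e.g. Claude or GPT-5.4
    return "nemotron-3-super-local"  # self-hosted, zero marginal API cost

print(pick_model({"files_touched": 1, "dependency_depth": 2}))  # nemotron-3-super-local
print(pick_model({"files_touched": 6, "dependency_depth": 4}))  # frontier-api
```

In practice you would also escalate on failure: if the local model's patch does not pass tests, retry the same task on the frontier model before involving a human.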
Hardware requirements: 64GB RAM minimum for inference; NVIDIA recommends 80GB of GPU VRAM for optimal performance using NVFP4 precision. Practical deployment uses vLLM, llama.cpp, Ollama, or Together infrastructure, all of which have immediate community support for Nemotron 3 Super. The full training recipe, including 25 trillion tokens of pre-training data with a June 2025 cutoff, is publicly released, making fine-tuning and continued training possible in ways closed models cannot match. For teams already running local LLM infrastructure, Nemotron slots in as a direct upgrade over previous Mistral or Llama deployments at significantly higher capability.
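Once a vLLM server is running, queries go through its OpenAI-compatible endpoint. A minimal sketch using only the standard library; the model ID and endpoint are assumptions (check the actual Hugging Face repo name and your server config before deploying):

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # vLLM's default port
MODEL_ID = "nvidia/Nemotron-3-Super"  # hypothetical repo name; verify on Hugging Face

def build_request(prompt, max_tokens=512):
    """Assemble an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def send(req):
    """POST the request; requires a vLLM server actually running locally."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with a server running):
#   req = build_request("Refactor this function to remove the global state.")
#   print(send(req))
```

Because the endpoint speaks the OpenAI wire format, existing tooling pointed at a commercial API can usually be redirected to the self-hosted model by changing only the base URL and model ID.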
New honest reviews every week. Zero sponsorships. Zero fluff.
Subscribe Free →