NVIDIA's 4-Bit Pretraining Breakthrough: NVFP4 and the 12B Hybrid Model
Introduction
Training large language models (LLMs) at scale often relies on FP8 precision, but pushing down to 4-bit floating point has been a formidable challenge: the narrower format compresses dynamic range, and quantization errors compound over trillions of training tokens. NVIDIA's latest research tackles this head-on with a pretraining methodology built around NVFP4, a 4-bit microscaling format natively supported by Blackwell Tensor Cores. To validate the approach, the researchers trained a 12-billion-parameter hybrid Mamba-Transformer on 10 trillion tokens, the longest publicly documented 4-bit training run to date. The resulting model achieves 62.58% on 5-shot MMLU-Pro, nearly matching the FP8 baseline's 62.62%, and NVFP4 training support is already integrated into NVIDIA's Transformer Engine.

What Is NVFP4?
Understanding NVFP4 requires a quick look at microscaling formats. In standard MX (microscaling) formats, a block of low-precision elements shares a single scale factor that maps the block back to a wider numerical range during matrix multiplication. MXFP4, for instance, uses 32-element blocks where each element is stored as E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit), which encodes just eight magnitudes between 0 and 6, spaced more finely near zero. Block scale factors in MXFP4 are stored as UE8M0 (an unsigned 8-bit exponent with no mantissa), which restricts them to powers of two.
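To make the E2M1 element format concrete, here is a minimal decoding sketch. It assumes the usual microscaling conventions (sign in the top bit, an exponent bias of 1, and a single subnormal step of 0.5); the function name `decode_e2m1` is illustrative and not part of any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading 1, so only 0 and 0.5 are representable
        return sign * man * 0.5
    # Normal: implicit leading 1, exponent bias assumed to be 1
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

# The eight non-negative codes cover every representable magnitude
magnitudes = [decode_e2m1(c) for c in range(8)]
print(magnitudes)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The gap between neighboring values grows from 0.5 near zero to 2 between 4 and 6, which is why the choice of block scale matters so much for accuracy.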
NVFP4 brings three key innovations:
- Smaller block size: The block size shrinks from 32 to 16 elements, narrowing the dynamic range each scale must cover.
- E4M3 block scales: Instead of UE8M0, block scale factors are stored in E4M3, trading exponent range for mantissa precision. This allows the per-block absolute maximum (amax) to be mapped much closer to the FP4 maximum representable value.
- Two-level scaling: An additional FP32 per-tensor scale remaps values so that the E4M3 block scales remain in range. As a result, at least one value in each 16-element block (6.25%), the per-block amax, is represented at near-FP8 precision, while the rest stay in FP4.
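The three ingredients above can be simulated in plain Python. This is a rough sketch under stated assumptions, not NVIDIA's kernel: the per-tensor scale is chosen here so that the largest block scale lands on the E4M3 maximum (448), the helper names are hypothetical, and values are kept as ordinary floats rather than packed bits.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes
FP4_MAX, E4M3_MAX = 6.0, 448.0

def round_to_fp4(x):
    # Nearest E2M1 magnitude, saturating at 6 (ties go to the smaller magnitude here)
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return math.copysign(mag, x)

def round_to_e4m3(x):
    # Nearest E4M3 value for a non-negative input, saturating at 448
    if x <= 0.0:
        return 0.0
    exp = max(math.frexp(x)[1] - 1, -6)  # floor(log2(x)), clamped at the minimum normal exponent
    step = 2.0 ** (exp - 3)              # 3 mantissa bits give 8 steps per binade
    return min(round(x / step) * step, E4M3_MAX)

def quantize_nvfp4(values, block=16):
    tensor_amax = max(abs(v) for v in values)
    # Assumed choice of FP32 per-tensor scale: keeps every block scale within E4M3 range
    global_scale = tensor_amax / (FP4_MAX * E4M3_MAX) or 1.0
    blocks = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        block_amax = max(abs(v) for v in chunk)
        # The block scale maps the block amax close to the FP4 maximum of 6
        block_scale = round_to_e4m3(block_amax / FP4_MAX / global_scale)
        denom = block_scale * global_scale
        codes = [round_to_fp4(v / denom) if denom else 0.0 for v in chunk]
        blocks.append((block_scale, codes))
    return global_scale, blocks

def dequantize_nvfp4(global_scale, blocks):
    return [c * s * global_scale for s, codes in blocks for c in codes]

values = [math.sin(i) * (1 + i) for i in range(32)]
global_scale, blocks = quantize_nvfp4(values)
restored = dequantize_nvfp4(global_scale, blocks)
print(max(abs(a - b) for a, b in zip(values, restored)))
```

Because each block scale carries three mantissa bits, the block amax maps to (or very near) the FP4 maximum of 6, whereas a power-of-two scale could leave it a factor of up to two away.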
The payoff shows up in hardware throughput: on NVIDIA Blackwell, FP4 GEMMs run at 4× BF16 throughput on GB200 and 6× on GB300, which is roughly 2× and 3× the FP8 rate. The memory footprint of GEMM operands is approximately halved compared to FP8.
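The "approximately halved" figure can be checked with a little arithmetic that amortizes each 8-bit block scale over its block. The sketch below ignores the single FP32 per-tensor scale and any scale overhead on the FP8 side, both of which are simplifying assumptions.

```python
def bits_per_element(element_bits, scale_bits, block_size):
    """Average storage per element, amortizing one scale factor over each block."""
    return element_bits + scale_bits / block_size

nvfp4 = bits_per_element(4, 8, 16)  # 16-element blocks, E4M3 scale
mxfp4 = bits_per_element(4, 8, 32)  # 32-element blocks, UE8M0 scale
fp8 = 8.0                           # assumption: FP8 scale overhead ignored

print(nvfp4, mxfp4, nvfp4 / fp8)  # 4.5 4.25 0.5625
```

NVFP4's smaller blocks cost a quarter of a bit per element more than MXFP4, and the total still comes to a little over half the FP8 footprint.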
What Gets Quantized—and What Doesn't
A crucial aspect of the methodology is selective quantization. Only the GEMMs inside linear (fully connected) layers run in NVFP4: forward propagation (Fprop), activation gradient (Dgrad), and weight gradient (Wgrad). Other components remain in higher precision:
- Embeddings and the output projection head stay in BF16 or FP32.
- Normalization layers and non-linearities are also kept in BF16/FP32.
- All attention components—including softmax and the query-key and attention score-value batched GEMMs—remain in BF16 or FP32.
- Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states are stored in FP32.
- Tensor parallel reductions are performed in BF16.
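One way to picture the rules above is as a small lookup policy. The sketch below is purely illustrative: the names `Precision`, `compute_precision`, and `STORAGE_PRECISION`, and the module and op labels, are hypothetical and do not correspond to any Transformer Engine API. Where the text allows BF16 or FP32, BF16 is picked arbitrarily.

```python
from enum import Enum

class Precision(Enum):
    NVFP4 = "nvfp4"
    BF16 = "bf16"
    FP32 = "fp32"

# The three GEMMs of a linear layer (labels are illustrative)
LINEAR_GEMMS = {"fprop", "dgrad", "wgrad"}

def compute_precision(module_kind: str, op: str) -> Precision:
    """Return the compute precision for an operation under the policy described above."""
    if module_kind == "linear" and op in LINEAR_GEMMS:
        return Precision.NVFP4
    # Embeddings, output head, normalization, non-linearities, and attention
    # stay in BF16 or FP32; BF16 is chosen here for the sketch.
    return Precision.BF16

# Storage and communication precisions listed in the text
STORAGE_PRECISION = {
    "master_weights": Precision.FP32,
    "accumulated_weight_grads": Precision.FP32,
    "optimizer_states": Precision.FP32,
    "tensor_parallel_reductions": Precision.BF16,
}

print(compute_precision("linear", "wgrad").value)      # nvfp4
print(compute_precision("attention", "softmax").value)  # bf16
```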
The Training Methodology
Naively quantizing every linear-layer GEMM to NVFP4 with default settings (e.g., 1×16 block scaling everywhere, round-to-nearest-even on every tensor, no transforms) causes training to diverge early. NVIDIA's research describes a four-part training methodology that overcomes this instability and preserves gradient fidelity over long token horizons. The full details are available in the paper.

Key aspects of the methodology:
- Selective high-precision layers: a small set of numerically sensitive linear layers stays in higher precision instead of NVFP4.
- Random Hadamard transforms: applied to GEMM inputs to bound block-level outliers, so values fit FP4's narrow range more evenly.
- Two-dimensional block scaling: a 2D quantization scheme keeps quantized representations consistent between the forward and backward passes.
- Stochastic rounding: used in place of round-to-nearest-even where unbiased gradient estimates are needed.
With this approach, the 12B model converges stably across 10 trillion tokens and closely tracks the FP8 baseline.
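The rounding scheme matters because round-to-nearest is biased for values that sit between grid points: a small gradient can round to zero every single time. Stochastic rounding, which the paper reports using for gradients, rounds up or down with probability proportional to proximity, so the result is correct on average. The sketch below illustrates this on the unscaled E2M1 grid; the function names are hypothetical and block scaling is omitted for brevity.

```python
import random

_MAGS = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted([-m for m in _MAGS] + [0.0] + _MAGS)

def round_nearest(x):
    # Deterministic: always picks the closest grid point
    return min(FP4_GRID, key=lambda g: abs(g - x))

def round_stochastic(x, rng):
    # Clamp to the representable range, then round up with probability
    # equal to the fractional position between the two neighboring grid points
    x = max(min(x, 6.0), -6.0)
    lo = max(g for g in FP4_GRID if g <= x)
    hi = min(g for g in FP4_GRID if g >= x)
    if lo == hi:
        return lo
    p_up = (x - lo) / (hi - lo)
    return hi if rng.random() < p_up else lo

rng = random.Random(0)  # fixed seed so the sketch is reproducible
x, n = 0.2, 20000
mean_sr = sum(round_stochastic(x, rng) for _ in range(n)) / n
print(round_nearest(x))  # 0.0 on every call: the value is always lost
print(mean_sr)           # close to 0.2 on average
```

Each individual stochastic result is noisier than round-to-nearest, which is one reason to reserve it for tensors where bias is more harmful than variance.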
Results and Implications
The trained model achieves 62.58% on 5-shot MMLU-Pro compared to 62.62% for the FP8 baseline, a difference of just 0.04 percentage points. This suggests that 4-bit pretraining can be practical without a meaningful loss in accuracy. The integration with Transformer Engine means developers can readily adopt NVFP4 in their own workflows.
Beyond the immediate result, this work paves the way for more efficient training of frontier models. With roughly 2–3× GEMM speedups over FP8 and about half the operand memory footprint, NVFP4 lowers the computational barrier for training massive LLMs, potentially enabling larger models or longer training runs within the same budget.
For further details, see the original paper on arXiv: 2509.25149.