
v2.0

@ptrendx released this 13 Feb 22:06 · 23 commits to main since this release

Release Notes – Release 2.0

Key Features and Enhancements

  • [C] Added MXFP8 support to casting, GEMM, normalization, and activation functions.
  • [C] Added generic API for quantized tensors, including generic quantize and dequantize functions.
  • [C] Exposed cuDNN LayerNorm and RMSNorm kernels.
  • [pyTorch] Added MXFP8 recipe.
  • [pyTorch] Added MXFP8 support in Linear, LayerNormLinear, LayerNormMLP, and TransformerLayer modules, and in the operation-based API.
  • [pyTorch] Changed the default quantization scheme from FP8 to MXFP8 for Blackwell GPUs.
  • [pyTorch] Added a custom tensor class for MXFP8 data.
  • [pyTorch] Reduced CPU overhead in FP8/MXFP8 execution.
  • [pyTorch] Enabled efficient handling of FP8 parameters with PyTorch FSDP2.
  • [pyTorch] Expanded the support matrix for Sliding Window Attention.
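To illustrate the quantization scheme these features target, here is a minimal sketch of MXFP8-style block scaling as described in the OCP Microscaling (MX) format: each block of 32 values shares one power-of-two scale, and elements are stored as FP8 E4M3 (largest normal value 448). This is not the Transformer Engine implementation; FP8 mantissa rounding is approximated by clipping only, to keep the sketch short.

```python
import math

BLOCK = 32          # MX block size
E4M3_MAX = 448.0    # largest normal E4M3 value
E4M3_EMAX = 8       # exponent of the power-of-two bracket containing E4M3_MAX

def quantize_block(values):
    """Return (shared power-of-two scale, scaled and clipped elements)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 1.0, [0.0] * len(values)
    # Shared scale: a power of two chosen so the block's amax lands near E4M3_MAX
    scale = 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)
    elems = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in values]
    return scale, elems

def dequantize_block(scale, elems):
    return [e * scale for e in elems]

data = [0.01 * i - 0.15 for i in range(BLOCK)]
scale, q = quantize_block(data)
out = dequantize_block(scale, q)
# scale is a power of two, so the roundtrip is exact whenever no element clips
err = max(abs(a - b) for a, b in zip(data, out))
print(scale, err)
```

Because the shared scale is a pure power of two, dividing and re-multiplying is exact in binary floating point; real MXFP8 additionally rounds each element to the E4M3 grid, which this sketch omits.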

Fixed Issues

  • [pyTorch] Fixed bugs in capturing CUDA Graphs for MoE models.
  • [pyTorch] Fixed errors with the FP8 state when loading HuggingFace checkpoints.

Known Issues in This Release

  • [pyTorch] Overlapping tensor-parallel communication with Userbuffers is not supported with MXFP8.
  • [pyTorch] When running linear modules with MXFP8, the memory footprint and tensor-parallel communication volume are larger than necessary.
  • [pyTorch] Userbuffers support in the operation-based API is disabled.

Breaking Changes in This Release

  • [C] Updated minimum requirements to CUDA 12.1 and cuDNN 9.3.
  • [PaddlePaddle] Removed PaddlePaddle integration.
  • [pyTorch] Changed the default quantization from FP8 to MXFP8 for Blackwell GPUs.
  • [pyTorch] Removed support for exporting ONNX models. Support for ONNX export will be re-enabled in a future release.
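The raised minimum requirements (CUDA 12.1, cuDNN 9.3) can be gated at startup. A minimal sketch, assuming you obtain the installed version strings yourself (for example from `torch.version.cuda`); the function names here are illustrative, not part of any library:

```python
def version_tuple(s):
    """Parse a dotted version string like "12.1" into a comparable tuple."""
    return tuple(int(p) for p in s.split("."))

def meets_minimums(cuda_version, cudnn_version,
                   min_cuda="12.1", min_cudnn="9.3"):
    """Check both installed versions against the v2.0 minimums."""
    return (version_tuple(cuda_version) >= version_tuple(min_cuda)
            and version_tuple(cudnn_version) >= version_tuple(min_cudnn))

print(meets_minimums("12.4", "9.5"))   # newer than both minimums
print(meets_minimums("12.0", "9.3"))   # CUDA below the minimum
```

Tuple comparison handles mixed-length strings such as "12.1.0" versus "12.1" correctly in Python, so patch components need no special handling.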

Deprecated Features

There are no deprecated features in this release.