Full documentation for MIGraphX is available at https://rocmdocs.amd.com/projects/AMDMIGraphX/en/latest/.
- Initial code to run on Windows
- Support for gfx120x GPU
- Support for FP8 and INT4
- Support for the Log2 internal operator
- Support for the GCC 14 compiler
- The BitwiseAnd, Scan, SoftmaxCrossEntropyLoss, GridSample, and NegativeLogLikelihoodLoss ONNX operators
- The MatMulNBits, QuantizeLinear/DequantizeLinear, GroupQueryAttention, SkipSimplifiedLayerNormalization, and SimplifiedLayerNormalization Microsoft Contrib operators
- Dynamic batch parameter support for the OneHot operator
- Split-K as an optional performance improvement
- Scripts to validate ONNX models from the ONNX Model Zoo
- GPU Pooling Kernel
- The --mlir flag for the migraphx-driver program to offload the entire module to MLIR
- Fusing split-reduce with MLIR
- Multiple outputs for the MLIR + Pointwise fusions
- Pointwise fusions with MLIR across reshape operations
- MIGRAPHX_MLIR_DUMP environment variable to dump MLIR modules to MXRs
- The 3 option for MIGRAPHX_TRACE_BENCHMARKING to print the MLIR program for improved debug output (a usage sketch for these trace variables follows this list of additions)
- MIGRAPHX_ENABLE_HIPBLASLT_GEMM environment variable to call hipBLASLt libraries
- MIGRAPHX_VERIFY_DUMP_DIFF to improve the debugging of accuracy issues
- reduce_any and reduce_all options to the Reduce operation via Torch MIGraphX
- Examples for RNNT and ControlNet
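
The benchmarking and dump variables above are plain environment variables read by MIGraphX at run time. The Python sketch below sets two of them and invokes migraphx-driver perf on a placeholder model; the variable names and the value 3 come from these notes, but treating MIGRAPHX_MLIR_DUMP as a dump directory is an assumption worth verifying against the environment-variable documentation.

```python
# Minimal sketch: benchmark a model with extra MLIR debug output enabled.
# "model.onnx" and the dump directory are placeholders; treating
# MIGRAPHX_MLIR_DUMP as a directory for the dumped MXR modules is an assumption.
import os
import subprocess

env = dict(os.environ)
env["MIGRAPHX_TRACE_BENCHMARKING"] = "3"     # print the MLIR program while benchmarking
env["MIGRAPHX_MLIR_DUMP"] = "/tmp/mlir_mxr"  # assumed: target directory for dumped modules
os.makedirs("/tmp/mlir_mxr", exist_ok=True)

# "perf" and "--exhaustive-tune" are existing migraphx-driver options
# referenced elsewhere in these notes.
subprocess.run(
    ["migraphx-driver", "perf", "model.onnx", "--exhaustive-tune"],
    env=env,
    check=True,
)
```
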
- Switched to MLIR's 3D Convolution operator.
- MLIR is now used for Attention operations by default on gfx942 and newer ASICs.
- Names and locations for VRM specific libraries have changed.
- Use random mode for benchmarking GEMMs and convolutions.
- Python version is now printed with an actual version number.
- Disabled requirements for MIOpen and rocBLAS when running on Windows.
- Removed inaccurate warning messages when using exhaustive-tune.
- Removed the hard-coded path in MIGRAPHX_CXX_COMPILER, allowing the compiler to be installed in different locations.
- Improved:
  - Infrastructure code to enable better kernel fusions with all supported data types
  - Subsequent model compile time by creating a cache for already performant kernels
  - Use of Attention fusion with models
  - Performance of the Softmax JIT kernel and of the Pooling operator
  - Tuning operations through a new 50 ms delay before running the next kernel
  - Performance of several convolution-based models through an optimized NHWC layout
  - Performance for the FP8 data type
  - GPU utilization
  - Verification tools
  - Debug prints
  - Documentation, including gpu-driver utility documentation
  - Summary section of the migraphx-driver perf command
- Reduced model compilation time
- Reordered some compiler passes to allow for more fusions
- Preloaded tiles into LDS to improve performance of pointwise transposes
- Exposed the external_data_path property in onnx_options to set the path from onnxruntime
- Fixed a bug with gfx1030 that overwrote dpp_reduce.
- Fixed a bug in the one-argument dynamic reshape that created a failure.
- Fixed a bug with dot_broadcast and inner_broadcast that caused compile failures.
- Fixed a bug where some configs were failing when using exhaustive-tune.
- Fixed the ROCm Install Guide URL.
- Fixed an issue while building a whl package due to an apostrophe.
- Fixed the BERT Squad example requirements file to support different versions of Python.
- Fixed a bug that stopped the Vicuna model from compiling.
- Fixed failures with the verify option of migraphx-driver that would cause the application to exit early.
- Added support for ONNX Runtime MIGraphX EP on Windows (see the usage sketch after this list)
- Added FP8 Python API
- Added examples for SD 2.1 and SDXL
- Improved Dynamic Batch to support BERT
- Added a --test flag in migraphx-driver to validate the installation
- Added support for the Einsum ONNX operator
- Added uint8 support in ONNX Operators
- Added fusion for group convolutions
- Added rocMLIR conv3d support
- Added rocgdb to the Dockerfile
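
The ONNX Runtime MIGraphX execution provider mentioned at the top of this list is selected through ONNX Runtime's standard provider mechanism; a minimal sketch, assuming an onnxruntime build that includes the MIGraphX EP and a placeholder model.onnx with a single image-like input, is shown below.

```python
# Minimal sketch: run an ONNX model through the ONNX Runtime MIGraphX EP,
# falling back to the CPU provider for anything the EP does not handle.
# "model.onnx" and the input shape are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["MIGraphXExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print([o.shape for o in outputs])
```
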
- Improved ONNX Model Zoo coverage
- Reorganized memcpys with ONNX Runtime to improve performance
- Replaced scalar multibroadcast + unsqueeze with just a multibroadcast
- Improved MLIR kernel selection for multibroadcasted GEMMs
- Improved details of the perf report
- Enable mlir by default for GEMMs with small K
- Allow specifying dot or convolution fusion for mlir with environmental flag
- Improve performance on small reductions by doing multiple reduction per wavefront
- Add additional algebraic simplifications for mul-add-dot sequence of operations involving constants
- Use MLIR attention kernels in more cases
- Enables MIOpen and CK fusions for MI300 gfx arches
- Support for QDQ quantization patterns from Brevitas which have explicit cast/convert nodes before and after QDQ pairs
- Added Fusion of "contiguous + pointwise" and "layout + pointwise" operations which may result in performance gains in certain cases
- Added Fusion for "pointwise + layout" and "pointwise + contiguous" operations which may result in performance gains when using NHWC layout
- Added Fusion for "Pointwise + concat" operation which may help in performance in certain cases
- Fixes a bug in "concat + pointwise" fusion where output shape memory layout wasn't maintained
- Simplifies "slice + concat" pattern in SDXL UNet
- Eliminates ZeroPoint/Shift in QuantizeLinear or DequantizeLinear ops if zero-point values are zero
- Improved inference performance by fusing Reduce to Broadcast
- Added additional information when printing the perf report
- Improve scalar fusions when not all strides are 0
- Added support for multi outputs in pointwise ops
- Improve reduction fusion with reshape operators
- Use the quantized output when an operator is used again
- Fixed Super Resolution model verification failing with FP16
- Suppressed confusing messages when compiling the model
- Fixed the Mod operator failing to compile with int8 and int32 inputs
- Prevented spawning too many threads for constant propagation when parallel STL is not enabled
- Fixed a bug when running migraphx-driver with the --run 1 option
- Fixed LayerNorm accuracy by performing calculations in FP32
- Update Docker generator script to ROCm 6.1 to point at Jammy
- Fixed a floating-point exception for dim (-1) in the reshape operator
- Fixed issue with int8 accuracy and models which were failing due to requiring a fourth bias input
- Fixed previously unhandled missing inputs for quantized bias, covering the weights and data values of the input matrix
- Fixed the order of operations for int8 quantization, which was causing inaccuracies and slowdowns
- Removed the list initializer of prefix_scan_sum, which was causing issues during compilation and resulting in the incorrect constructor being used at compile time
- Fixed the MIGRAPHX_GPU_COMPILE_PARALLEL flag to enable users to control number of threads used for parallel compilation
- Changed default location of libraries with release specific ABI changes
- Reorganized documentation in GitHub
- Removed the --model flag from migraphx-driver
- Added a beta version of FP8 (functional, but not yet performant); a Python quantization sketch follows this list of additions
- Created a dockerfile with MIGraphX+ONNX Runtime EP+Torch
- Added support for the Hardmax, DynamicQuantizeLinear, Qlinearconcat, Unique, QLinearAveragePool, QLinearSigmoid, QLinearLeakyRelu, QLinearMul, and IsInf operators
- Created website examples for Whisper, Llama-2, and Stable Diffusion 2.1
- Created examples of using the ONNX Runtime MIGraphX Execution Provider with the InceptionV3 and Resnet50 models
- Updated operators to support ONNX Opset 19
- Enable fuse_pointwise and fuse_reduce in the driver
- Add support for dot-(mul)-softmax-dot offloads to MLIR
- Added BLAS auto-tuning for GEMMs
- Added dynamic shape support for the multinomial operator
- Added fp16 to accuracy checker
- Added initial code for running on Windows OS
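
For the FP8 and FP16 items above, the established MIGraphX Python flow is to parse, quantize, compile, and run; the sketch below uses the existing quantize_fp16 entry point on a placeholder model and assumes that migraphx.argument accepts NumPy arrays. Whether the beta FP8 path exposes an analogous quantize_fp8 function is an assumption and should be checked against the Python API reference.

```python
# Minimal sketch: parse an ONNX model, quantize to FP16, compile for the GPU,
# and run it with random inputs. "model.onnx" is a placeholder and the inputs
# are assumed to be float32.
import numpy as np
import migraphx

prog = migraphx.parse_onnx("model.onnx")
migraphx.quantize_fp16(prog)          # existing FP16 quantization entry point
# migraphx.quantize_fp8(prog)         # assumed name for the beta FP8 path; verify first

prog.compile(migraphx.get_target("gpu"), offload_copy=True)

params = {}
for name, shape in prog.get_parameter_shapes().items():
    params[name] = migraphx.argument(
        np.random.rand(*shape.lens()).astype(np.float32)
    )
results = prog.run(params)
```
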
- Improved the output of migraphx-driver command
- Documentation now shows all environment variables
- Updates needed for general stride support
- Enabled Asymmetric Quantization
- Added support for previously unsupported ScatterND reduction modes
- Rewrote softmax for better performance
- General improvement to how quantization is performed to support INT8
- Used problem_cache for gemm tuning
- Improved performance by always using rocMLIR for quantized convolution
- Improved group convolutions by using rocMLIR
- Improved accuracy of fp16 models
- Added support for previously unsupported ScatterElements reduction modes
- Added concat fusions
- Improved INT8 support to include UINT8
- Allow reshape ops between dq and quant_op
- Improve dpp reductions on navi
- Have the accuracy checker print the whole final buffer
- Added support for handling dynamic Slice and ConstantOfShape ONNX operators
- Add support for the dilations attribute to Pooling ops
- Add layout attribute support for LSTM operator
- Improved performance by removing contiguous for reshapes
- Handle all slice input variations
- Add scales attribute parse in upsample for older opset versions
- Added support for uneven Split operations
- Improved unit testing to run in python virtual environments
- Fixed outstanding issues in autogenerated documentation
- Update model zoo paths for examples
- Fixed promote_literals_test by using an additional if condition
- Fixed export API symbols from dynamic library
- Fixed bug in pad operator from dimension reduction
- Fixed using the LD to embed files and enabled it by default when building shared libraries on Linux
- Fixed get_version()
- Fixed Round operator inaccuracy
- Fixed wrong size check when axes not present for slice
- Set the .SO version correctly
- Cleaned up LSTM and RNN activation functions
- Placed gemm_pointwise at a higher priority than layernorm_pointwise
- Updated README to mention the need to include GPU_TARGETS when building MIGraphX
- Removed unused device kernels from Gather and Pad operators
- Removed int8x4 format
- Support for MI300 GPUs
- Support for TorchMIGraphX via PyTorch
- Boosted overall performance by integrating rocMLIR
- INT8 support for ONNX Runtime
- Support for ONNX version 1.14.1
- Added new operators: Qlinearadd, QlinearGlobalAveragePool, Qlinearconv, Shrink, CastLike, and RandomUniform
- Added an error message for when gpu_targets is not set during MIGraphX compilation
- Added a parameter to set tolerances with migraphx-driver verify
- Added support for MXR files > 4 GB
- Added the MIGRAPHX_TRACE_MLIR flag
- BETA: added the capability to use ROCm Composable Kernels via the MIGRAPHX_ENABLE_CK=1 environment variable (see the sketch after this list)
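
The beta Composable Kernel path and the MLIR trace flag above are both controlled through the environment; a minimal sketch of enabling them from the same process that compiles the model is below. MIGRAPHX_ENABLE_CK=1 is quoted directly from these notes, while treating MIGRAPHX_TRACE_MLIR as a simple on/off value is an assumption.

```python
# Minimal sketch: opt into the beta Composable Kernel path and MLIR tracing
# before compiling. "model.onnx" is a placeholder.
import os

os.environ["MIGRAPHX_ENABLE_CK"] = "1"    # beta: use ROCm Composable Kernels
os.environ["MIGRAPHX_TRACE_MLIR"] = "1"   # assumed: non-zero enables the MLIR trace

import migraphx

prog = migraphx.parse_onnx("model.onnx")
prog.compile(migraphx.get_target("gpu"))
```
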
- Improved performance support for INT8
- Improved time precision while benchmarking candidate kernels from CK or MLIR
- Removed contiguous from reshape parsing
- Updated the ConstantOfShape operator to support Dynamic Batch
- Simplified dynamic shapes-related operators to their static versions, where possible
- Improved debugging tools for accuracy issues
- Included a print warning about miopen_fusion while generating mxr
- General reduction in system memory usage during model compilation
- Created additional fusion opportunities during model compilation
- Improved debugging for matchers
- Improved general debug messages
- Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
- Provided a compile option to improve the accuracy of some models by disabling Fast-Math
- Improved layernorm + pointwise fusion matching to ignore argument order
- Fixed an accuracy issue with the ROIAlign operator
- Fixed computation logic for the Trilu operator
- Fixed support for the DETR model
- Changed MIGraphX version to 2.8
- Extracted the test packages into a separate deb file when building MIGraphX from source
- Removed building Python 2.7 bindings
- hipRTC no longer requires dev packages for the MIGraphX runtime and allows the ROCm install to be in a different directory than it was at build time
- Added support for multi-target execution
- Added Dynamic Batch support with C++/Python APIs
- Added migraphx.create_argument to the Python API
- Added a dockerfile example for Ubuntu 22.04
- Added TensorFlow supported ops to the driver, similar to the existing ONNX operator list
- Added a MIGRAPHX_TRACE_MATCHES_FOR env variable to filter the matcher trace
- Improved debugging by printing max, min, mean, and stddev values for TRACE_EVAL = 2
- You can now use the fast_math flag instead of an environment variable for GELU (see the compile sketch after this list)
- Print a message from the driver if offload copy is set for a compiled program
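
The fast_math item above moves the GELU behaviour from an environment variable to a compile option; the sketch below disables it for an accuracy-sensitive model. Exposing fast_math as a keyword argument of compile() in the Python API is an assumption based on these notes and should be verified against the API reference.

```python
# Minimal sketch: disable fast-math at compile time to favour accuracy.
# "model.onnx" is a placeholder; the fast_math keyword is an assumption.
import migraphx

prog = migraphx.parse_onnx("model.onnx")
prog.compile(
    migraphx.get_target("gpu"),
    offload_copy=True,
    fast_math=False,   # trade some GELU/exp speed for closer-to-reference results
)
```
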
- Optimized for ONNX Runtime 1.14.0
- Improved compile times by only building for the GPU on the system
- Improved performance of pointwise/reduction kernels when using NHWC layouts
- Loaded a specific version of the migraphx_py library
- Annotated functions with the block size so the compiler can do a better job of optimizing
- Enabled reshape on nonstandard shapes
- Used half HIP APIs to compute max and min
- Added support for broadcasted scalars to unsqueeze operator
- Improved multiplies with dot operator
- Handled broadcasts across dot and concat
- Added verify namespace for better symbol resolution
- Resolved accuracy issues with FP16 resnet50
- Updated cpp generator to handle inf from float
- Fixed assertion error during verify and made DCE work with tuples
- Fixed convert operation for NaNs
- Fixed shape typo in API test
- Fixed compile warnings for shadowing variable names
- Added missing specialization for the nullptr hash function
- Bumped version of half library to 5.6.0
- Bumped CI to support ROCm 5.6
- Made building tests optional
- Replaced np.bool with bool per NumPy request
- Removed int8x4 rocBLAS calls due to deprecation
- Removed std::reduce usage because not all operating systems support it
- Y-Model feature will store tuning information with the optimized model (a save/load sketch follows this list of additions)
- Added Python 3.10 bindings
- Accuracy checker tool based on ONNX runtime
- ONNX operators parse_split and Trilu
- Build support for ROCm MLIR
- Added the migraphx-driver flag to print optimizations in Python (--python)
- Added a JIT implementation of the Gather and Pad operators, which results in better handling for larger tensor sizes
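
The Y-Model item above keeps tuning information together with the optimized program, which pairs naturally with saving compiled programs as MXR files; a minimal sketch, assuming migraphx.save and migraphx.load accept a plain file path and default to the MXR serialization, is below.

```python
# Minimal sketch: compile once, save the optimized program (including its
# tuning information) as an MXR file, then reload it without recompiling.
# "model.onnx" and "model.mxr" are placeholders.
import migraphx

prog = migraphx.parse_onnx("model.onnx")
prog.compile(migraphx.get_target("gpu"))
migraphx.save(prog, "model.mxr")          # assumed to default to the MXR/msgpack format

restored = migraphx.load("model.mxr")
print(restored.get_parameter_shapes())
```
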
- Improved performance of Transformer-based models
- Improved performance of the Pad, Concat, Gather, and Pointwise operators
- Improved ONNX/pb file loading speed
- Added a general optimize pass that runs several passes, such as simplify_reshapes, algebraic simplification, and DCE, in a loop
- Improved parsing for TensorFlow Protobuf files
- Resolved various accuracy issues with some ONNX models
- Resolved a gcc-12 issue with MIVisionX
- Improved support for larger sized models and batches
- Use --offload-arch instead of --cuda-gpu-arch for the HIP compiler
- Changes inside JIT to use a float accumulator for large reduce ops of half type to avoid overflow
- Changes inside JIT to temporarily use cosine to compute sine function
- Changed version and location of third-party build dependencies in order to pick up fixes