From 23e00b131ec4be7a0927619e36a1425b39c420de Mon Sep 17 00:00:00 2001
From: Viktoriia <viktoria.gvozdeva@intel.com>
Date: Wed, 22 Jan 2025 19:54:07 +0100
Subject: [PATCH 01/29] doc: update release notes

---
 RELEASE_NOTES.md | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)
 create mode 100644 RELEASE_NOTES.md

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
new file mode 100644
index 00000000000..e577319e537
--- /dev/null
+++ b/RELEASE_NOTES.md
@@ -0,0 +1,38 @@
+# Performance Optimizations
+## Intel Architecture Processors
+  * Improved fp16/bf16 softmax performance with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
+  * Added support and improved perfomance for fp8 matmul with bf16/fp16.
+
+## Intel Graphics Products
+  * Introduced initial optimizations for GPUs based on Xe3 architecture.
+  * Improved performance for convolution for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
+
+## AArch64-based Processors
+
+# Functionality
+  * Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
+  * Enabled support for matmul primitive with grouped quantization on weight along N dimension
+  * Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
+  * Introduced support for grouped scales and zero points in reorder primitive.
+  * Enabled support for 4d weight scale in matmul primitive.
+  * [experimental] Extended microkernel API:
+		Introduced int4 quantization support.
+		Fpmath mode API
+# Usability
+  * Relaxed memory object lifetime requirements created with CPU engine and SYCL runtime. New behavior is aligned with GPU engine.
+  * Improve verbose diagnostic to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
+  * Improve verbose diagnostic to simplify debugging of nGEN fallbacks.
+  * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
+# Validation
+  * Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
+# Deprecated Functionality
+
+# Breaking Changes
+  * Updated minimal supported CMake version to 3.13 (was 2.8.12).
+  * Updated minimal supported GCC version to 8.0 (was 4.8).
+  * Updated minimal supported Clang version to 11.0 (was 3.0).
+# Thanks to these Contributors
+
+This release contains contributions from the [project core team] as well as Michał Górny @mgorny, Fadi Arafeh @fadara01, John Osorio @kala855, Ravi Pushkar @rpushkarr, Marek Michalowski @michalowski-arm, Renato Barros Arantes @renato-arantes, Ryo Suzuki @Ryo-not-rio, Varad Ahirwadkar @varad-ahirwadkar, Tadej Ciglarič @t4c1, Nikhil Sharma @nikhilfujitsu, @taoye9, @Shreyas-fuj, @raistefintel. We would also like to thank everyone who asked questions and reported issues.
+
+[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/MAINTAINERS.md

From 8be4815a3ce843d79b20aef01f0883d1779f2bd8 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 12:01:59 +0100
Subject: [PATCH 02/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
---
 RELEASE_NOTES.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index e577319e537..950c06b8dad 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -6,7 +6,9 @@
 ## Intel Graphics Products
   * Introduced initial optimizations for GPUs based on Xe3 architecture.
   * Improved performance for convolution for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
-
+  * Improved performance of the following subgraphs with Graph API
+    * Scaled dot-product Attention (SDPA) [with causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
+    * Scaled dot-product Attention (SDPA) [with compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
 ## AArch64-based Processors
 
 # Functionality

From 73e9276032c87a792943e320214b94ba4fb16816 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 12:05:05 +0100
Subject: [PATCH 03/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
---
 RELEASE_NOTES.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 950c06b8dad..f8a588350a8 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -25,6 +25,7 @@
   * Improve verbose diagnostic to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
   * Improve verbose diagnostic to simplify debugging of nGEN fallbacks.
   * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
+  * Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP
 # Validation
   * Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
 # Deprecated Functionality

From f87e982c42f3016fb3e300fa94b72a20495da5db Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 12:05:34 +0100
Subject: [PATCH 04/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
---
 RELEASE_NOTES.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index f8a588350a8..cbfd0478a49 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -34,6 +34,7 @@
   * Updated minimal supported CMake version to 3.13 (was 2.8.12).
   * Updated minimal supported GCC version to 8.0 (was 4.8).
   * Updated minimal supported Clang version to 11.0 (was 3.0).
+  * Removed support for SYCL older than 2020
 # Thanks to these Contributors
 
 This release contains contributions from the [project core team] as well as Michał Górny @mgorny, Fadi Arafeh @fadara01, John Osorio @kala855, Ravi Pushkar @rpushkarr, Marek Michalowski @michalowski-arm, Renato Barros Arantes @renato-arantes, Ryo Suzuki @Ryo-not-rio, Varad Ahirwadkar @varad-ahirwadkar, Tadej Ciglarič @t4c1, Nikhil Sharma @nikhilfujitsu, @taoye9, @Shreyas-fuj, @raistefintel. We would also like to thank everyone who asked questions and reported issues.

From 6a59eb1cc06ec156e073bd2a3865495b4662625f Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 12:12:06 +0100
Subject: [PATCH 05/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index cbfd0478a49..ee9b0ee3785 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -21,7 +21,7 @@
 		Introduced int4 quantization support.
 		Fpmath mode API
 # Usability
-  * Relaxed memory object lifetime requirements created with CPU engine and SYCL runtime. New behavior is aligned with GPU engine.
+  * With SYCL runtime, memory objects on CPU engine are now reference-counted and no more need to be explicitly kept alive by user for the duration of the primitive execution. This align memory object lifetime behavior on CPU and GPU engines.
   * Improve verbose diagnostic to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
   * Improve verbose diagnostic to simplify debugging of nGEN fallbacks.
   * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.

From e12db5a95f80adb32a9028c35fa730f4a3eed9de Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 12:12:20 +0100
Subject: [PATCH 06/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
---
 RELEASE_NOTES.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index ee9b0ee3785..50d2df9dbd6 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -17,6 +17,7 @@
   * Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
   * Introduced support for grouped scales and zero points in reorder primitive.
   * Enabled support for 4d weight scale in matmul primitive.
+  * Introduced preliminary support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder.
   * [experimental] Extended microkernel API:
 		Introduced int4 quantization support.
 		Fpmath mode API

From c4cafcd65a1aae38b520523fdfbe5f34cfae0f9a Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 12:12:30 +0100
Subject: [PATCH 07/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
---
 RELEASE_NOTES.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 50d2df9dbd6..acdcbe5d364 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -14,6 +14,7 @@
 # Functionality
   * Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
   * Enabled support for matmul primitive with grouped quantization on weight along N dimension
+  * Graph API: new [`select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html) and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations.
   * Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
   * Introduced support for grouped scales and zero points in reorder primitive.
   * Enabled support for 4d weight scale in matmul primitive.

From 632296f39777954c7e41adfc5669a1ff25473a44 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 21:28:08 +0100
Subject: [PATCH 08/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
---
 RELEASE_NOTES.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index acdcbe5d364..95bdf60d210 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -18,6 +18,7 @@
   * Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
   * Introduced support for grouped scales and zero points in reorder primitive.
   * Enabled support for 4d weight scale in matmul primitive.
+  * Graph API: added support for Quantized and non-quantized Gated MLP pattern
   * Introduced preliminary support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder.
   * [experimental] Extended microkernel API:
 		Introduced int4 quantization support.

From 128ba81d41449cf94ccc7845da7844edc023df3a Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 21:28:23 +0100
Subject: [PATCH 09/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Tao Lv <tao.a.lv@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 95bdf60d210..49fcf67c5c2 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -7,7 +7,7 @@
   * Introduced initial optimizations for GPUs based on Xe3 architecture.
   * Improved performance for convolution for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
   * Improved performance of the following subgraphs with Graph API
-    * Scaled dot-product Attention (SDPA) [with causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
+    * Scaled dot-product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
     * Scaled dot-product Attention (SDPA) [with compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
 ## AArch64-based Processors
 

From b1b6fc45805e4953cf2b695fb5f5d141ae3cc912 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 21:31:36 +0100
Subject: [PATCH 10/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Vadim Pirogov <vadim.o.pirogov@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 49fcf67c5c2..6f774740a7d 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -40,6 +40,6 @@
   * Removed support for SYCL older than 2020
 # Thanks to these Contributors
 
-This release contains contributions from the [project core team] as well as Michał Górny @mgorny, Fadi Arafeh @fadara01, John Osorio @kala855, Ravi Pushkar @rpushkarr, Marek Michalowski @michalowski-arm, Renato Barros Arantes @renato-arantes, Ryo Suzuki @Ryo-not-rio, Varad Ahirwadkar @varad-ahirwadkar, Tadej Ciglarič @t4c1, Nikhil Sharma @nikhilfujitsu, @taoye9, @Shreyas-fuj, @raistefintel. We would also like to thank everyone who asked questions and reported issues.
+This release contains contributions from the [project core team] as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.
 
 [project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/MAINTAINERS.md

From 56d15628401d254b323cffc556470ab1b8d2fa6e Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 21:31:49 +0100
Subject: [PATCH 11/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Tao Lv <tao.a.lv@intel.com>
---
 RELEASE_NOTES.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 6f774740a7d..ac830fab369 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -31,6 +31,8 @@
   * Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP
 # Validation
   * Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
+  * Extended benchdnn with support for rewriting data types in the test JSON files in graph driver.
+  * Extended benchdnn with support and validation for the number of partition returned from the test JSON files.
 # Deprecated Functionality
 
 # Breaking Changes

From b74779a82a0f7c7fe47f0f16d7e3437c3f815b47 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 21:32:04 +0100
Subject: [PATCH 12/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Tao Lv <tao.a.lv@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index ac830fab369..c4b3e64c18b 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -14,7 +14,7 @@
 # Functionality
   * Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
   * Enabled support for matmul primitive with grouped quantization on weight along N dimension
-  * Graph API: new [`select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html) and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations.
+  * Graph API: new [`Select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html) and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations.
   * Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
   * Introduced support for grouped scales and zero points in reorder primitive.
   * Enabled support for 4d weight scale in matmul primitive.

From a5dbb429e54a3c2c8b35df8b3dbbe41b009034bd Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 21:32:28 +0100
Subject: [PATCH 13/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Tao Lv <tao.a.lv@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index c4b3e64c18b..26c8d067d92 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -8,7 +8,7 @@
   * Improved performance for convolution for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
   * Improved performance of the following subgraphs with Graph API
     * Scaled dot-product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
-    * Scaled dot-product Attention (SDPA) [with compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
+    * Scaled dot-product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
 ## AArch64-based Processors
 
 # Functionality

From 60a8ad7d370ed22b64537e42360ae57b9bdec5cd Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Thu, 23 Jan 2025 21:33:36 +0100
Subject: [PATCH 14/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Vadim Pirogov <vadim.o.pirogov@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 26c8d067d92..310139a1f02 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -5,7 +5,7 @@
 
 ## Intel Graphics Products
   * Introduced initial optimizations for GPUs based on Xe3 architecture.
-  * Improved performance for convolution for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
+  * Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
   * Improved performance of the following subgraphs with Graph API
     * Scaled dot-product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
     * Scaled dot-product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)

From c5c9ce45029b91bc44cdc91a76e3bf5b83b5db71 Mon Sep 17 00:00:00 2001
From: Viktoriia <viktoria.gvozdeva@intel.com>
Date: Fri, 24 Jan 2025 12:12:09 +0100
Subject: [PATCH 15/29] doc: Add CPU information

---
 RELEASE_NOTES.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 310139a1f02..7ca916e6f00 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -1,7 +1,11 @@
 # Performance Optimizations
 ## Intel Architecture Processors
   * Improved fp16/bf16 softmax performance with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
-  * Added support and improved perfomance for fp8 matmul with bf16/fp16.
+  * Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
+  * Improved performance of convolution and matmul primitives on processors with Intel AMX support.
+  * Improved performance of fp8 matmul primitives with bf16 and fp16 bias datatype on processors with Intel AMX instruction set support.
+  * Improved performance of int8 matmul primitive with fp16 output datatype.
+  * Improved performance of int8 depthwise separable convolution primitive with pre-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.
 
 ## Intel Graphics Products
   * Introduced initial optimizations for GPUs based on Xe3 architecture.

From cf5db22efcf6481b05cef27a60475f9d3659442d Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 13:19:55 +0100
Subject: [PATCH 16/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Vadim Pirogov <vadim.o.pirogov@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 7ca916e6f00..1bf797b5efa 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -30,7 +30,7 @@
 # Usability
   * With SYCL runtime, memory objects on CPU engine are now reference-counted and no more need to be explicitly kept alive by user for the duration of the primitive execution. This align memory object lifetime behavior on CPU and GPU engines.
   * Improve verbose diagnostic to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
-  * Improve verbose diagnostic to simplify debugging of nGEN fallbacks.
+  * Improved verbose diagnostics for Intel GPU driver compatibility issues.
   * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
   * Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP
 # Validation

From 31b117019d463be5f8abaa20a35857e7ee0ee025 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:42:34 +0100
Subject: [PATCH 17/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 1bf797b5efa..dab926538b4 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -36,7 +36,7 @@
 # Validation
   * Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
   * Extended benchdnn with support for rewriting data types in the test JSON files in graph driver.
-  * Extended benchdnn with support and validation for the number of partition returned from the test JSON files.
+  * Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
 # Deprecated Functionality
 
 # Breaking Changes

From 7fdc7cfbce2fbe46d7ba0581e7e0d0a2b61d41b4 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:42:47 +0100
Subject: [PATCH 18/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index dab926538b4..b71a2247e6a 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -35,7 +35,7 @@
   * Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP
 # Validation
   * Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
-  * Extended benchdnn with support for rewriting data types in the test JSON files in graph driver.
+  * Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
   * Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
 # Deprecated Functionality
 

From f8305d74e337643d7b4c13e70157a03f1976f381 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:43:03 +0100
Subject: [PATCH 19/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index b71a2247e6a..59bf29675fd 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -12,7 +12,7 @@
   * Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
   * Improved performance of the following subgraphs with Graph API
     * Scaled dot-product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
-    * Scaled dot-product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
+    * Scaled Dot-Product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
 ## AArch64-based Processors
 
 # Functionality

From a18e6d0aa5d396528357a96ffa4447032a880918 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:43:44 +0100
Subject: [PATCH 20/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 59bf29675fd..a4674ecf7ab 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -4,7 +4,7 @@
   * Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
   * Improved performance of convolution and matmul primitives on processors with Intel AMX support.
   * Improved performance of fp8 matmul primitives with bf16 and fp16 bias datatype on processors with Intel AMX instruction set support.
-  * Improved performance of int8 matmul primitive with fp16 output datatype.
+  * Improved performance of int8 matmul primitive with fp16 output data type.
   * Improved performance of int8 depthwise separable convolution primitive with pre-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.
 
 ## Intel Graphics Products

From 93a14c95b48354cfcbda7ecc7ca77212b29b30eb Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:59:18 +0100
Subject: [PATCH 21/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index a4674ecf7ab..422849583c5 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -11,7 +11,7 @@
   * Introduced initial optimizations for GPUs based on Xe3 architecture.
   * Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
   * Improved performance of the following subgraphs with Graph API
-    * Scaled dot-product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
+    * Scaled Dot-Product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
     * Scaled Dot-Product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
 ## AArch64-based Processors
 

From c4c6e263f5a7a4668927659bda33b38901dd3aa9 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:59:30 +0100
Subject: [PATCH 22/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 422849583c5..402e945d215 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -3,7 +3,7 @@
   * Improved fp16/bf16 softmax performance with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
   * Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
   * Improved performance of convolution and matmul primitives on processors with Intel AMX support.
-  * Improved performance of fp8 matmul primitives with bf16 and fp16 bias datatype on processors with Intel AMX instruction set support.
+  * Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on processors with Intel AMX instruction set support.
   * Improved performance of int8 matmul primitive with fp16 output data type.
   * Improved performance of int8 depthwise separable convolution primitive with pre-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.
 

From 358438f283d5e19228762971188f247a876e4845 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:59:41 +0100
Subject: [PATCH 23/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 402e945d215..3c726a4f895 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -29,7 +29,7 @@
 		Fpmath mode API
 # Usability
   * With SYCL runtime, memory objects on CPU engine are now reference-counted and no more need to be explicitly kept alive by user for the duration of the primitive execution. This align memory object lifetime behavior on CPU and GPU engines.
-  * Improve verbose diagnostic to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
+  * Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
   * Improved verbose diagnostics for Intel GPU driver compatibility issues.
   * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
   * Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP

From 84e671b5385cef0412f4e5ed4a8b2c2f3fa36177 Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 15:59:55 +0100
Subject: [PATCH 24/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 3c726a4f895..2283a8cc1f8 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -22,7 +22,7 @@
   * Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
   * Introduced support for grouped scales and zero points in reorder primitive.
   * Enabled support for 4d weight scale in matmul primitive.
-  * Graph API: added support for Quantized and non-quantized Gated MLP pattern
+  * Graph API: Added support for quantized and non-quantized Gated MLP patterns
   * Introduced preliminary support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder.
   * [experimental] Extended microkernel API:
 		Introduced int4 quantization support.

From ca55afcc1051b9a757bb16d4650009349fda6e9a Mon Sep 17 00:00:00 2001
From: Viktoriia Gvozdeva <94697960+vgvozdeva@users.noreply.github.com>
Date: Fri, 24 Jan 2025 16:00:15 +0100
Subject: [PATCH 25/29] doc: Update RELEASE_NOTES.md

Co-authored-by: Ranu Kundu <ranu.kundu@intel.com>
---
 RELEASE_NOTES.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 2283a8cc1f8..29ae7900c58 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -28,7 +28,7 @@
 		Introduced int4 quantization support.
 		Fpmath mode API
 # Usability
-  * With SYCL runtime, memory objects on CPU engine are now reference-counted and no more need to be explicitly kept alive by user for the duration of the primitive execution. This align memory object lifetime behavior on CPU and GPU engines.
+  * With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
   * Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
   * Improved verbose diagnostics for Intel GPU driver compatibility issues.
   * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.

From c83a60adcf460660b98a358db170a4c7164d678a Mon Sep 17 00:00:00 2001
From: "Pirogov, Vadim" <vadim.o.pirogov@intel.com>
Date: Fri, 31 Jan 2025 14:33:41 -0800
Subject: [PATCH 26/29] doc: incorporated additional Intel GPU input

---
 RELEASE_NOTES.md | 44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)
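
Note on the `fp32` accumulation bullet added under Breaking Changes in this patch: the sketch below is not oneDNN code. It is a standard-library Python illustration, under simplifying assumptions, of why the accumulator width matters for `fp16` data. The helper names `to_fp16`, `to_fp32`, and `dot` are hypothetical, and rounding through `struct` only approximates what real hardware does.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, accumulate_in_fp16: bool) -> float:
    """Dot product of fp16 inputs with an fp16 or fp32 running sum."""
    round_acc = to_fp16 if accumulate_in_fp16 else to_fp32
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)  # inputs are stored as fp16
        acc = round_acc(acc + prod)     # the running sum is rounded at every step
    return acc


ones = [1.0] * 4096
fp32_acc = dot(ones, ones, accumulate_in_fp16=False)
fp16_acc = dot(ones, ones, accumulate_in_fp16=True)
print(fp32_acc, fp16_acc)
```

With an `fp16` running sum, the total stops growing at 2048 because 2049 is not representable in half precision and the tie rounds back to 2048. With an `fp32` running sum, the result is the exact 4096. This is the kind of accuracy gap that a wider accumulator avoids, at some cost in speed.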

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 29ae7900c58..fd66fe02808 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -1,4 +1,5 @@
 # Performance Optimizations
+
 ## Intel Architecture Processors
   * Improved fp16/bf16 softmax performance with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
   * Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
@@ -8,42 +9,55 @@
   * Improved performance of int8 depthwise separable convolution primitive with pre-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.
 
 ## Intel Graphics Products
-  * Introduced initial optimizations for GPUs based on Xe3 architecture.
+  * Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
   * Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
-  * Improved performance of the following subgraphs with Graph API
-    * Scaled Dot-Product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa)
-    * Scaled Dot-Product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv)
+  * Improved performance of the following subgraphs with Graph API:
+    * Scaled Dot-Product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa).
+    * Scaled Dot-Product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv).
+  * Improved performance of convolution with source zero points by pre-packing compenstation.
+  * Improved performance of backward by data convolution with strides for large filter.
+
 ## AArch64-based Processors
 
 # Functionality
+
+## Common
   * Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
-  * Enabled support for matmul primitive with grouped quantization on weight along N dimension
-  * Graph API: new [`Select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html) and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations.
-  * Introduced support for fp16/bf16 compressed weights in fp32 matmul on Intel CPUs.
-  * Introduced support for grouped scales and zero points in reorder primitive.
-  * Enabled support for 4d weight scale in matmul primitive.
-  * Graph API: Added support for quantized and non-quantized Gated MLP patterns
-  * Introduced preliminary support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder.
-  * [experimental] Extended microkernel API:
-		Introduced int4 quantization support.
-		Fpmath mode API
+  * Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs.
+  * Introduced initial support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
+  * Introduced [`Select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html) and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations in Graph API.
+  * Graph API: Added support for quantized and non-quantized Gated MLP patterns.
+
+## Intel Architecture Processors
+  * Introduced support for fp16/bf16 compressed weights in fp32 matmul.
+
+## Intel Graphics Products
+  * Introduced stochastic rounding support for convolution, matmul and reorder based on Philox counter-based random number generator.
+  * Added support for strided memory formats in convolution.
+
 # Usability
   * With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
+  * Improved support of large size tensors in convolution, matmul and reduction primitives on Intel GPUs.
+  * Reduced scratchpad usage for NCHW convolution on Intel GPUs.
   * Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
   * Improved verbose diagnostics for Intel GPU driver compatibility issues.
   * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
   * Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP
+
 # Validation
   * Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
   * Extended benchdnn with support for rewriting data types in the test JSON files in the graph driver.
   * Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
+
 # Deprecated Functionality
 
 # Breaking Changes
   * Updated minimal supported CMake version to 3.13 (was 2.8.12).
   * Updated minimal supported GCC version to 8.0 (was 4.8).
   * Updated minimal supported Clang version to 11.0 (was 3.0).
-  * Removed support for SYCL older than 2020
+  * Removed support for SYCL older than 2020.
+  * Enforced `fp32` accumulation mode in `fp16` matmul and inner product primitives on Intel Graphics products without Intel XMX cores. Previous behavir can be enabled with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
+
 # Thanks to these Contributors
 
 This release contains contributions from the [project core team] as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.

From 1ce6a5653578f513f7848663ad2273154fc77818 Mon Sep 17 00:00:00 2001
From: "Pirogov, Vadim" <vadim.o.pirogov@intel.com>
Date: Fri, 31 Jan 2025 15:35:32 -0800
Subject: [PATCH 27/29] doc: edits for structure and clarity

---
 RELEASE_NOTES.md | 62 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 19 deletions(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index fd66fe02808..349086585eb 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -1,21 +1,30 @@
 # Performance Optimizations
 
 ## Intel Architecture Processors
-  * Improved fp16/bf16 softmax performance with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
-  * Improved performance for int8 RNN primitive on processors with Intel AVX2 and Intel AVX512 instruction set support.
-  * Improved performance of convolution and matmul primitives on processors with Intel AMX support.
-  * Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on processors with Intel AMX instruction set support.
-  * Improved performance of int8 matmul primitive with fp16 output data type.
-  * Improved performance of int8 depthwise separable convolution primitive with pre-channel zero points on processors with Intel AVX2 and Intel AVX512 instruction set support.
+  * Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX support (formerly Sapphire Rapids and Granite Rapids).
+  * Improved performance of `fp8` matmul primitives with `bf16` and `fp16` bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
+  * Improved performance of `int8` RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
+  * Improved performance of `int8` depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.
+  * Improved `fp16` and `bf16` softmax performance with relaxed [accumulation mode].
+  * Improved performance of `int8` matmul primitive with `fp16` output data type.
+  * Improved performance of the following subgraphs with Graph API:
+    * [Gated Multi-Layer Perceptron (Gated MLP)].
+
+[accumulation mode]: https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode
 
 ## Intel Graphics Products
   * Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
   * Improved performance for Intel Arc Graphics for Intel Core Ultra processors (Series 2) (formerly Lunar Lake) and Intel Arc B-series discrete graphics (formerly Battlemage).
-  * Improved performance of the following subgraphs with Graph API:
-    * Scaled Dot-Product Attention (SDPA) [with implicit causal mask](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa).
-    * Scaled Dot-Product Attention (SDPA) [with int8/int4 compressed key and value](https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv).
   * Improved performance of convolution with source zero points by pre-packing compenstation.
   * Improved performance of backward by data convolution with strides for large filter.
+  * Improved performance of the following subgraphs with Graph API:
+    * Scaled Dot-Product Attention (SDPA) with [implicit causal mask].
+    * SDPA with [`int8` or `int4` compressed key and value].
+    * Gated-MLP.
+
+[implicit causal mask]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa
+[`int8` or `int4` compressed key and value]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv
+[Gated Multi-Layer Perceptron (Gated MLP)]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_gated_mlp.html#doxid-dev-guide-graph-gated-mlp
 
 ## AArch64-based Processors
 
@@ -25,24 +34,36 @@
   * Introduced support for `select` algorithm in binary primitive. The functionality is optimized for Intel CPUs.
   * Extended quantization support in matmul and reorder with grouped scales and zero-points for weights. This functionality is optimized for Intel CPUs and GPUs.
   * Introduced initial support for 4-bit floating-point data types `f4_e2m1` and `f4_e3m0` in matmul and reorder, as well as `e8m0` scales data type in matmul and reorder. This functionality is available on Intel CPUs and GPUs.
-  * Introduced [`Select`](https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html), [`GenIndex`](https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html) and [`GreaterEqual`](https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html) operations in Graph API.
-  * Graph API: Added support for quantized and non-quantized Gated MLP patterns.
+  * Introduced [`Select`], [`GenIndex`], and [`GreaterEqual`] operations in Graph API.
+
+[`Select`]: https://oneapi-src.github.io/oneDNN/dev_guide_op_select.html
+[`GenIndex`]: https://oneapi-src.github.io/oneDNN/dev_guide_op_genindex.html
+[`GreaterEqual`]: https://oneapi-src.github.io/oneDNN/dev_guide_op_greaterequal.html
 
 ## Intel Architecture Processors
-  * Introduced support for fp16/bf16 compressed weights in fp32 matmul.
+  * Introduced support for `fp32` matmul with `fp16` and `bf16` weights.
 
 ## Intel Graphics Products
   * Introduced stochastic rounding support for convolution, matmul and reorder based on Philox counter-based random number generator.
-  * Added support for strided memory formats in convolution.
+  * Introduced support for strided memory formats in convolution.
 
 # Usability
+
+## Common
   * With the SYCL runtime, memory objects on the CPU engine are now reference-counted and no longer need to be explicitly kept alive for the duration of the primitive execution. This aligns memory object lifetime behavior on CPU and GPU engines.
+  * Added Graph API examples for [Gated MLP] and [`int4` Gated MLP] patterns.
+
+[Gated MLP]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/examples/graph/gated_mlp.cpp
+[`int4` Gated MLP]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.7/examples/graph/gated_mlp_int4.cpp
+
+## Intel Architecture Processors
+  * Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for Intel CPU and Intel GPU implementations.
+  * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
+
+## Intel Graphics Products
+  * Improved verbose diagnostics for Intel GPU driver compatibility issues.
   * Improved support of large size tensors in convolution, matmul and reduction primitives on Intel GPUs.
   * Reduced scratchpad usage for NCHW convolution on Intel GPUs.
-  * Improved verbose diagnostics to better identify issues during dispatching, primitive and kernel creation for CPU primitive and GPU (in case of OpenCL implementation) primitive implementations.
-  * Improved verbose diagnostics for Intel GPU driver compatibility issues.
-  * Enabled frame pointers support on Intel64 platforms to improve integration with profilers.
-  * Added [examples](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph) for Gated MLP and int4 Gated MLP
 
 # Validation
   * Extended benchdnn with support and validation for fp8 matmul patterns for tensor tags in RNN primitive validation.
@@ -50,15 +71,18 @@
   * Extended benchdnn with support and validation for the number of partitions returned from the test JSON files.
 
 # Deprecated Functionality
+  * Experimental [Graph Compiler] is deprecated and will be removed in future releases.
+
+[Graph Compiler]: https://oneapi-src.github.io/oneDNN/v3.7/dev_guide_graph_compiler.html
 
 # Breaking Changes
   * Updated minimal supported CMake version to 3.13 (was 2.8.12).
   * Updated minimal supported GCC version to 8.0 (was 4.8).
   * Updated minimal supported Clang version to 11.0 (was 3.0).
   * Removed support for SYCL older than 2020.
-  * Enforced `fp32` accumulation mode in `fp16` matmul and inner product primitives on Intel Graphics products without Intel XMX cores. Previous behavir can be enabled with relaxed [accumulation mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_accumulation_mode.html#doxid-dev-guide-attributes-accumulation-mode).
+  * Enforced `fp32` accumulation mode in `fp16` matmul and inner product primitives on Intel Graphics products without Intel XMX cores. Previous behavior can be enabled with relaxed [accumulation mode].
 
-# Thanks to these Contributors
+# Thanks to our Contributors
 
 This release contains contributions from the [project core team] as well as Aditya Tewari @aditew01, Alexandra Sidorova @a-sidorova, Atharva Dubey @AD2605, Deb Taylor @deb-intel, Dmitriy Ovchinnikov @inteldimitrius, Fadi Arafeh @fadara01, Hengyu Meng @airMeng, @hmaciak, John Osorio @kala855, Marek Michalowski @michalowski-arm, Michael Froelich @MichaelFroelich, Michał Górny @mgorny, Nikhil Sharma @nikhilfujitsu, Permanence AI Coder @Permanence-AI-Coder, @raistefintel, Ravi Pushkar @rpushkarr, Renato Barros Arantes @renato-arantes, Romain Biessy @Rbiessy, Ryo Suzuki @Ryo-not-rio, @Shreyas-fuj, Varad Ahirwadkar @varad-ahirwadkar, @vishwascm, and Ye Tao @taoye9. We would also like to thank everyone who asked questions and reported issues.
 

From 0619d9c672096a9072f1b0be82c031061e4a3b02 Mon Sep 17 00:00:00 2001
From: Viktoriia <viktoria.gvozdeva@intel.com>
Date: Wed, 5 Feb 2025 10:01:05 +0100
Subject: [PATCH 28/29] doc: add more information

---
 RELEASE_NOTES.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 349086585eb..03216b2bab8 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -2,6 +2,7 @@
 
 ## Intel Architecture Processors
   * Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX support (formerly Sapphire Rapids and Granite Rapids).
+  * Improved performance of `int8` and `fp32` forward convolution primitive on processors with Intel AVX2 instruction set support.
   * Improved performance of `fp8` matmul primitives with `bf16` and `fp16` bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
   * Improved performance of `int8` RNN primitive on processors with Intel AVX2 and Intel AVX-512 instruction set support.
   * Improved performance of `int8` depthwise separable convolution primitive with per-channel zero points on processors with Intel AVX2 and Intel AVX-512 instruction set support.

From c3676ad3e162a5eb8a1f1af2fb248ad77e41c541 Mon Sep 17 00:00:00 2001
From: "Pirogov, Vadim" <vadim.o.pirogov@intel.com>
Date: Fri, 7 Feb 2025 14:43:49 -0800
Subject: [PATCH 29/29] doc: incorporated generic GPU changes

---
 RELEASE_NOTES.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
index 03216b2bab8..6905b12e638 100644
--- a/RELEASE_NOTES.md
+++ b/RELEASE_NOTES.md
@@ -21,13 +21,14 @@
   * Improved performance of the following subgraphs with Graph API:
     * Scaled Dot-Product Attention (SDPA) with [implicit causal mask].
     * SDPA with [`int8` or `int4` compressed key and value].
-    * Gated-MLP.
+    * Gated MLP.
 
 [implicit causal mask]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa.html#doxid-dev-guide-graph-sdpa
 [`int8` or `int4` compressed key and value]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_sdpa_compressed_kv.html#doxid-dev-guide-graph-sdpa-compressed-kv
 [Gated Multi-Layer Perceptron (Gated MLP)]: https://oneapi-src.github.io/oneDNN/dev_guide_graph_gated_mlp.html#doxid-dev-guide-graph-gated-mlp
 
-## AArch64-based Processors
+## NVIDIA GPUs
+  * Improved matmul performance using cuBLASLt-based implementation.
 
 # Functionality
 
@@ -48,6 +49,10 @@
   * Introduced stochastic rounding support for convolution, matmul and reorder based on Philox counter-based random number generator.
   * Introduced support for strided memory formats in convolution.
 
+## Generic GPU vendor
+  * Introduced support for reduction primitive.
+  * Introduced support for inner product primitive forward propagation.
+
 # Usability
 
 ## Common