[Experimental] Add Kleidi i8mm gemm kernels #1295
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1295
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEVs
There are 1 currently active SEVs. If your PR is affected, please view them below:
✅ No Failures
As of commit c0ce311 with merge base 26648c2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@@ -0,0 +1,120 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
is this entire file codegenable? not for now but just asking
Yeah good idea, not that much work TBH.
...al/kernels/cpu/aarch64/kleidi/kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.h
mainly some nits and few questions
Add kernel level tests, with basic cross compilation support.

Tested with S24 + r26c

```
[----------] 6 tests from test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.k_eq_gs_32
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.k_eq_gs_32 (0 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.large_k_n_gs32
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.large_k_n_gs32 (79 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.even_n_gs32
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.even_n_gs32 (28 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.k_eq_gs128
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.k_eq_gs128 (3 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.clamp_k_eq_gs128
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.clamp_k_eq_gs128 (3 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.m_clamp_k_eq_gs128
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm.m_clamp_k_eq_gs128 (5 ms)
[----------] 6 tests from test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_8x4x32_neon_i8mm (121 ms total)

[----------] 6 tests from test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.k_eq_gs_32
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.k_eq_gs_32 (0 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.large_k_n_gs32
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.large_k_n_gs32 (79 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.even_n_gs32
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.even_n_gs32 (28 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.k_eq_gs128
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.k_eq_gs128 (3 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.clamp_k_eq_gs128
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.clamp_k_eq_gs128 (3 ms)
[ RUN      ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.m_clamp_k_eq_gs128
[       OK ] test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm.m_clamp_k_eq_gs128 (5 ms)
[----------] 6 tests from test_kai_matmul_clamp_f32_qai8dxp4x8_qsi4c32p8x8_4x8x32_neon_i8mm (121 ms total)
```
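For readers unfamiliar with what these kernel-level tests exercise, here is a rough scalar sketch of the computation the test names point at: a matmul with int8 activations quantized per row, int4 weights quantized symmetrically in groups along k (the `gs32` / `gs128` cases), and a clamp on the f32 output. This is an assumption-laden illustration, not the code in this PR and not the KleidiAI API. The function name `ref_matmul_clamp`, the unpacked row-major layouts, and storing one int4 value per `int8_t` are all hypothetical; the real kernels work on packed buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference (not the torchao or KleidiAI API).
// lhs_q:     m x k int8 activations, row-major, with one scale and one
//            zero point per row (asymmetric, per-row quantization).
// rhs_q:     n x k int4 weights, row-major, one value per int8_t in [-8, 7],
//            with one scale per group of `gs` values along k (symmetric).
// Returns an m x n row-major f32 matrix clamped to [clamp_min, clamp_max].
std::vector<float> ref_matmul_clamp(
    int m, int n, int k, int gs,
    const std::vector<int8_t>& lhs_q,
    const std::vector<float>& lhs_scale,
    const std::vector<int32_t>& lhs_zero,
    const std::vector<int8_t>& rhs_q,
    const std::vector<float>& rhs_scale,
    float clamp_min, float clamp_max) {
  assert(gs > 0 && k % gs == 0);
  const int groups = k / gs;
  std::vector<float> out(static_cast<size_t>(m) * n);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (int g = 0; g < groups; ++g) {
        // Integer accumulation within one quantization group.
        int32_t group_acc = 0;
        for (int kk = 0; kk < gs; ++kk) {
          const int idx = g * gs + kk;
          const int32_t a = static_cast<int32_t>(lhs_q[i * k + idx]) - lhs_zero[i];
          const int32_t w = static_cast<int32_t>(rhs_q[j * k + idx]);
          group_acc += a * w;
        }
        // Apply the per-group weight scale once per group.
        acc += static_cast<float>(group_acc) * rhs_scale[j * groups + g];
      }
      // Apply the per-row activation scale, then clamp the output.
      const float v = acc * lhs_scale[i];
      out[i * n + j] = std::min(std::max(v, clamp_min), clamp_max);
    }
  }
  return out;
}
```

A kernel-level test could compare the optimized kernel's output against a reference like this within a tolerance, with `k == gs` giving a single group per row and a tight clamp range forcing the clamp path.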
5f19b69 to 8504177
8504177 to c0ce311
Removed operator level test, landing this PR without those. I will debug and add them in another PR as originally planned. Thanks.
@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
* Update git ignore
* [experimental] Add Kleidi compile def at the top level
* [Experimental] Add Kleidi i8mm gemm kernels

  Add kernel level tests, with basic cross compilation support. Tested with S24 + r26c
* [Exeprimental] Kleidi: rename arg name for packing functions
* [Experimental] Change kernel cmake_out dir to avoid conflict
* Revert Generate Behavior for non-Flamingo Models
* Simplify