[Feat]: Add support for kleidiai quantization schemes #1447

ng-05 · 2024-12-19T10:44:27Z

Description:

Allow Int4WeightOnlyQuantizer to work with channelwise and groupwise symmetric quantization schemes.
KleidiAI supports channelwise and groupwise (group size 32) quantized matmul kernels.

Needs : pytorch/pytorch#134124

pytorch-bot · 2024-12-19T10:44:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1447

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-12-19T10:44:33Z

Hi @ng-05!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

torchao/quantization/GPTQ.py

ng-05 · 2025-01-08T12:14:06Z

Hello @jerryzh168 ,
We want to support two different types of int4 schemes:

symmetric_groupwise -> groupsize [32, 64, 128, etc.]
symmetric_channelwise -> groupsize is equal to the channel size of the matmul weights

How should we take this input from the user regarding the quantization scheme? The groupsize parameter cannot serve this purpose, because the channel size changes across the different matmuls in a model.

Currently I am using a "scheme" parameter to differentiate between the two.
aarch64_cpu_channelwise.json
aarch64_cpu_groupwise.json
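The difference between the two schemes can be illustrated with a minimal, framework-free sketch. The function name `quantize_symmetric_int4` and the mapping of the largest magnitude onto 7 are assumptions made for illustration; this is not the torchao or KleidiAI implementation.

```python
from typing import List, Tuple


def quantize_symmetric_int4(row: List[float], group_size: int) -> Tuple[List[int], List[float]]:
    """Quantize one weight row to signed 4-bit values, one scale per group.

    Hypothetical helper: symmetric means the zero point is fixed at 0.
    """
    if len(row) % group_size != 0:
        raise ValueError("row length must be divisible by group_size")
    qvals: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Assumption: map [-max_abs, max_abs] onto [-7, 7]; clamp to the int4 range.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(v / scale))) for v in group)
    return qvals, scales


row = [(i - 32) / 10 for i in range(64)]
# Groupwise: a fixed group size, so several scales per row.
_, groupwise_scales = quantize_symmetric_int4(row, group_size=32)
# Channelwise: the group spans the whole row, so the size depends on the layer.
_, channelwise_scales = quantize_symmetric_int4(row, group_size=len(row))
print(len(groupwise_scales), len(channelwise_scales))  # 2 scales vs. 1 scale
```

Because the channelwise group size is simply the row length, it differs per layer, which is why a single integer groupsize argument cannot describe it for a whole model.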

jerryzh168 · 2025-01-08T17:44:42Z

How should we take this input from the user regarding the quantization scheme? The groupsize parameter cannot serve this purpose, because the channel size changes across the different matmuls in a model.

yeah, you can use https://github.com/pytorch/ao/blob/main/torchao/quantization/granularity.py: PerGroup and PerAxis(axis=0) (assuming the channel dimension is 0). Examples:

ao/torchao/quantization/quant_api.py, line 1069 in 070345d:

granularity: Optional[

ao/tutorials/calibration_flow/static_quant.py, line 168 in 070345d:

weight_obs = AffineQuantizedMinMaxObserver(mapping_type, target_dtype, granularity_type=PerAxis(axis=0), eps=torch.finfo(torch.float32).eps, scale_dtype=torch.float32, zero_point_dtype=torch.float32)
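As a rough sketch of how a granularity object could replace a fixed groupsize, the snippet below resolves an effective group size per layer. The `PerGroup` and `PerAxis` classes here are local stand-ins that only mirror the names mentioned above, and `resolve_group_size` is a hypothetical helper, not a torchao function.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class PerGroup:
    """Stand-in: one scale per `group_size` consecutive input features."""
    group_size: int


@dataclass(frozen=True)
class PerAxis:
    """Stand-in: one scale per slice along `axis` (axis=0 means per output channel)."""
    axis: int


def resolve_group_size(granularity: Union[PerGroup, PerAxis], weight_shape: Tuple[int, int]) -> int:
    """Return the effective group size for a (out_features, in_features) weight."""
    _, in_features = weight_shape
    if isinstance(granularity, PerGroup):
        if in_features % granularity.group_size != 0:
            raise ValueError("in_features must be divisible by group_size")
        return granularity.group_size
    if isinstance(granularity, PerAxis):
        if granularity.axis != 0:
            raise ValueError("this sketch only handles axis=0")
        # Channelwise: the group covers the full input dimension of this layer.
        return in_features
    raise TypeError(f"unsupported granularity: {granularity!r}")


print(resolve_group_size(PerGroup(32), (4096, 11008)))  # 32
print(resolve_group_size(PerAxis(axis=0), (4096, 11008)))  # 11008
```

The user states the intent once, and the per-layer size is derived from each weight's shape.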

ng-05 · 2025-01-09T02:00:37Z

Thanks for the inputs @jerryzh168.

I have an initial change ready which extends the int4_weight_only quantizer.

The 4-bit KleidiAI kernels take weights that were quantized in torchao and quantize the input to 8 bits inside the kernel itself, instead of quantizing the input in torchao the way int8_dynamic_activation_int4_weight does.
For this reason I am extending the int4_weight_only API. I am slightly confused, though: is the intention of this API to convey to the user that there is NO input quantization?

Currently neither int4_weight_only nor int8_dynamic_activation_int4_weight fully aligns with the way the KleidiAI 4-bit kernels work.

I feel int4_weight_only is closest to what we want to do. What are your thoughts on this?
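To make the distinction concrete, here is a minimal reference sketch of what "quantize the input to 8 bits inside the kernel" could look like numerically. Both function names are hypothetical, and the symmetric per-row activation quantization is a simplifying assumption; the real kernels may use a different scheme.

```python
from typing import List, Tuple


def quantize_activation_int8(x: List[float]) -> Tuple[List[int], float]:
    """Dynamically quantize one activation row to int8 (assumed symmetric)."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(v / scale))) for v in x], scale


def int8_act_int4_weight_dot(x: List[float], w_q: List[int], w_scale: float) -> float:
    """Dot product of a float activation row with pre-quantized int4 weights.

    The caller passes float activations; quantization happens here, which is
    the behaviour being described for the kernel.
    """
    x_q, x_scale = quantize_activation_int8(x)
    acc = sum(a * b for a, b in zip(x_q, w_q))  # integer accumulation
    return acc * x_scale * w_scale  # rescale back to float


x = [1.0, -2.0, 0.5, 0.25]
w_q = [7, -7, 3, 0]  # int4 weights, already quantized ahead of time
print(int8_act_int4_weight_dot(x, w_q, w_scale=0.1))  # close to the float result 2.25
```

From the caller's side this looks weight-only, since activations go in as floats, while numerically it behaves like dynamic activation quantization.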

jerryzh168 · 2025-01-09T02:33:56Z

I feel int4_weight_only is closest to what we want to do, what are your thoughts on this?

yeah, int4_weight_only means no input quantization, so I think your use case aligns better with int8_dynamic_activation_int4_weight; you can use a different layout and customize the logic for input quantization.

we also have

ao/torchao/experimental/quant_api.py, line 485 in 4738377:

def int8_dynamic_activation_intx_weight(

which is the same as your use case. There are some ongoing refactors/updates there as well right now.

You can also check out: #995

ng-05 · 2025-01-11T01:43:08Z

Hello @jerryzh168, I am planning to migrate the int8_dynamic_activation_intx_weight API to int8_dynamic_activation_intx_weight_v2.
For now I have kept the API separate for review and testing.

Can you please review this change, especially the change in _get_linear_subclass_inserter that allows bias propagation? The bias is needed by torch.ops.aten._dyn_quant_pack_4bit_weight.

I am also not sure whether the int8_dynamic_activation_intx_weight* quantizer can currently be accessed by torchchat. Do you have an example of how torchchat can pass args like granularity and mapping_type from the torchchat CLI to torchao?

jerryzh168 · 2025-01-11T02:31:43Z

torchao/experimental/_linear_8bit_act_xbit_weight_layout.py

    target: Target

+    # Allow bias access via layout
+    bias: Optional[torch.Tensor] = None


layout is more of a "type" actually, why is bias Tensor passed here?

the corresponding "storage" is TensorImpl
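One way to picture the "type" versus "storage" split is the sketch below. `PackedLayout` and `PackedImpl` are hypothetical stand-ins rather than torchao classes, and plain lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PackedLayout:
    """Hypothetical layout: describes *how* data is packed, holds no tensor data."""
    target: str
    group_size: int


@dataclass
class PackedImpl:
    """Hypothetical storage object: owns the per-layer data, including any bias."""
    layout: PackedLayout
    packed_weight: List[int]
    bias: Optional[List[float]] = None


# One layout value can be shared by many layers ...
layout = PackedLayout(target="aten", group_size=32)
# ... while each layer keeps its own weight and bias in its storage object.
layer_a = PackedImpl(layout, packed_weight=[1, 2, 3], bias=[0.5])
layer_b = PackedImpl(layout, packed_weight=[4, 5, 6], bias=None)
print(layer_a.layout == layer_b.layout)  # True
```

Keeping the layout free of per-layer data lets it be compared, hashed, and reused across layers, which is harder once a bias tensor lives inside it.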

jerryzh168 · 2025-01-11T02:34:53Z

torchao/quantization/quant_api.py

    """Helper function to apply the constructor that quantizes the weight Tensor (with additional kwargs)
    to the weight of linear module
    """

    def insert_subclass(lin):
        requires_grad = allow_requires_grad and lin.weight.requires_grad
+        args = [lin.weight]


nit: I feel putting optional args in kwargs might be better
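A minimal sketch of the kwargs-based variant is shown here. `make_subclass_inserter`, `propagate_bias`, and `fake_quantize` are hypothetical names used only for illustration, and a `SimpleNamespace` stands in for an `nn.Linear` module.

```python
from types import SimpleNamespace
from typing import Any, Callable


def make_subclass_inserter(constructor: Callable[..., Any], propagate_bias: bool = False, **kwargs: Any):
    """Build a function that replaces lin.weight with constructor(weight, **kwargs)."""

    def insert_subclass(lin):
        call_kwargs = dict(kwargs)
        if propagate_bias:
            # The optional argument travels as a keyword, so constructors that
            # do not need a bias keep their positional signature unchanged.
            call_kwargs["bias"] = lin.bias
        lin.weight = constructor(lin.weight, **call_kwargs)
        return lin

    return insert_subclass


def fake_quantize(weight, bias=None, group_size=32):
    """Placeholder for a real quantizing constructor."""
    return {"weight": weight, "bias": bias, "group_size": group_size}


lin = SimpleNamespace(weight=[1.0, 2.0], bias=[0.5])
inserter = make_subclass_inserter(fake_quantize, propagate_bias=True, group_size=32)
print(inserter(lin).weight["bias"])  # [0.5]
```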

jerryzh168

looks good to me overall, can you add some tests?

jerryzh168 · 2025-01-11T02:37:36Z

I am also not sure whether the int8_dynamic_activation_intx_weight* quantizer can currently be accessed by torchchat. Do you have an example of how torchchat can pass args like granularity and mapping_type from the torchchat CLI to torchao?

I don't think we need to expose these fine grained args to torchchat cli, we just need these high level args like: https://github.com/pytorch/torchchat/blob/main/torchchat/quant_config/mobile.json

we are also working on migrating torchchat to use torchao quant api btw

kimishpatel · 2025-01-11T03:57:34Z

torchao/experimental/_linear_8bit_act_xbit_weight_layout.py

@@ -100,6 +110,12 @@ def _pack_weights_native(
            torch.empty(0, group_size, dtype=torch.int8),
        ]

+    if TORCH_VERSION_AT_LEAST_2_6 and layout.target == Target.ATEN:


If torch version is not 2.6 but layout.target == aten, then what happens? Should you just assert that it is not supported?
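The explicit failure being asked for could look roughly like this. `should_use_aten_packing` is a hypothetical helper, and the string target and boolean flag stand in for `layout.target` and `TORCH_VERSION_AT_LEAST_2_6`.

```python
def should_use_aten_packing(target: str, torch_at_least_2_6: bool) -> bool:
    """Return True when the ATen packing path applies; fail loudly if unsupported."""
    if target != "aten":
        return False
    if not torch_at_least_2_6:
        # Raise instead of silently falling through to another packing path.
        raise NotImplementedError("Target.ATEN requires torch >= 2.6.0")
    return True


print(should_use_aten_packing("aten", torch_at_least_2_6=True))  # True
print(should_use_aten_packing("native", torch_at_least_2_6=False))  # False
```

Splitting the combined condition this way keeps an old torch version from silently selecting a different code path than the one the user requested.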

kimishpatel · 2025-01-11T04:06:11Z

torchao/experimental/_linear_8bit_act_xbit_weight_layout.py

+    ), "Target.ATEN requires torch >= 2.6.0"
+    # aten supports bias for kleidiAI but not for default fallback op
+    if not torch.backends.kleidiai.is_available():
+        print("TODO bias == None")


assert bias == None,
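A sketch of the suggested assertion in place of the TODO print is given below. `check_bias_supported` is a hypothetical helper, and the boolean argument stands in for `torch.backends.kleidiai.is_available()`.

```python
from typing import Optional, Sequence


def check_bias_supported(bias: Optional[Sequence[float]], kleidiai_available: bool) -> None:
    """Reject a bias when only the default fallback op is available."""
    if not kleidiai_available:
        # Per the diff comment, the fallback op does not support a bias.
        assert bias is None, "bias is only supported when KleidiAI is available"


check_bias_supported(None, kleidiai_available=False)   # passes
check_bias_supported([0.5], kleidiai_available=True)   # passes
```

Using `is None` instead of `== None` avoids elementwise comparison semantics if `bias` is a tensor.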

kimishpatel · 2025-01-11T04:06:28Z

torchao/experimental/_linear_8bit_act_xbit_weight_layout.py

+        return torch.ops.aten._dyn_quant_matmul_4bit(
+            input_tensor, packed_weight, group_size, k_, n)
+
+    if input_tensor.dim() == 2:


Do you have this requirement?
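If the op does turn out to need 2-D inputs, one common way to handle higher-rank activations is to flatten the leading dimensions before the matmul and restore them afterwards. The shape arithmetic is sketched here with a hypothetical helper `matmul_shapes`; whether the op actually needs this is the open question above.

```python
import math
from typing import Tuple


def matmul_shapes(input_shape: Tuple[int, ...], n: int) -> Tuple[Tuple[int, int], Tuple[int, ...]]:
    """Return (2-D shape to feed the matmul, shape to restore on the output).

    `n` is the number of output features; the last input dim is the
    reduction dimension k.
    """
    *lead, k = input_shape
    rows = math.prod(lead)  # product of leading dims (1 if there are none)
    return (rows, k), (*lead, n)


print(matmul_shapes((2, 3, 64), n=16))  # ((6, 64), (2, 3, 16))
```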

metascroy · 2025-01-11T04:15:04Z

torchao/experimental/quant_api.py

+_intx_granularity = Union[PerGroup, PerRow]
+
+
+def int8_dynamic_activation_intx_weight_v2(


I'm currently refactoring int8_dynamic_activation_intx_weight quantizer to use layout instead of target for the packing format: #1553. I think this should provide more flexibility longterm.

metascroy · 2025-01-11T04:18:20Z

torchao/experimental/_linear_8bit_act_xbit_weight_layout.py

@@ -153,7 +169,7 @@ def get_layout(self) -> Layout:
    def get_plain(
        self,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]:
-        if self.get_layout().target == Target.FALLBACK:
+        if self.get_layout().target == Target.FALLBACK or self.get_layout().target == Target.ATEN:
            return self.packed_weight, self.scale, self.zero_point


IIUC when using Target.ATEN, self.packed_weight is not the int_data, so I'm not sure get_plain is correct here?
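One conservative option, sketched below, is to return plain data only for targets whose stored weight really is the unpacked int data and to raise otherwise. The `Target` enum here is a local stand-in with just the two members named in the diff, and this free-function `get_plain` is hypothetical.

```python
from enum import Enum, auto
from typing import Any, Tuple


class Target(Enum):
    """Stand-in enum with only the members that appear in the diff above."""
    FALLBACK = auto()
    ATEN = auto()


def get_plain(target: Target, packed_weight: Any, scale: Any, zero_point: Any) -> Tuple[Any, Any, Any]:
    """Return (int_data, scale, zero_point) only when the weight is stored unpacked."""
    if target is Target.FALLBACK:
        return packed_weight, scale, zero_point
    # For a packed format the stored buffer is not int_data, so returning it
    # here would hand callers bytes they cannot interpret.
    raise NotImplementedError(f"get_plain is not supported for {target.name}")


print(get_plain(Target.FALLBACK, [1, -2], [0.1], [0]))  # ([1, -2], [0.1], [0])
```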

kimishpatel · 2025-01-11T04:22:39Z

torchao/experimental/quant_api.py

+_intx_granularity = Union[PerGroup, PerRow]
+
+
+def int8_dynamic_activation_intx_weight_v2(


is there a significant difference between this and int8_dynamic_activation_intx_weight?

metascroy · 2025-01-11T04:24:33Z

Hello @jerryzh168, I am planning to migrate the int8_dynamic_activation_intx_weight API to int8_dynamic_activation_intx_weight_v2. For now I have kept the API separate for review and testing.

Can you please review this change, especially the change in _get_linear_subclass_inserter that allows bias propagation? The bias is needed by torch.ops.aten._dyn_quant_pack_4bit_weight.

I am also not sure whether the int8_dynamic_activation_intx_weight* quantizer can currently be accessed by torchchat. Do you have an example of how torchchat can pass args like granularity and mapping_type from the torchchat CLI to torchao?

torchchat does not currently use int8_dynamic_activation_intx_weight, but instead a submodule swap API here: https://github.com/pytorch/ao/blob/main/torchao/experimental/quant_api.py#L438

We will be switching torchchat to use int8_dynamic_activation_intx_weight instead, but I first need to land some changes for perf/clarity: #1553

kimishpatel

I understand that this quant API now connects the kernels we landed in aten with the quant API. If the kernels you guys landed in aten are actually new ops, unlike int4pack_mm and friends, then why did we land them there in the first place? In order to reach those kernels you need the ao dep anyway. (@digantdesai I know you tagged me on that PR, but I never really dove deep into it, so maybe you have context here.)

Besides that, I have a couple of questions.

1. In the current form it only makes the aten op you guys added available via the tensor subclass API, so what happens to, say, torch.compile (maybe this works?) or the AOTI use case?
2. I would also like to see if we can leverage this op in executorch, for which integration into AO would have been a better choice than making this an aten op.
3. If kleidi's op performs better than what's in this repo (and note that @digantdesai has actually integrated some of the kleidi ops, which I guess you guys are aware of), then can we just use that op directly, or have a path to kleidi's impl for the CPU ops that exist under experimental/ops?

kimishpatel · 2025-01-11T04:28:57Z

Hello @jerryzh168, I am planning to migrate the int8_dynamic_activation_intx_weight API to int8_dynamic_activation_intx_weight_v2. For now I have kept the API separate for review and testing.
Can you please review this change, especially the change in _get_linear_subclass_inserter that allows bias propagation? The bias is needed by torch.ops.aten._dyn_quant_pack_4bit_weight.
I am also not sure whether the int8_dynamic_activation_intx_weight* quantizer can currently be accessed by torchchat. Do you have an example of how torchchat can pass args like granularity and mapping_type from the torchchat CLI to torchao?

torchchat does not currently use int8_dynamic_activation_intx_weight, but instead a submodule swap API here: main/torchao/experimental/quant_api.py#L438

We will be switching torchchat to use int8_dynamic_activation_intx_weight instead, but I first need to land some changes for perf/clarity: #1553

Any specific reason to use the subclass API instead of module swap?

ng-05 marked this pull request as draft December 19, 2024 10:44

jerryzh168 reviewed Dec 20, 2024 (torchao/quantization/GPTQ.py, outdated)

ng-05 added 2 commits January 11, 2025 01:32

[Feat]: Enable dyn_quant_pack_4bit aten kernels via Linear8BitActXBitWeightLayout (358d6b4)
Signed-off-by: Nikhil Gupta <[email protected]>

ng-05 force-pushed the kai_integration branch from 738d7f2 to 358d6b4 January 11, 2025 01:36

Revert "[Feat]: Add support for kleidiai quantization schemes" (f470466)
This reverts commit 2a18e60.

jerryzh168 reviewed Jan 11, 2025

jerryzh168 approved these changes Jan 11, 2025

jerryzh168 requested review from kimishpatel, metascroy and digantdesai January 11, 2025 02:42

kimishpatel reviewed Jan 11, 2025

metascroy reviewed Jan 11, 2025

kimishpatel reviewed Jan 11, 2025

kimishpatel requested changes Jan 11, 2025
