[Feat]: Add support for kleidiai quantization schemes #1447
Conversation
Hello @jerryzh168,
How should we take this input from the user regarding quantization schemes? A group-size parameter alone cannot serve the purpose, since the channel size changes for different matmuls in a model. Currently I am using a "scheme" parameter to differentiate between the two.
Yeah, you can use https://github.com/pytorch/ao/blob/main/torchao/quantization/granularity.py: PerGroup and PerAxis(axis=0) (assuming the channel dimension is 0). Examples: torchao/quantization/quant_api.py line 1069 and tutorials/calibration_flow/static_quant.py line 168 (both at commit 070345d).
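For reference, a minimal sketch of how the two granularities could be expressed with torchao's granularity classes (the quantizer signature that consumes them is whatever this PR settles on):

```python
from torchao.quantization.granularity import PerAxis, PerGroup

# groupwise: one scale per group of 32 elements along the packed dimension
groupwise = PerGroup(32)

# channelwise: one scale per output channel (assuming the channel dimension is 0)
channelwise = PerAxis(axis=0)
```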
Thanks for the inputs @jerryzh168. I have an initial change ready which extends the int4_weight_only quantizer. The 4-bit KleidiAI kernels quantize the weight in torchao but quantize the input to 8 bit within the kernel itself, instead of quantizing the input in torchao the way int8_dynamic_activation_int4_weight does. Currently neither int4_weight_only nor int8_dynamic_activation_int4_weight fully aligns with the way the KleidiAI 4-bit kernels work. I feel int4_weight_only is closest to what we want to do; what are your thoughts on this?
Yeah, int4_weight_only means no input quantization, so I think it aligns better. We also have torchao/experimental/quant_api.py line 485 (at commit 4738377).
You can also check out: #995
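For context, a rough sketch of the two existing flows being compared, using torchao's public config constructors (the group sizes here are placeholders, not the PR's choices):

```python
from torchao.quantization import int4_weight_only, int8_dynamic_activation_int4_weight

# weight-only config: 4-bit weights, no activation quantization in torchao
weight_only_cfg = int4_weight_only(group_size=32)

# dynamic-activation config: torchao quantizes activations to int8 before the matmul;
# the KleidiAI kernels instead quantize the input to 8 bit inside the kernel itself
dyn_act_cfg = int8_dynamic_activation_int4_weight(group_size=32)

# either config would be applied with torchao.quantization.quantize_(model, cfg)
```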
Description:
1. Allow Int4WeightOnlyQuantizer to work with channelwise and groupwise symmetric quantization schemes
2. KleidiAI supports channelwise and 32-groupwise quantized matmul kernels
Signed-off-by: Nikhil Gupta <[email protected]>
…WeightLayout Signed-off-by: Nikhil Gupta <[email protected]>
Force-pushed from 738d7f2 to 358d6b4.
This reverts commit 2a18e60.
Hello @jerryzh168, I am planning to migrate [...]. Can you please review this change, especially the change in [...]? I am also not sure whether the int8_dynamic_activation_intx_weight* quantizer can be accessed by torchchat currently. Do you have an example of how torchchat can pass args like granularity and mapping_type from the torchchat CLI to torchao?
    target: Target

    # Allow bias access via layout
    bias: Optional[torch.Tensor] = None
layout is more of a "type" actually, so why is a bias Tensor passed here?
The corresponding "storage" is the TensorImpl.
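To illustrate the distinction being drawn (all names below are hypothetical, not the PR's actual classes): the layout describes how data is packed, while the TensorImpl owns the packed storage, so a tensor such as bias would normally live on the storage side:

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass(frozen=True)
class ExamplePackedLayout:
    """Hypothetical layout: packing configuration only, no tensor data."""
    group_size: int
    target: str  # e.g. "aten" or "fallback"

class ExampleTensorImpl:
    """Hypothetical storage: holds the packed tensors the layout describes."""
    def __init__(self, packed_weight, scale, zero_point,
                 bias: Optional[torch.Tensor] = None):
        self.packed_weight = packed_weight
        self.scale = scale
        self.zero_point = zero_point
        self.bias = bias
```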
"""Helper function to apply the constructor that quantizes the weight Tensor (with additional kwargs) | ||
to the weight of linear module | ||
""" | ||
|
||
def insert_subclass(lin): | ||
requires_grad = allow_requires_grad and lin.weight.requires_grad | ||
args = [lin.weight] |
nit: I feel putting optional args in kwargs might be better
looks good to me overall, can you add some tests?
I don't think we need to expose these fine-grained args to the torchchat CLI; we just need high-level args like those in https://github.com/pytorch/torchchat/blob/main/torchchat/quant_config/mobile.json. We are also working on migrating torchchat to use the torchao quant API, btw.
@@ -100,6 +110,12 @@ def _pack_weights_native(
        torch.empty(0, group_size, dtype=torch.int8),
    ]

    if TORCH_VERSION_AT_LEAST_2_6 and layout.target == Target.ATEN:
If torch version is not 2.6 but layout.target == aten, then what happens? Should you just assert that it is not supported?
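One possible way to make the unsupported combination explicit, along the lines of that suggestion (using the TORCH_VERSION_AT_LEAST_2_6 flag and Target enum already present in this PR):

```python
if layout.target == Target.ATEN:
    # fail loudly instead of silently falling through to the non-ATEN packing path
    assert TORCH_VERSION_AT_LEAST_2_6, "Target.ATEN requires torch >= 2.6.0"
```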
), "Target.ATEN requires torch >= 2.6.0" | ||
# aten supports bias for kleidiAI but not for default fallback op | ||
if not torch.backends.kleidiai.is_available(): | ||
print("TODO bias == None") |
Should this be an `assert bias == None, ...` instead of the TODO print?
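i.e. something along these lines instead of the print (a sketch using the names already in the diff):

```python
# aten supports bias for KleidiAI, but the default fallback op does not
if not torch.backends.kleidiai.is_available():
    assert bias is None, "bias is only supported when the KleidiAI backend is available"
```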
        return torch.ops.aten._dyn_quant_matmul_4bit(
            input_tensor, packed_weight, group_size, k_, n)

    if input_tensor.dim() == 2:
Do you have this requirement?
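If higher-rank inputs do need to be supported, one common pattern (a sketch, not the PR's code) is to flatten the leading dimensions before calling the 2-D matmul op and restore them afterwards:

```python
# collapse leading dims into a single batch dim, run the 2-D op, then reshape back
orig_shape = input_tensor.shape
out = torch.ops.aten._dyn_quant_matmul_4bit(
    input_tensor.reshape(-1, orig_shape[-1]), packed_weight, group_size, k_, n
)
out = out.reshape(*orig_shape[:-1], n)
```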
_intx_granularity = Union[PerGroup, PerRow]


def int8_dynamic_activation_intx_weight_v2(
I'm currently refactoring the int8_dynamic_activation_intx_weight quantizer to use layout instead of target for the packing format: #1553. I think this should provide more flexibility long term.
@@ -153,7 +169,7 @@ def get_layout(self) -> Layout:
     def get_plain(
         self,
     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[torch.Tensor]]:
-        if self.get_layout().target == Target.FALLBACK:
+        if self.get_layout().target == Target.FALLBACK or self.get_layout().target == Target.ATEN:
             return self.packed_weight, self.scale, self.zero_point
IIUC when using Target.ATEN, self.packed_weight is not the int_data, so I'm not sure get_plain is correct here?
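A sketch of what that concern could translate to: if the ATEN-packed buffer cannot be cheaply unpacked back to plain int data, raising may be safer than returning the packed bytes as if they were plain (names follow the PR's diff):

```python
def get_plain(self):
    if self.get_layout().target == Target.FALLBACK:
        # the fallback path keeps the unpacked int data, so it can be returned directly
        return self.packed_weight, self.scale, self.zero_point
    raise NotImplementedError(
        "get_plain() is not implemented for Target.ATEN packed weights"
    )
```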
_intx_granularity = Union[PerGroup, PerRow]


def int8_dynamic_activation_intx_weight_v2(
Is there a significant difference between this and int8_dynamic_activation_intx_weight?
torchchat does not currently use int8_dynamic_activation_intx_weight, but instead a submodule-swap API here: https://github.com/pytorch/ao/blob/main/torchao/experimental/quant_api.py#L438. We will be switching torchchat to use int8_dynamic_activation_intx_weight instead, but I first need to land some changes for perf/clarity: #1553
I understand that this quant API now connects the kernels we landed in aten with the quant API. If the kernels you landed in aten are actually new ops, unlike int4pack_mm and friends, then why did we land them there in the first place? To reach those kernels you need the ao dependency anyway. (@digantdesai I know you tagged me on that PR, but I never really deep-dived into it, so maybe you have context here.)
Besides that, I have a couple of questions.
- In its current form this only makes the aten op you added available via the tensor subclass API, so what happens to, say, torch.compile (maybe this works?) or the AOTI use case?
- I would also like to see if we can leverage this op in ExecuTorch, for which integration into AO would have been a better choice compared to this being an aten op.
- If Kleidi's op performs better than what's in this repo (and note that @digantdesai has actually integrated some of the Kleidi ops, which I guess you are aware of), then can we just use that op directly, or have a path to Kleidi's impl for the CPU ops that exist under experimental/ops?
Any specific reason why you use the subclass API instead of module swap?
Description:
Needs: pytorch/pytorch#134124