Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise an error when OVQuantizer is invoked on an compressed model #1122

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

l-bat
Copy link
Contributor

@l-bat l-bat commented Jan 20, 2025

What does this PR do?

For models exceeding 1 billion parameters, optimum-intel applies weight compression by default during the from_pretrained method call. This behavior impacts the functionality of subsequent quantization algorithms because patterns cannot be matched properly, leading to incorrect results (ref: 160308).

This PR introduces a mechanism to detect if the model was compressed by checking runtime information. If a user attempts to quantize a compressed model, an error is raised to prevent incorrect behavior.

Fixes # (issue)
ticket: 160760

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@l-bat
Copy link
Contributor Author

l-bat commented Jan 20, 2025

@AlexKoff88, @nikita-savelyevv please take a look

@@ -315,6 +315,31 @@ def quantize(
else:
raise TypeError(f"Unsupported model type: {type(self.model)}")

def _check_model_state(self, sub_model_names: List[str] = None):
message_template = (
"Couldn't apply optimization to the model because it was already compressed with config: {}. "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Couldn't apply optimization to the model because it was already compressed with config: {}. "
"Cannot apply optimization to the model because it was already optimized with the following config: {}. "

Comment on lines 471 to 445
elif isinstance(self.model, OVModelForVisualCausalLM):
language_model = self.model.language_model
_weight_only_quantization(language_model.model, quantization_config, calibration_dataset, **kwargs)
sub_model_names = ["vision_embeddings", "text_embeddings"] + self.model.additional_parts
self._check_model_state(sub_model_names + ["language_model"])
_weight_only_quantization(language_model.model, quantization_config, calibration_dataset, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we now should obtain the model type and submodel names once in the beginning and then reuse these variables later to choose which type of dataset to collect and how to apply optimization. With this, we could also do the _check_model_state only once.

I can do this myself in a separate follow-up PR though.

Comment on lines 480 to 488
elif isinstance(self.model, OVModelForSeq2SeqLM):
sub_model_names = ["encoder", "decoder"]
if self.model.decoder_with_past is not None:
sub_model_names.append("decoder_with_past")
self._check_model_state(sub_model_names)
sub_models = [getattr(self.model, name) for name in sub_model_names]
for sub_model in sub_models:
_weight_only_quantization(sub_model, quantization_config, **kwargs)
self.model.clear_requests()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the effort, but I believe compression support for seq2seq models should be added separately because we also need to write a dataset collection logic for this type of model and add some tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ticket: 160958

@@ -315,6 +315,31 @@ def quantize(
else:
raise TypeError(f"Unsupported model type: {type(self.model)}")

def _check_model_state(self, sub_model_names: List[str] = None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the name seems to be too general. How about this?

Suggested change
def _check_model_state(self, sub_model_names: List[str] = None):
def _verify_not_optimized(self, sub_model_names: List[str] = None):

Could you please also move it lower to the end of class methods declaration?

Comment on lines 715 to 716
with pytest.raises(RuntimeError):
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also specify a message with which the error should be raised?

Comment on lines 711 to 713
if isinstance(model, OVModelOpenCLIPForZeroShotImageClassification):
with pytest.raises(TypeError):
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it different for OVModelOpenCLIPForZeroShotImageClassification?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quantization is not supported for OVModelOpenCLIPForZeroShotImageClassification. I'll skip it in test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ticket: 161043

@@ -698,6 +698,23 @@ def test_ovmodel_load_with_compressed_weights(self, model_cls, model_type, trust
_, num_weight_nodes = get_num_quantized_nodes(model)
self.assertEqual(expected_ov_int8[i], num_weight_nodes["int8"])

@parameterized.expand(SUPPORTED_ARCHITECTURES_WITH_AUTO_COMPRESSION)
def test_raise_error_WC_over_WC(self, model_cls, model_type, trust_remote_code):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list of models inside SUPPORTED_ARCHITECTURES_WITH_AUTO_COMPRESSION is quite big. My concern is that test time will increase significantly with addition of this test.

We could limit the model scope. Or even better add this logic to the end of the existing test: test_ovmodel_load_with_compressed_weights which already does int8 compression. Running a quantization that will instantly fail over this should not increase testing time much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test time for all cases is 0:03:15 on my machine. I can incorporate this into test_ovmodel_load_with_compressed_weights to minimize the impact on testing time.

def _check_model_state(self, sub_model_names: List[str] = None):
message_template = (
"Couldn't apply optimization to the model because it was already compressed with config: {}. "
"To avoid this issue, set load_in_8bit=False in the from_pretrained method when using the optimum-intel API, "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"To avoid this issue, set load_in_8bit=False in the from_pretrained method when using the optimum-intel API, "
"To avoid this issue, check that you set load_in_8bit=False or not using quantization_config at export in the .from_pretrained(),"

message_template = (
"Couldn't apply optimization to the model because it was already compressed with config: {}. "
"To avoid this issue, set load_in_8bit=False in the from_pretrained method when using the optimum-intel API, "
"or explicitly specify the desired weight format using --weight_format fp16/fp32 for CLI."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"or explicitly specify the desired weight format using --weight_format fp16/fp32 for CLI."
"or explicitly specify weight format with --weight_format fp16/fp32 when using CLI."

@@ -698,6 +698,23 @@ def test_ovmodel_load_with_compressed_weights(self, model_cls, model_type, trust
_, num_weight_nodes = get_num_quantized_nodes(model)
self.assertEqual(expected_ov_int8[i], num_weight_nodes["int8"])

@parameterized.expand(SUPPORTED_ARCHITECTURES_WITH_AUTO_COMPRESSION)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to check that the exception is thrown if we compress quantized or compressed model and if we quantize quantized or compressed model.

@l-bat l-bat force-pushed the lt/handle_quant_after_wc branch from 0420051 to 92facae Compare January 23, 2025 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants