Fix for Mask Patch Failure and Quantization Issues in Latest transformers Versions #368

Open: Aintor wants to merge 4 commits into main

Aintor commented Nov 10, 2024

In recent versions of the transformers library (above 4.34.1), the _make_causal_mask function has been removed from the modeling_clip module. Previously, the code that used this function looked like this:

causal_attention_mask = _make_causal_mask(input_shape, hidden_states.dtype, device=hidden_states.device)

However, with recent updates, this call has been replaced by:

causal_attention_mask = _create_4d_causal_attention_mask(input_shape, hidden_states.dtype, device=hidden_states.device)

For more detail, see huggingface/transformers#28305.

This change disrupts the functionality in python_coreml_stable_diffusion/torch2coreml.py, where the following line:

modeling_clip._make_causal_mask = patched_make_causal_mask

no longer has any effect: the assignment still succeeds, but modeling_clip no longer calls _make_causal_mask, so the original mask code runs unpatched. This results in the following error during quantization:

ValueError: Input X contains infinity or a value too large for dtype('float64').
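
To illustrate why the old patch fails silently rather than raising, here is a minimal sketch that uses a stand-in module instead of the real modeling_clip. The names fake_modeling_clip and build_mask are hypothetical and exist only for this demonstration; the point is that assigning to a module attribute nobody reads is accepted by Python without complaint.

```python
import types

# Stand-in for a newer modeling_clip: it only knows the new helper name.
# (Hypothetical module, not the real transformers code.)
fake_modeling_clip = types.ModuleType("fake_modeling_clip")
exec(
    """
def _create_4d_causal_attention_mask(input_shape, dtype, device=None):
    return "original mask"

def build_mask(input_shape):
    # The helper is looked up in this module's globals at call time.
    return _create_4d_causal_attention_mask(input_shape, "float32")
""",
    fake_modeling_clip.__dict__,
)


def patched_make_causal_mask(input_shape, dtype, device=None):
    return "patched mask"


# Old patch: the assignment succeeds, but nothing reads this name.
fake_modeling_clip._make_causal_mask = patched_make_causal_mask
result_old_patch = fake_modeling_clip.build_mask((1, 77))

# Patching the name that is actually called takes effect.
fake_modeling_clip._create_4d_causal_attention_mask = patched_make_causal_mask
result_new_patch = fake_modeling_clip.build_mask((1, 77))

print(result_old_patch)
print(result_new_patch)
```

The first call still returns the original mask, and only the second call picks up the replacement.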

See related issues: #331, #303, #325, #246

This PR addresses the issue by adding a monkey patch to modeling_clip for the _create_4d_causal_attention_mask function, thereby fixing the mask patch failure and restoring compatibility with the --quantize-nbits feature in the latest transformers versions. It also retains the original function override to maintain support for older transformers versions.
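
The approach described above could be sketched roughly as follows. This is not the PR's actual diff, which is not shown in this thread: apply_mask_patch is a hypothetical helper, and the hasattr check is an assumption about how one might patch both names safely. With the real library, the module argument would be transformers.models.clip.modeling_clip and the replacement would be patched_make_causal_mask from torch2coreml.py.

```python
import types

# The two helper names mentioned in this PR description.
MASK_HELPER_NAMES = ("_make_causal_mask", "_create_4d_causal_attention_mask")


def apply_mask_patch(module, replacement, names=MASK_HELPER_NAMES):
    """Replace each helper in `names` that `module` defines.

    Returns the list of names that were patched, and raises if none
    were found, so a future rename fails loudly instead of silently.
    """
    patched = []
    for name in names:
        if hasattr(module, name):
            setattr(module, name, replacement)
            patched.append(name)
    if not patched:
        raise RuntimeError(f"none of {names} found on {module!r}")
    return patched


def replacement_mask(*args, **kwargs):
    return "patched mask"


# Stand-ins for older and newer versions of modeling_clip (hypothetical).
old_style = types.SimpleNamespace(_make_causal_mask=lambda *a, **k: "old")
new_style = types.SimpleNamespace(
    _create_4d_causal_attention_mask=lambda *a, **k: "new"
)

patched_old = apply_mask_patch(old_style, replacement_mask)
patched_new = apply_mask_patch(new_style, replacement_mask)
print(patched_old, patched_new)
```

Raising when no known helper is present is a design choice for this sketch; it would turn the silent failure described in this PR into an immediate error at conversion time.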

Aintor commented Nov 10, 2024

By the way, I have tested this PR in a new conda environment initialized with pip install -e ., and the --quantize-nbits flag works as expected.

Aintor requested a review from aseemw November 12, 2024 02:37

Aintor commented Nov 12, 2024

@aseemw I noticed PR #316 mentions transformers version 4.29.2, where it uses the following line:
causal_attention_mask = self._build_causal_attention_mask(bsz, seq_len, hidden_states.dtype, device=hidden_states.device)
Do you think I should add support for versions below 4.29.2 as well?

Aintor commented Nov 12, 2024

However, I don’t recommend adding support for versions below 4.30.0, as the mask functionality has changed frequently in the modeling_clip module since its creation and remained unstable until version 4.30.0. I believe it would be best to discontinue support for versions earlier than 4.30.0.
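
If a minimum version were enforced, a check along these lines might work. This is only a sketch under my reading of this thread: the 4.30.0 floor and the 4.34.1 cutoff come from the comments above, while parse_version and mask_helper_for are hypothetical names that are not part of this PR or of transformers.

```python
import re

MIN_SUPPORTED = (4, 30, 0)         # floor suggested in this thread
LAST_WITH_OLD_HELPER = (4, 34, 1)  # per the PR description


def parse_version(text):
    """Turn '4.35.0.dev0' into (4, 35, 0), ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def mask_helper_for(version_text):
    """Return the helper name to patch for a given transformers version."""
    version = parse_version(version_text)
    if version < MIN_SUPPORTED:
        raise RuntimeError(
            f"transformers {version_text} is older than the assumed minimum 4.30.0"
        )
    if version <= LAST_WITH_OLD_HELPER:
        return "_make_causal_mask"
    return "_create_4d_causal_attention_mask"


print(mask_helper_for("4.34.1"))
print(mask_helper_for("4.46.2"))
```

In practice the version string would come from transformers.__version__.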

aseemw requested a review from TobyRoseman November 12, 2024 16:43