
Composability with sparse and quantization compressors #948

Merged: 21 commits into main from composability-v2 on Jan 22, 2025

Conversation

@rahul-tuli (Collaborator) commented Dec 2, 2024

This PR accomplishes the following:

  • Increases the sparsity threshold to 50%
  • Allows sparse + quantized compression/decompression on the llm-compressor side
  • Adds a test for sparse + quantized compression/decompression

Needs: neuralmagic/compressed-tensors#241
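
For reference, a sparse + quantized compression run on the llm-compressor side might look roughly like the sketch below. This is a minimal, hedged outline: the model ID, dataset, calibration settings, and exact import paths are assumptions (they vary by llm-compressor version), not code from this PR.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot  # import path may differ by version
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example model, not from this PR

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2:4 pruning followed by W8A8 int quantization, matching one of the
# sparse + quantized rows in the table below.
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4", targets=["Linear"]),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",       # example calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=64,
)

# save_compressed=True lets both the quantization and sparsity compressors run at save time
save_dir = "model-2of4-w8a8-compressed"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```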

Choice of compressors for different cases:

| Quantization | Sparsity | Quant Compressor Format | Sparsity Compressor Format |
|---|---|---|---|
| W8A8 - int | None | int_quantized | Dense |
| W8A8 - float | None | float_quantized | Dense |
| W4A16 - int | None | pack_quantized | Dense |
| W8A16 - int | None | pack_quantized | Dense |
| W8A16 - float | None | naive_quantized | Dense |
| W8A8 - int | 2:4 | int_quantized | Sparse24 |
| W8A8 - float | 2:4 | float_quantized | Sparse24 |
| W4A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - float | 2:4 | naive_quantized | Dense |

Explanation

  • Quantization Format: Specifies the quantization compression format used.
  • Compressed Flag: Boolean flag indicating whether the model should be compressed on disk.
  • Sparsity Structure: Type of sparsity structure inferred. Can be structured or unstructured.
  • Sparse Compression Format: Resulting compression format based on the quantization_format, compressed_flag, and sparsity structure.

Notes

  1. If the global sparsity (global_sparsity) is less than 0.05, no compression configuration is returned.
  2. For CompressionFormat.marlin_24, the compression format defaults to dense regardless of sparsity structure.
  3. For other quantization formats:
    • If compressed_flag is True and the sparsity structure is TWO_FOUR, the compression format is sparse_24_bitmask (this is the only sparse compressor supported in vLLM).
    • Otherwise, the compression format defaults to dense.
  4. The SparsityThreshold (default 0.5) is used to determine targets and ignores in the model by evaluating parameter sparsity.
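
To make the rules above concrete, here is a minimal sketch of the decision logic in plain Python. The enum members and function name are illustrative stand-ins, not the actual compressed-tensors / llm-compressor implementation.

```python
from enum import Enum
from typing import Optional


# Hypothetical stand-ins for the real compressed-tensors enums; names are
# illustrative only and may not match the library exactly.
class CompressionFormat(str, Enum):
    marlin_24 = "marlin-24"
    dense = "dense"
    sparse_24_bitmask = "sparse-24-bitmask"


class SparsityStructure(str, Enum):
    TWO_FOUR = "2:4"
    UNSTRUCTURED = "unstructured"


def infer_sparsity_format(
    global_sparsity: float,
    quantization_format: CompressionFormat,
    compressed_flag: bool,
    sparsity_structure: SparsityStructure,
) -> Optional[CompressionFormat]:
    """Illustrative version of the rules in the notes above."""
    # 1. Below 5% global sparsity, no sparsity compression config is returned.
    if global_sparsity < 0.05:
        return None
    # 2. marlin_24 always falls back to dense, regardless of sparsity structure.
    if quantization_format == CompressionFormat.marlin_24:
        return CompressionFormat.dense
    # 3. Compressed on disk + 2:4 structure -> sparse_24_bitmask
    #    (the only sparse compressor supported in vLLM).
    if compressed_flag and sparsity_structure == SparsityStructure.TWO_FOUR:
        return CompressionFormat.sparse_24_bitmask
    # Otherwise, default to dense.
    return CompressionFormat.dense
```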

Additional Information

  • Targets: Parameters selected for sparsification based on the SparsityThreshold.
  • Ignores: Parameters excluded from sparsification based on the SparsityThreshold.
  • These are inferred using the infer_sparse_targets_and_ignores function, considering the sparsity structure and threshold.
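
As a rough illustration (not the actual infer_sparse_targets_and_ignores implementation), splitting modules into targets and ignores by comparing each weight's measured sparsity against the threshold could look like this:

```python
import torch

SPARSITY_THRESHOLD = 0.5  # same default as noted above


def split_targets_and_ignores(model: torch.nn.Module, threshold: float = SPARSITY_THRESHOLD):
    """Toy example: modules whose weight sparsity meets the threshold become
    targets; the rest are ignored. The real helper also accounts for the
    sparsity structure."""
    targets, ignores = [], []
    for name, module in model.named_modules():
        weight = getattr(module, "weight", None)
        if weight is None or not isinstance(weight, torch.Tensor):
            continue
        # Fraction of exactly-zero elements in this weight tensor.
        sparsity = (weight == 0).float().mean().item()
        (targets if sparsity >= threshold else ignores).append(name)
    return targets, ignores
```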


github-actions bot commented Dec 2, 2024

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

@rahul-tuli force-pushed the composability-v2 branch 2 times, most recently from 9249158 to afc0b5f on December 3, 2024 06:16
@rahul-tuli changed the title from "[ DRAFT ] Composability with sparse and quantization compressors" to "Composability with sparse and quantization compressors" on Dec 3, 2024
horheynm previously approved these changes Dec 3, 2024

@horheynm (Collaborator) commented Dec 3, 2024

Verified decompression works for a sparse and quantized model.

@dsikka (Collaborator) left a comment

This looks good. I have a few follow-up questions that I've noted; we can discuss during the work session.

@rahul-tuli (Collaborator, Author)

Add a case to the description where save_compressed is True but sparsity is unstructured, noting which quantized compressor is used.

@dsikka (Collaborator) commented Jan 21, 2025

> W4A16 - float

W4A16 - float is not a relevant scheme. We're also missing W8A16 with no sparsity and W8A16 with 2:4 sparsity.

Please update the table.

@dsikka (Collaborator) left a comment

Ran a ton of cases and everything seems well covered and integrated well with vLLM.

A few small comments.

@rahul-tuli (Collaborator, Author)

> W4A16 - float is not a relevant scheme. We're also missing W8A16 with no sparsity and W8A16 with 2:4 sparsity.
>
> Please update the table.

Done!

Add Table in docstring
Add test for compressor inference
dsikka previously approved these changes Jan 21, 2025
@dsikka (Collaborator) left a comment

Excellent work!

@dsikka (Collaborator) commented Jan 21, 2025

@rahul-tuli infer_quantization_config had a signature change which is now failing tests

dsikka previously approved these changes Jan 21, 2025
@dsikka dsikka requested review from horheynm and kylesayrs January 21, 2025 18:28
```python
tokenizer.save_pretrained(save_dir)
```

> **Note:** This will compress the model using the quantization compressor; however, instead of using the optimal sparsity compressor, the dense sparsity compressor will be used. This affects only how the model is saved on disk and does not change the actual pruning/quantization process.

nit:
dense sparsity compressor is a kind of weird/confusing concept and could use some clarification

@rahul-tuli (Collaborator, Author)

Is this better?

@tlrmchlsmth left a comment

LGTM now once the tests are green

@mgoin (Member) left a comment

Thank you!

@rahul-tuli (Collaborator, Author)

The failing tests are unrelated to this PR and also fail on main.

@mgoin merged commit f46d140 into main Jan 22, 2025
6 of 7 checks passed
@mgoin deleted the composability-v2 branch January 22, 2025 16:56
dsikka pushed a commit that referenced this pull request Jan 22, 2025
This PR addresses two key updates:

1. **Test Update**: In [PR #948](#948), a flag name was updated during the review process, but the update wasn't reflected in the relevant test. This PR propagates the updated flag name to the associated test.

2. **Sparsity Threshold Adjustment**: As requested in [PR #948](#948), the sparsity threshold has been reduced to `0.49`.
Labels: ready