
Composability with sparse and quantization compressors #948

Merged: 21 commits into main from composability-v2 on Jan 22, 2025

Conversation

@rahul-tuli (Collaborator) commented Dec 2, 2024

This PR accomplishes the following:

  • Increases the sparsity threshold to 50%
  • Allows sparse + quantized compression/decompression on the llm-compressor side
  • Adds a test for sparse + quantized compression/decompression

Needs: neuralmagic/compressed-tensors#241
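
For reference, a sparse + quantized compression run on the llm-compressor side might look roughly like the sketch below. This is a minimal, hedged outline: the model ID, dataset, calibration settings, and exact import paths are assumptions (they vary by llm-compressor version), not code from this PR.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot  # import path may differ by version
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example model, not from this PR

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2:4 pruning followed by W8A8 int quantization, matching one of the
# sparse + quantized rows in the table below.
recipe = [
    SparseGPTModifier(sparsity=0.5, mask_structure="2:4", targets=["Linear"]),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",       # example calibration dataset
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=64,
)

# save_compressed=True lets both the quantization and sparsity compressors run at save time
save_dir = "model-2of4-w8a8-compressed"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```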

Choice of compressors for different cases:

| Quantization | Sparsity | Quant Compressor Format | Sparsity Compressor Format |
|---|---|---|---|
| W8A8 - int | None | int_quantized | Dense |
| W8A8 - float | None | float_quantized | Dense |
| W4A16 - int | None | pack_quantized | Dense |
| W8A16 - int | None | pack_quantized | Dense |
| W8A16 - float | None | naive_quantized | Dense |
| W8A8 - int | 2:4 | int_quantized | Sparse24 |
| W8A8 - float | 2:4 | float_quantized | Sparse24 |
| W4A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - float | 2:4 | naive_quantized | Dense |

Explanation

  • Quantization Format: Specifies the quantization compression format used.
  • Compressed Flag: Boolean flag indicating whether the model should be compressed on disk.
  • Sparsity Structure: Type of sparsity structure inferred. Can be structured or unstructured.
  • Sparse Compression Format: Resulting compression format based on the quantization_format, compressed_flag, and sparsity structure.

Notes

  1. If the global sparsity (global_sparsity) is less than 0.05, no compression configuration is returned.
  2. For CompressionFormat.marlin_24, the compression format defaults to dense regardless of sparsity structure.
  3. For other quantization formats:
    • If compressed_flag is True and the sparsity structure is TWO_FOUR, the compression format is sparse_24_bitmask (this is the only sparse compressor supported in vLLM).
    • Otherwise, the compression format defaults to dense.
  4. The SparsityThreshold (default 0.5) is used to determine targets and ignores in the model by evaluating parameter sparsity.
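
To make the rules above concrete, here is a minimal sketch of the decision logic in plain Python. The enum members and function name are illustrative stand-ins, not the actual compressed-tensors / llm-compressor implementation.

```python
from enum import Enum
from typing import Optional


# Hypothetical stand-ins for the real compressed-tensors enums; names are
# illustrative only and may not match the library exactly.
class CompressionFormat(str, Enum):
    marlin_24 = "marlin-24"
    dense = "dense"
    sparse_24_bitmask = "sparse-24-bitmask"


class SparsityStructure(str, Enum):
    TWO_FOUR = "2:4"
    UNSTRUCTURED = "unstructured"


def infer_sparsity_format(
    global_sparsity: float,
    quantization_format: CompressionFormat,
    compressed_flag: bool,
    sparsity_structure: SparsityStructure,
) -> Optional[CompressionFormat]:
    """Illustrative version of the rules in the notes above."""
    # 1. Below 5% global sparsity, no sparsity compression config is returned.
    if global_sparsity < 0.05:
        return None
    # 2. marlin_24 always falls back to dense, regardless of sparsity structure.
    if quantization_format == CompressionFormat.marlin_24:
        return CompressionFormat.dense
    # 3. Compressed on disk + 2:4 structure -> sparse_24_bitmask
    #    (the only sparse compressor supported in vLLM).
    if compressed_flag and sparsity_structure == SparsityStructure.TWO_FOUR:
        return CompressionFormat.sparse_24_bitmask
    # Otherwise, default to dense.
    return CompressionFormat.dense
```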

Additional Information

  • Targets: Parameters selected for sparsification based on the SparsityThreshold.
  • Ignores: Parameters excluded from sparsification based on the SparsityThreshold.
  • These are inferred using the infer_sparse_targets_and_ignores function, considering the sparsity structure and threshold.
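
As a rough illustration (not the actual infer_sparse_targets_and_ignores implementation), splitting modules into targets and ignores by comparing each weight's measured sparsity against the threshold could look like this:

```python
import torch

SPARSITY_THRESHOLD = 0.5  # same default as noted above


def split_targets_and_ignores(model: torch.nn.Module, threshold: float = SPARSITY_THRESHOLD):
    """Toy example: modules whose weight sparsity meets the threshold become
    targets; the rest are ignored. The real helper also accounts for the
    sparsity structure."""
    targets, ignores = [], []
    for name, module in model.named_modules():
        weight = getattr(module, "weight", None)
        if weight is None or not isinstance(weight, torch.Tensor):
            continue
        # Fraction of exactly-zero elements in this weight tensor.
        sparsity = (weight == 0).float().mean().item()
        (targets if sparsity >= threshold else ignores).append(name)
    return targets, ignores
```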


github-actions bot commented Dec 2, 2024

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

@rahul-tuli force-pushed the composability-v2 branch 2 times, most recently from 9249158 to afc0b5f on December 3, 2024 06:16
@rahul-tuli changed the title from "[ DRAFT ] Composability with sparse and quantization compressors" to "Composability with sparse and quantization compressors" on Dec 3, 2024
horheynm previously approved these changes Dec 3, 2024

@horheynm (Collaborator) commented Dec 3, 2024

Verified decompression works for a sparse and quantized model.

@dsikka (Collaborator) left a comment

This looks good. I have a few follow-up questions that I've noted; we can discuss during the work session.

@rahul-tuli (Collaborator, Author)

Add a case to the description where save_compressed is True but sparsity is unstructured, noting which quantized compressor is used.

@dsikka (Collaborator) commented Jan 21, 2025

> W4A16 - float

W4A16 - float is not a relevant scheme. We're also missing W8A16 with no sparsity and W8A16 with 2:4 sparsity.

Please update the table.

@dsikka (Collaborator) left a comment

Ran a ton of cases and everything seems well covered and integrated well with vLLM.

A few small comments.

@rahul-tuli (Collaborator, Author)

> W4A16 - float is not a relevant scheme. We're also missing W8A16 with no sparsity and W8A16 with 2:4 sparsity.
>
> Please update the table.

Done!

Add Table in docstring
Add test for compressor inference
dsikka previously approved these changes Jan 21, 2025
@dsikka (Collaborator) left a comment

Excellent work!

@dsikka (Collaborator) commented Jan 21, 2025

@rahul-tuli infer_quantization_config had a signature change which is now failing tests

dsikka previously approved these changes Jan 21, 2025
@dsikka dsikka requested review from horheynm and kylesayrs January 21, 2025 18:28
```python
tokenizer.save_pretrained(save_dir)
```

> **Note:** This will compress the model using the quantization compressor; however, instead of using the optimal sparsity compressor, the dense sparsity compressor will be used. This affects only how the model is saved on disk and does not change the actual pruning/quantization process.

nit:
dense sparsity compressor is a kind of weird/confusing concept and could use some clarification

@rahul-tuli (Collaborator, Author)

Is this better?

@tlrmchlsmth left a comment

LGTM now once the tests are green

@mgoin (Member) left a comment

Thank you!

@rahul-tuli (Collaborator, Author)

The failing tests are unrelated to this PR and also fail on main.

@mgoin merged commit f46d140 into main Jan 22, 2025
6 of 7 checks passed
@mgoin deleted the composability-v2 branch January 22, 2025 16:56
dsikka pushed a commit that referenced this pull request Jan 22, 2025
This PR addresses two key updates:

1. **Test Update**: In [PR #948](#948), a flag name was updated during the review process, but the update wasn't reflected in the relevant test. This PR propagates the updated flag name to the associated test.

2. **Sparsity Threshold Adjustment**: As requested in [PR #948](#948), the sparsity threshold has been reduced to `0.49`.
Labels: ready