Composability with sparse and quantization compressors (#948)
This PR accomplishes the following:

- Increases the sparsity threshold to 50%
- Allows sparse + quantized compression/decompression on the llm-compressor side
- Adds a test for sparse + quantized compression/decompression

Needs: neuralmagic/compressed-tensors#241

## Choice of compressor for different cases

| Quantization | Sparsity | Quant Compressor Format | Sparsity Compressor Format |
|---------------|----------|-------------------------|----------------------------|
| W8A8 - int | None | int_quantized | Dense |
| W8A8 - float | None | float_quantized | Dense |
| W4A16 - int | None | pack_quantized | Dense |
| W8A16 - int | None | pack_quantized | Dense |
| W8A16 - float | None | naive_quantized | Dense |
| W8A8 - int | 2:4 | int_quantized | Sparse24 |
| W8A8 - float | 2:4 | float_quantized | Sparse24 |
| W4A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - int | 2:4 | marlin_24 | Dense |
| W8A16 - float | 2:4 | naive_quantized | Dense |

## Explanation

- **Quantization Format**: Specifies the quantization compression format used.
- **Compressed Flag**: Boolean flag indicating whether the model should be compressed on disk.
- **Sparsity Structure**: Type of sparsity structure inferred; can be structured or unstructured.
- **Sparse Compression Format**: Resulting compression format, based on the `quantization_format`, `compressed_flag`, and sparsity structure.

### Notes

1. If the global sparsity (`global_sparsity`) is less than `0.05`, no compression configuration is returned.
2. For `CompressionFormat.marlin_24`, the compression format defaults to `dense` regardless of sparsity structure.
3. For other quantization formats:
   - If `compressed_flag` is `True` and the sparsity structure is `TWO_FOUR`, the compression format is `sparse_24_bitmask`. (This is the only sparse compressor supported in vLLM.)
   - Otherwise, the compression format defaults to `dense`.
4. The `SparsityThreshold` (default `0.5`) is used to determine **targets** and **ignores** in the model by evaluating parameter sparsity.

### Additional Information

- **Targets**: Parameters selected for sparsification based on the `SparsityThreshold`.
- **Ignores**: Parameters excluded from sparsification based on the `SparsityThreshold`.
- These are inferred by the `infer_sparse_targets_and_ignores` function, taking the sparsity structure and threshold into account.

---------

Signed-off-by: Rahul Tuli <[email protected]>
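The selection rules in the notes above can be sketched as a small decision function. This is an illustrative reconstruction, not the actual llm-compressor implementation; the function name, string-valued arguments, and the `"2:4"` literal (standing in for the `TWO_FOUR` enum) are assumptions made for the example.

```python
# Illustrative sketch of the sparsity-compressor selection rules described
# in the notes above; names and signatures are hypothetical.

MARLIN_24 = "marlin_24"


def infer_sparsity_compression_format(
    quantization_format: str,
    compressed_flag: bool,
    sparsity_structure: str,
    global_sparsity: float,
):
    """Return the sparsity compression format, or None when no
    compression configuration applies."""
    # Note 1: below 5% global sparsity, no compression config is returned.
    if global_sparsity < 0.05:
        return None
    # Note 2: marlin_24 already encodes the 2:4 pattern, so stay dense.
    if quantization_format == MARLIN_24:
        return "dense"
    # Note 3: only compressed 2:4 models use the bitmask sparse compressor
    # (the only sparse compressor supported in vLLM).
    if compressed_flag and sparsity_structure == "2:4":
        return "sparse_24_bitmask"
    return "dense"
```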
1 parent 4b805fe, commit f46d140. Showing 9 changed files with 670 additions and 34 deletions.
### tests/llmcompressor/transformers/compression/recipes/sparse_24.yaml (7 additions, 0 deletions)
```yaml
pruning_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      mask_structure: "2:4"
      targets: ['re:model.layers.\d*$']
```
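The recipe above prunes the decoder layers with a `2:4` mask structure, meaning at least two of every four consecutive weights are zero. A minimal checker for that property, written for illustration and not part of llm-compressor:

```python
import numpy as np


def satisfies_2_4(weights: np.ndarray) -> bool:
    """Check the '2:4' mask structure: every group of 4 consecutive
    values contains at least 2 zeros. Assumes the total number of
    elements is divisible by 4."""
    groups = weights.reshape(-1, 4)
    zeros_per_group = (groups == 0).sum(axis=1)
    return bool((zeros_per_group >= 2).all())
```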
### tests/llmcompressor/transformers/compression/recipes/sparse_24_fp8.yaml (38 additions, 0 deletions)
```yaml
pruning_stage:
  obcq_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      sequential_update: true
      mask_structure: "2:4"
      targets: ['re:model.layers.\d*$']
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: float
            strategy: channel
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: float
            strategy: token
            dynamic: true
            symmetric: true
          targets: ["Linear"]
  pruning_modifiers:
    ConstantPruningModifier:
      targets: [
        're:.*q_proj.weight',
        're:.*k_proj.weight',
        're:.*v_proj.weight',
        're:.*o_proj.weight',
        're:.*gate_proj.weight',
        're:.*up_proj.weight',
        're:.*down_proj.weight',
      ]
      start: 0
```
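The `ConstantPruningModifier` in the quantization stage keeps the 2:4 masks from the pruning stage fixed while the FP8 quantization runs. A minimal sketch of that idea, using hypothetical helper names rather than the actual llm-compressor internals: capture each target weight's zero mask once, then re-apply it after any update so the sparsity pattern cannot drift.

```python
import numpy as np

# Illustrative sketch (not the llm-compressor implementation) of constant
# pruning: the zero mask captured from the pruned weight is re-applied
# after every subsequent update, preserving the 2:4 pattern.


def capture_mask(weight: np.ndarray) -> np.ndarray:
    """Record which entries are currently nonzero."""
    return weight != 0


def apply_constant_mask(weight: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out entries that were pruned when the mask was captured."""
    return np.where(mask, weight, 0.0)
```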