Implement HooksMixin #917
Conversation
We briefly looked at the implications of using hooks with FSDP - are we taking care of that already or through this PR?
@dsikka I consider that to be out of scope for this PR. I consider FSDP to be unsupported as of now, although this PR makes it easier to support FSDP in the future. Modifying a module's parameter requires being in special FSDP contexts:

```python
import torch
# private FSDP internals; exact import paths may vary across torch versions
from torch.distributed.fsdp import FullyShardedDataParallel
from torch.distributed.fsdp._common_utils import HandleTrainingState, TrainingState


@torch.no_grad()
def pre_hook(module, _args):
    # `model` is the FullyShardedDataParallel-wrapped model, assumed in scope.
    # Modifying both the training and handle training states is required
    with model._use_training_state(TrainingState.IDLE, HandleTrainingState.IDLE):
        with FullyShardedDataParallel.summon_full_params(model):
            # modify the module weight. Doing so outside of these contexts
            # will raise a non-contiguous tensor error
            module.weight *= 0
```

We can bake these contexts into the
Overall looks good in cleaning up/unifying hooks.

Current testing should exercise the changes with the `QuantizationModifier` - do we think this is the case for the other modifiers being tested?

The other thought I had was about a less common but potentially useful use case: a modifier may have hooks for different purposes and may want to turn off a specific subset of them, as opposed to all of them - do we think the hooks mixin class can be extended easily to handle that?

I've tested with the e2e tests, although I can perform more rigorous testing if we think that's necessary.

Yes! There are good arguments to be made for enabling this kind of functionality within the GPTQ algorithm, and unifying hooks makes implementing this functionality much easier - one possible shape for it is sketched below.
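For illustration only, one hypothetical way to support such a subset is to key each registered handle by a group name; the `GroupedHooks` class and its `group` parameter below are not an existing API, just a sketch of the idea:

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

import torch
from torch.utils.hooks import RemovableHandle


class GroupedHooks:
    """Illustrative only: bookkeeping that tracks hook handles per named group."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[RemovableHandle]] = defaultdict(list)

    def register_hook(
        self, module: torch.nn.Module, hook: Callable, group: str = "default"
    ) -> RemovableHandle:
        # register a standard forward hook, remembering which group it belongs to
        handle = module.register_forward_hook(hook)
        self._hooks[group].append(handle)
        return handle

    def remove_hooks(self, group: Optional[str] = None) -> None:
        # remove one named subset if given, otherwise all registered hooks
        names = [group] if group is not None else list(self._hooks)
        for name in names:
            for handle in self._hooks.pop(name, []):
                handle.remove()
```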
I'd suggest checking out the nightly test cases and making sure we're not running into any issues there. LGTM.

Oh, ignore my nightly comment.
## Purpose ##

* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)

[Llama 3.2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##

* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##

* neuralmagic/compressed-tensors#193
* #917
* #943
* #955
* #950
* #998
* #1014

## Changes ##

### VLM Support ###

* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not `XForCausalLM`
* Add custom data collators for VLM models in `src/llmcompressor/transformers/utils/data_collator.py`

### GPTQModifier ###

* Implement hooks-based compression in `GPTQModifier`
  * This replaces layer-compressor, which made many assumptions about model architecture
  * This also enables finer-grained sequential compression such as [true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
  * Functions previously implemented in `gptq_wrapper.py` are now implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier` (see the fallback sketch after this description)
  * First, an attempt will be made to trace the model and run the `sequential` pipeline
  * If that fails, assumptions will be made about the model architecture and an attempt will be made to run the `layer_sequential` pipeline
    * This ensures backwards compatibility with any previously supported models
  * If that fails, then the basic pipeline will be used, which is guaranteed to run but may require using `offload_hessians`
* Change hessian instability from a `ValueError` to a `_LinAlgError` so it can be ignored by the GPTQ pipeline fallback mechanism
* Add support for conv2d, as indicated by [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)

### Data Pipelines ###

* Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
* Basic Pipeline
  * Performs standard forward passes through the model with the provided dataloader
  * Used as a fallback, as well as for basic calibration passes in the future
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
  * Utilizes graph tracing implemented by `torch.fx` to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
  * Implements a BFS algorithm to assign nodes to partitions
    * An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (`node.op == get_attr`)
  * Each partition (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches
  * This class is capable of offloading many non-standard activation types, such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`

### Tracing ###

* In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing
* If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make traceable
* For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution; the line `# vllm-project: no copyright` was added in similar style to [text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)

## Future Work / Follow-ups ##

* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32`

`lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1`

## MMMU Evaluations ##

Credit to @shubhra

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##

* [Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
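For illustration, the three-stage fallback described under GPTQModifier might look roughly like the sketch below; `calibrate_with_fallback` and the pipeline callables are hypothetical names, not the PR's actual API:

```python
def calibrate_with_fallback(model, dataloader, sequential, layer_sequential, basic):
    """Try pipelines from most to least structured: each fallback makes fewer
    assumptions about the model architecture but may cost more memory."""
    for pipeline in (sequential, layer_sequential, basic):
        try:
            pipeline(model, dataloader)
            return
        except Exception as err:
            # e.g. a torch.fx tracing failure, or a hessian-instability
            # _LinAlgError, which this PR lets the fallback mechanism catch
            print(f"{pipeline.__name__} failed ({err}); trying next pipeline")
    raise RuntimeError("all calibration pipelines failed")
```

The real code presumably distinguishes failure types rather than catching `Exception` broadly; the basic pipeline is described as guaranteed to run, possibly with `offload_hessians` enabled.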
## Purpose ##

## Changes ##

* Implement `HooksMixin` (a sketch of the pattern appears at the end of this description)
  * The `_HOOKS_DISABLED` attribute is a global variable attached to the class which is used to disable hooks globally
  * The `_hooks` attribute is a local variable attached to each modifier which lists all of the hooks created by that modifier
* Apply `HooksMixin` to `QuantizationModifier`; refactor calibration functions to reference the same function rather than generating hook functions
* Apply `HooksMixin` to `SmoothQuantModifier`
* Apply `HooksMixin` to `WandaPruningModifier` and `SparseGPTModifier`
* Apply `HooksMixin` to `MagnitudePruningModifier` and `ConstantPruningModifier` via `LayerParamMasking`
* `LayerCompressor` is left unchanged, since this will be handled by future data pipelines, and doing so would add the `BaseModel` inheritance to the `LayerCompressor` class, which adds unnecessary complexity to this PR

## Testing ##

* `tests/llmcompressor/modifiers/utils/test_hooks.py`
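A minimal sketch of the mixin as described above, assuming PyTorch's `RemovableHandle` bookkeeping; the method names shown here illustrate the pattern rather than reproduce the PR verbatim:

```python
from contextlib import contextmanager
from typing import Callable, ClassVar, List

import torch
from torch.utils.hooks import RemovableHandle


class HooksMixin:
    # class-level (global) flag: when True, every wrapped hook becomes a no-op
    _HOOKS_DISABLED: ClassVar[bool] = False

    def __init__(self):
        # per-modifier list of handles, so each modifier can remove its own hooks
        self._hooks: List[RemovableHandle] = []

    @classmethod
    @contextmanager
    def disable_hooks(cls):
        """Temporarily disable all hooks registered through the mixin."""
        try:
            cls._HOOKS_DISABLED = True
            yield
        finally:
            cls._HOOKS_DISABLED = False

    def register_hook(
        self, module: torch.nn.Module, hook: Callable, hook_type: str
    ) -> RemovableHandle:
        """Register e.g. a 'forward' or 'forward_pre' hook that respects the global flag."""
        def wrapped(*args, **kwargs):
            if HooksMixin._HOOKS_DISABLED:
                return None
            return hook(*args, **kwargs)

        handle = getattr(module, f"register_{hook_type}_hook")(wrapped)
        self._hooks.append(handle)
        return handle

    def remove_hooks(self):
        """Remove all hooks registered by this modifier."""
        for handle in self._hooks:
            handle.remove()
        self._hooks.clear()
```

Under these assumptions, a modifier would call `self.register_hook(module, calibrate_fn, "forward")` during initialization, wrap forward passes that should skip calibration in `with HooksMixin.disable_hooks(): ...`, and call `self.remove_hooks()` on finalization.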