
Implement HooksMixin #917

Merged: 18 commits merged into main from kylesayrs/HooksMixin on Dec 6, 2024
Conversation

@kylesayrs (Collaborator) commented Nov 14, 2024

Purpose

  • Precursor to VLM Support via GPTQ Hooks and Data Pipelines #914
  • Create a shared API for adding hooks to modules
  • Allow code which handles data pipelines to selectively disable hooks for certain passes. This will be needed for custom data pipelines (GPTQ/Wanda/SparseGPT) and when multiple modifiers are active at the same time.
    • This is needed for GPTQ-style sequential algorithms, which require one pass with hooks in order to accumulate the hessians and compress, and a second pass without hooks in order to compute the compressed (weight-quantized) outputs
    • This also gives research users a tool to control when hooks are enabled from within the data pipelines
for layer in model_layers:
    # accumulate hessians
    unquantized_outputs = layer(*args, **kwargs)

    # get sequential outputs
    with HooksMixin.disable_hooks():
        quantized_outputs = layer(*args, **kwargs)
    
    print(f"Mean error from quantization: {get_loss(unquantized_outputs, quantized_outputs)}")

Changes

  • Implement HooksMixin
    • The _HOOKS_DISABLED attribute is a class-level flag shared by all modifiers and is used to disable hooks globally
    • The _hooks attribute is an instance-level list on each modifier which tracks all of the hooks created by that modifier (see the sketch after this list)
  • Integrate with QuantizationModifier, refactor calibration functions to reference the same function rather than generating hook functions
  • Integrate with SmoothQuantModifier
  • Integrate with WandaPruningModifier and SparseGPTModifier
  • Integrate with MagnitudePruningModifier and ConstantPruningModifier via LayerParamMasking
  • Purposefully did not integrate with LayerCompressor, since this will be handled by future data pipelines and doing so would add BaseModel inheritance to the LayerCompressor class, which would add unnecessary complexity to this PR
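
For orientation, here is a minimal sketch of the pattern described above, assuming only what the PR description names (_HOOKS_DISABLED, _hooks, disable_hooks, register_hook); the exact signatures, the remove_hooks helper, and the wrapper logic are assumptions, not the merged implementation.

import contextlib
from typing import Callable, List

import torch
from torch.utils.hooks import RemovableHandle


class HooksMixin:
    # class-level flag: when True, every hook registered through the mixin is skipped
    _HOOKS_DISABLED: bool = False

    def __init__(self):
        # per-modifier list of handles created by this modifier
        self._hooks: List[RemovableHandle] = []

    @classmethod
    @contextlib.contextmanager
    def disable_hooks(cls):
        """Temporarily disable every hook registered through the mixin."""
        try:
            cls._HOOKS_DISABLED = True
            yield
        finally:
            cls._HOOKS_DISABLED = False

    def register_hook(self, module: torch.nn.Module, func: Callable, hook_type: str = "forward") -> RemovableHandle:
        """Attach `func` as a forward/forward_pre hook that respects the disabled flag."""
        def wrapped(*args, **kwargs):
            if HooksMixin._HOOKS_DISABLED:
                return None
            return func(*args, **kwargs)

        handle = getattr(module, f"register_{hook_type}_hook")(wrapped)
        self._hooks.append(handle)
        return handle

    def remove_hooks(self):
        """Detach every hook owned by this modifier."""
        for handle in self._hooks:
            handle.remove()
        self._hooks = []

A modifier inheriting from this mixin registers its hooks once and can then wrap any forward pass in with HooksMixin.disable_hooks(), as in the Purpose snippet above.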

Testing

  • Added tests in tests/llmcompressor/modifiers/utils/test_hooks.py


@kylesayrs force-pushed the kylesayrs/HooksMixin branch from ec59d6c to 45953c4 on November 15, 2024
@kylesayrs self-assigned this Nov 15, 2024
@kylesayrs force-pushed the kylesayrs/HooksMixin branch from 840a41b to 0bc7bae on November 15, 2024
@kylesayrs force-pushed the kylesayrs/HooksMixin branch from 793ae75 to 55f69d6 on November 15, 2024

@dsikka (Collaborator) left a comment

We briefly looked at the implications of using hooks with FSDP - are we taking care of that already or through this PR?

@kylesayrs (Collaborator, Author)

@dsikka I consider that out of scope for this PR. FSDP is unsupported as of now, although this PR makes it easier to support FSDP in the future.

Modifying a module's parameters requires being inside special FSDP contexts:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel
# TrainingState / HandleTrainingState are FSDP-internal enums
# (torch.distributed.fsdp._common_utils in recent torch versions)
from torch.distributed.fsdp._common_utils import HandleTrainingState, TrainingState

@torch.no_grad()
def pre_hook(module, _args):
    # modifying both the training state and the handle training state is required
    with model._use_training_state(TrainingState.IDLE, HandleTrainingState.IDLE):
        with FullyShardedDataParallel.summon_full_params(model):
            # modify the module weight; doing so outside of these contexts
            # raises a non-contiguous tensor error
            module.weight *= 0

We can bake these contexts into the HooksMixin.register_hook function, although there are implementation details associated with that which I'd like to leave for a separate task/PR.
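
As a hedged sketch of what baking these contexts into hook registration could look like: the helper name, its signature, and the enum import path below are assumptions, and _use_training_state is the private API from the snippet above, not something this PR adds.

import torch
from torch.distributed.fsdp import FullyShardedDataParallel
# FSDP-internal enums; module path may vary across torch versions (assumption)
from torch.distributed.fsdp._common_utils import HandleTrainingState, TrainingState


def register_fsdp_safe_hook(modifier, model, module, func, hook_type="forward_pre"):
    """Hypothetical helper: register `func` so it always runs inside the contexts
    required to safely modify FSDP-sharded parameters."""

    @torch.no_grad()
    def wrapped(*args, **kwargs):
        with model._use_training_state(TrainingState.IDLE, HandleTrainingState.IDLE):
            with FullyShardedDataParallel.summon_full_params(model):
                return func(*args, **kwargs)

    # delegate to the mixin's register_hook so the hook still honors disable_hooks()
    return modifier.register_hook(module, wrapped, hook_type)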

@dsikka (Collaborator) left a comment

Overall looks good in cleaning up/unifying hooks.

Current testing should test the changes with the QuantizationModifier- do we think this is the case for the other modifiers being tested?

The other thought I had was about a less common but potentially useful use case where a modifier may have hooks for different cases and may want to target turning off a specific subset as opposed to all of them - do we think the hooks mixin class can be extended easily to handle that?

@kylesayrs (Collaborator, Author)

@dsikka

> Current testing should test the changes with the QuantizationModifier- do we think this is the case for the other modifiers being tested?

I've tested with the e2e tests, although I can perform more rigorous testing if we think that's necessary.

> The other thought I had was about a less common but potentially useful use case where a modifier may have hooks for different cases and may want to target turning off a specific subset as opposed to all of them - do we think the hooks mixin class can be extended easily to handle that?

Yes! There are good arguments to be made for enabling this kind of functionality within the GPTQ algorithm, and unifying hooks makes implementing this functionality much easier.
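
As a sketch of how that subset case might be handled, disable_hooks could accept an optional list of handles to skip instead of disabling everything. The handles argument, the id-based bookkeeping, and the class below are hypothetical extensions for illustration, not part of this PR.

import contextlib
from typing import Callable, List, Optional, Set

import torch
from torch.utils.hooks import RemovableHandle


class HooksMixin:
    _DISABLE_ALL: bool = False
    _DISABLED_IDS: Set[int] = set()  # ids of individually disabled hook handles

    def __init__(self):
        self._hooks: List[RemovableHandle] = []

    @classmethod
    @contextlib.contextmanager
    def disable_hooks(cls, handles: Optional[List[RemovableHandle]] = None):
        """Disable every hook, or only the subset identified by `handles`."""
        try:
            if handles is None:
                cls._DISABLE_ALL = True
            else:
                cls._DISABLED_IDS.update(handle.id for handle in handles)
            yield
        finally:
            cls._DISABLE_ALL = False
            if handles is not None:
                cls._DISABLED_IDS.difference_update(handle.id for handle in handles)

    def register_hook(self, module: torch.nn.Module, func: Callable, hook_type: str = "forward") -> RemovableHandle:
        handle_ref: List[RemovableHandle] = []  # lets the wrapper look up its own handle id

        def wrapped(*args, **kwargs):
            if HooksMixin._DISABLE_ALL or handle_ref[0].id in HooksMixin._DISABLED_IDS:
                return None
            return func(*args, **kwargs)

        handle = getattr(module, f"register_{hook_type}_hook")(wrapped)
        handle_ref.append(handle)
        self._hooks.append(handle)
        return handle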

@kylesayrs requested a review from dsikka on December 3, 2024
@dsikka (Collaborator) left a comment

I'd suggest checking out the nightly test cases and making sure we're not running into any issues there. LGTM.

@dsikka (Collaborator) commented Dec 5, 2024

oh ignore my nightly comment.

@dsikka merged commit 9f58887 into main on Dec 6, 2024
6 of 7 checks passed
@dsikka deleted the kylesayrs/HooksMixin branch on December 6, 2024
dsikka added a commit that referenced this pull request Jan 8, 2025
## Purpose ##
* Enable oneshot quantization of vision-language models

![VLM Banner](https://github.com/user-attachments/assets/0d748714-b524-44f4-b850-a721f35d5543)
[Llama_3.2-Vision Graphviz](https://github.com/user-attachments/assets/6b371ccc-f9f6-4bf2-b4cd-24ed75a3cad0)

## Related Issues ##
* Fixes #91
* Fixes #961
* Fixes #990

## Prerequisites ##
* neuralmagic/compressed-tensors#193
* #917
* #943
  * #955
    * #950
* #998
* #1014

## Changes ##
### VLM Support ###
* Add multimodal examples in `examples/multimodal_vision`
* Modify `custom_offload_device_map` to support models which are not
`XForCausalLM`
* Add custom data collators for VLM models in
`src/llmcompressor/transformers/utils/data_collator.py`

### GPTQModifier ###
* Implement hooks-based compression in `GPTQModifier`
* This replaces layer-compressor, which made many assumptions about
model architecture
* This also enables finer-grained sequential compression such as
[true_sequential](https://huggingface.co/docs/transformers/main_classes/quantization#transformers.GPTQConfig.true_sequential)
* Functions previously implemented in `gptq_wrapper.py` are now
implemented in `gptq_quantize.py`
* Implement `offload_hessians` parameter in `GPTQModifier`
* Implement data-pipelines-based calibration in `GPTQModifier`
* First an attempt will be made to trace the model and run the
`sequential` pipeline
* If that fails, assumptions will be made about the model architecture
and an attempt will be made to run the `layer_sequential` pipeline
* This ensures backwards compatibility with any previously supported
models
* If that fails, then the basic pipeline will be used, which is
guaranteed to run but may require using `offload_hessians` (see the
sketch at the end of this section)
* Change hessian instability from a `ValueError` to a `_LinAlgError` so
it can be ignored by the gptq pipeline fallback mechanism
* Add support for conv2d as indicated by
[AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ/blob/6689349625de973b9ee3016c28c11f32acf7f02c/auto_gptq/quantization/gptq.py#L45-L54)
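
A rough sketch of the fallback order described above; the pipelines are passed in as callables here because the function names and signatures are placeholders, not the exact APIs merged in this PR.

```python
from typing import Callable

import torch


def calibrate_with_fallback(
    model: torch.nn.Module,
    dataloader,
    sequential: Callable,        # traced, per-target pipeline (preferred)
    layer_sequential: Callable,  # assumes a standard stack of decoder layers
    basic: Callable,             # plain forward passes; always runs
):
    """Try the most precise pipeline first, falling back on failure."""
    try:
        return sequential(model, dataloader)
    except Exception:
        # tracing or numerical failures, e.g. torch._C._LinAlgError from
        # unstable hessians, fall through to the next pipeline
        pass
    try:
        return layer_sequential(model, dataloader)
    except Exception:
        # the assumed decoder-layer architecture did not hold
        pass
    # guaranteed to run, but may require offload_hessians to fit in memory
    return basic(model, dataloader)
```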

### Data Pipelines ###
* Implement the basic skeletons of data pipelines, which are subject to
change when data pipelines are pulled out of modifiers
* Basic Pipeline
* Performs standard forward passes through the model with provided
dataloader
* Used as a fallback, as well as in the future for basic calibration
passes (a minimal sketch follows at the end of this section)
* Layer Sequential Pipeline
  * Refactor of `LayerCompressor` as a straightforward data pipeline
  * Uses `IntermediatesCache` to handle activation offloading
* Sequential Pipeline
* Utilizes graph tracing implemented by `torch.fx` to trace the graph in
order to determine where sequential targets (layers) exist in the graph
and what their inputs and outputs are
  * Implements BFS algorithm to assign nodes to partitions
* An ideal implementation consolidates partition indices to assign each
node to the latest possible partition, delaying execution. The current
implementation addresses the most common case (node.op == get_attr)
* Each partition (`Subgraph`) is compiled as an executable python
function with the proper inputs and outputs
  * Uses `IntermediatesCache` to handle activation offloading
* Implement `IntermediatesCache` which automagically handles the
offloading and onloading of activations from batches
* This class is capable of offloading many non-standard activation types
such as `Tuple`s and dataclasses such as `BaseModelOutputWithPast`
  * For convenience, the class also handles masking out padding
  * The class is tested in `tests/llmcompressor/pipelines/test_cache.py`
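
The basic pipeline referenced above amounts to plain forward passes so that registered calibration hooks can observe activations. A minimal sketch with illustrative names only; the layer-sequential and sequential pipelines additionally cache each step's inputs and outputs via `IntermediatesCache` rather than rerunning full forward passes.

```python
import torch


def basic_pipeline(model: torch.nn.Module, dataloader) -> None:
    """Run plain forward passes over the calibration data; registered hooks do the work."""
    device = next(model.parameters()).device
    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            batch = {key: value.to(device) for key, value in batch.items()}
            model(**batch)  # calibration hooks registered via HooksMixin fire here
```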

### Tracing ###
* In order to support sequential quantization of the large variety of
different multimodal model architectures, some model definitions have to
be altered to support tracing (a sketch of the tracing step follows at the end of this section)
* If the calibration dataset is text only, most LLMs and VLMs are
traceable without additional work. Multimodal calibration datasets are
more likely to require changes to make the model traceable
* For many VLMs (but not all), the vision tower is not traceable without
significant work. However, this only affects sequential error
propagation and (minimal?) increased memory usage, which leaves the door
open for future support for quantizing modules in the vision tower
* Add traceable model definitions for llava, mistral, mllama, and glm
* All copyright licenses allow for alteration and redistribution; the
line `# vllm-project: no copyright` was added in similar style to
[text_generation.py](https://github.com/vllm-project/llm-compressor/blob/main/src/llmcompressor/transformers/finetune/text_generation.py#L18)
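
To illustrate how tracing interacts with sequential targets, here is a hedged sketch of treating each target as a leaf module so the traced graph can be partitioned at layer boundaries; the `SequentialTargetTracer` class and its usage are assumptions for illustration, not the tracing code merged in this PR.

```python
import torch
from torch import fx


class SequentialTargetTracer(fx.Tracer):
    """Hypothetical tracer that keeps each sequential target (e.g. a decoder layer)
    as a single graph node, so partitions can be cut at target boundaries."""

    def __init__(self, target_classes):
        super().__init__()
        self.target_classes = tuple(target_classes)

    def is_leaf_module(self, module: torch.nn.Module, module_qualified_name: str) -> bool:
        if isinstance(module, self.target_classes):
            return True
        return super().is_leaf_module(module, module_qualified_name)


# hypothetical usage: graph = SequentialTargetTracer([MyDecoderLayer]).trace(model)
```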

## Future Work / Follow-ups ##
* #1027
* #1032
* #1039
* #1030
* Create better data collators capable of handling larger batch sizes in
order to support VLM fine tuning
* Better support prompt masking for multimodal processors in order to
support VLM fine tuning

## Winogrande Evaluations ##

Model | Dataset | Scheme | Runtime | Winogrande |
-- | -- | -- | -- | --
Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545 
Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216 
Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200 
openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701 
Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188 
Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093 
Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837 
Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924 
llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214 
Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151 

`lm_eval --model vllm --model_args
pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True
--tasks winogrande --num_fewshot 5 --batch_size 32`
`lm_eval --model vllm --model_args
pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1
--tasks winogrande --num_fewshot 5 --batch_size 1`

## MMMU Evaluations ##
Credit to @shubhra 

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-11B-Vision | N/A | Dense | 0.4144
Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300
Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377
Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Llama-3.2-90B-Vision | N/A | Dense | 0.5388
Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278
Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111
Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477

Model | Dataset | Scheme | MMMU
-- | -- | -- | --
Pixtral-12B-2409 | N/A | Dense | 0.5022
Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322
Pixtral-12B-2409 | flickr | W4A16 | 0.4500
Pixtral-12B-2409 | flickr | W4A16-group | 0.4689

## Testing ##
*
[Nightly](https://github.com/neuralmagic/llm-compressor-testing/actions/runs/12640439996)

---------

Signed-off-by: Kyle Sayers <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>