
VLM Support via GPTQ Hooks and Data Pipelines #914

Merged — 345 commits merged into main on Jan 8, 2025
Conversation

@kylesayrs (Collaborator) commented Nov 13, 2024

Purpose

  • Enable oneshot quantization of vision-language models

[Images: VLM Banner; Llama 3.2-Vision Graphviz diagram]

Related Issues

Prerequisites

Changes

VLM Support

  • Add multimodal examples in examples/multimodal_vision
  • Modify custom_offload_device_map to support models which are not XForCausalLM
  • Add custom data collators for VLM models in src/llmcompressor/transformers/utils/data_collator.py
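
For illustration, a collator along these lines might be used for batch-size-1 VLM calibration. This is a hypothetical sketch, not the exact code in data_collator.py; field names and tensor handling vary per model processor.

```python
# Hypothetical sketch only; the actual collators in data_collator.py are
# defined per model and may differ in field names and tensor handling.
import torch


def vlm_data_collator(batch):
    # Oneshot calibration typically runs with batch_size=1, so each sample can
    # be converted to tensors directly instead of being padded and stacked.
    assert len(batch) == 1, "this sketch assumes a calibration batch size of 1"
    sample = batch[0]
    # Processor outputs (e.g. input_ids, attention_mask, pixel_values) arrive
    # as nested lists; convert each to a tensor for the model's forward pass.
    return {key: torch.tensor(value) for key, value in sample.items()}
```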

GPTQModifier

  • Implement hooks-based compression in GPTQModifier
    • This replaces layer-compressor, which made many assumptions about model architecture
    • This also enables finer-grained sequential compression such as true_sequential
    • Functions previously implemented in gptq_wrapper.py are now implemented in gptq_quantize.py
  • Implement offload_hessians parameter in GPTQModifier
  • Implement data-pipelines-based calibration in GPTQModifier (see the sketch after this list)
    • First, an attempt will be made to trace the model and run the sequential pipeline
    • If that fails, assumptions will be made about the model architecture and an attempt will be made to run the layer_sequential pipeline
      • This ensures backwards compatibility with any previously supported models
    • If that also fails, then the basic pipeline will be used, which is guaranteed to run but may require using offload_hessians
  • Change hessian instability from a ValueError to a _LinAlgError so it can be ignored by the gptq pipeline fallback mechanism
  • Add support for conv2d as indicated by AutoGPTQ
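
The fallback order described above can be sketched roughly as follows. This is a hedged illustration: run_sequential, run_layer_sequential, and run_basic stand in for the actual pipeline entrypoints, and the real GPTQModifier logic differs in detail.

```python
# Sketch of the calibration fallback logic, assuming placeholder pipeline
# entrypoints; the real GPTQModifier implementation differs in details.
import torch

# Errors that would recur in any pipeline (e.g. hessian instability, OOM), so
# they are re-raised immediately instead of triggering a fallback.
UNFIXABLE_ERRORS = (torch.OutOfMemoryError, torch._C._LinAlgError)


def calibrate(model, dataloader, sequential_targets):
    try:
        # 1. Preferred: trace the model and run the sequential pipeline.
        return run_sequential(model, dataloader, sequential_targets)
    except UNFIXABLE_ERRORS:
        raise
    except Exception:
        pass

    try:
        # 2. Fallback: assume a stack of decoder layers and run the
        #    layer_sequential pipeline (matches previously supported models).
        return run_layer_sequential(model, dataloader, sequential_targets)
    except UNFIXABLE_ERRORS:
        raise
    except Exception:
        pass

    # 3. Last resort: basic forward passes; guaranteed to run, but large
    #    models may additionally need offload_hessians=True to fit in memory.
    return run_basic(model, dataloader)
```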

Data Pipelines

  • Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
  • Basic Pipeline
    • Performs standard forward passes through the model with provided dataloader
    • Used as fallback, as well as in the future for basic calibration passes
  • Layer Sequential Pipeline
    • Refactor of LayerCompressor as a straightforward data pipeline
    • Uses IntermediatesCache to handle activation offloading
  • Sequential Pipeline
    • Uses graph tracing (torch.fx) to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
    • Implements BFS algorithm to assign nodes to partitions
      • An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
    • Each partition (Subgraph) is compiled as an executable python function with the proper inputs and outputs
    • Uses IntermediatesCache to handle activation offloading
  • Implement IntermediatesCache, which automagically handles the offloading and onloading of activations between batches (a stripped-down sketch of the idea follows this list)
    • This class is capable of offloading many non-standard activation types, such as tuples and dataclasses like BaseModelOutputWithPast
    • For convenience, the class also handles masking of padding
    • The class is tested in tests/llmcompressor/pipelines/test_cache.py
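
The offloading idea is easiest to see stripped down. The following is a minimal sketch assuming a recursive helper, not the actual IntermediatesCache API; it shows how tensors inside tuples and dataclasses (e.g. BaseModelOutputWithPast) can be moved to CPU while preserving container structure.

```python
# Minimal sketch of the offloading idea; the real class lives in
# src/llmcompressor/pipelines/cache.py and its API may differ.
import dataclasses

import torch


def offload_value(value, device=torch.device("cpu")):
    # Recursively move tensors to the offload device, preserving container
    # structure for tuples and dataclasses such as BaseModelOutputWithPast.
    if isinstance(value, torch.Tensor):
        return value.to(device)
    if isinstance(value, tuple):
        return tuple(offload_value(v, device) for v in value)
    if dataclasses.is_dataclass(value) and not isinstance(value, type):
        moved = {
            field.name: offload_value(getattr(value, field.name), device)
            for field in dataclasses.fields(value)
        }
        return dataclasses.replace(value, **moved)
    return value  # non-tensor values (ints, None, ...) are stored as-is


def onload_value(value, device):
    # Symmetric operation used when a batch's intermediates are needed again.
    return offload_value(value, device)
```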

Tracing

  • In order to support sequential quantization of the large variety of different multimodal model architectures, some model definitions have to be altered to support tracing
    • If the calibration dataset is text only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make the model traceable
    • For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimal?) increased memory usage, which leaves the door open for future support for quantizing modules in the vision tower
  • Add traceable model definitions for llava, mistral, mllama, and glm (an illustrative example of the kind of change involved follows this list)
  • All copyright licenses allow for alteration and redistribution; the line # vllm-project: no copyright was added in a similar style to text_generation.py
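
As an illustration of the kind of edit a traceable definition may need (a generic example, not a specific change from this PR): data-dependent Python control flow cannot be traced symbolically, and one common remedy is to move it into a helper registered as a torch.fx leaf so the tracer records a single call instead of tracing through the condition.

```python
# Illustrative only; the real edits in the traceable llava/mistral/mllama/glm
# definitions are architecture specific.
import torch


def maybe_drop_empty_mask(attention_mask):
    # Branching on a tensor's runtime value cannot be traced symbolically,
    # because proxy tensors carry no concrete values during tracing.
    if attention_mask is not None and int(attention_mask.sum()) == 0:
        return None
    return attention_mask


# Registering the helper as a "leaf" makes torch.fx insert a call to it into
# the traced graph instead of attempting to trace through the condition.
torch.fx.wrap("maybe_drop_empty_mask")
```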

Future Work / Follow-ups

Winogrande Evaluations

| Model | Dataset | Scheme | Runtime | Winogrande |
|---|---|---|---|---|
| Llama-3-8B | ultrachat | W4A16 | 43m, 2xA4000 | 0.7545 |
| Llama-3-70B | ultrachat | W4A16 | 303m, 1xH100 | 0.8216 |
| Mixtral-8x7B | ultrachat | W4A16 | 317m, 1xA100 | 0.8200 |
| openbmb/MiniCPM3-4B | ultrachat | W4A16 | 63m, 1xA100 | 0.6701 |
| Qwen2-VL-2B-Instruct | ultrachat | W8A8 | 12m, 2xA4000 | 0.6188 |
| Qwen2-VL-2B-Instruct | flickr | W8A8 | 24m, 2xA4000 | 0.6093 |
| Llama-3.2-11B-Vision-Instruct | flickr | W8A8 | 75m, 1xA100 | 0.7837 |
| Pixtral-12B-2409 | flickr | W8A8 | 52m, 1xA100 | 0.7924 |
| llava-1.5-7b-hf | flickr | W8A8 | 15m, 1xH100 | 0.7214 |
| Phi-3-vision-128k-instruct | flickr | W4A16 | 51m, 1xA100 | 0.7151 |

lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1

MMMU Evaluations

Credit to @shubhra

| Model | Dataset | Scheme | MMMU |
|---|---|---|---|
| Llama-3.2-11B-Vision | N/A | Dense | 0.4144 |
| Llama-3.2-11B-Vision | N/A | FP8-dynamic | 0.4300 |
| Llama-3.2-11B-Vision | flickr | W4A16 | 0.4377 |
| Llama-3.2-11B-Vision | flickr | W4A16-group | 0.4211 |

| Model | Dataset | Scheme | MMMU |
|---|---|---|---|
| Llama-3.2-90B-Vision | N/A | Dense | 0.5388 |
| Llama-3.2-90B-Vision | N/A | FP8-dynamic | 0.5278 |
| Llama-3.2-90B-Vision | flickr | W4A16 | 0.5111 |
| Llama-3.2-90B-Vision | flickr | W4A16-group | 0.5477 |

| Model | Dataset | Scheme | MMMU |
|---|---|---|---|
| Pixtral-12B-2409 | N/A | Dense | 0.5022 |
| Pixtral-12B-2409 | N/A | FP8-dynamic | 0.5322 |
| Pixtral-12B-2409 | flickr | W4A16 | 0.4500 |
| Pixtral-12B-2409 | flickr | W4A16-group | 0.4689 |

Testing

@kylesayrs kylesayrs requested a review from dsikka January 5, 2025 05:57
A collaborator commented on the pipeline fallback code:

```python
input_names = state.data.calib.dataset.column_names
unfixable_errors = (torch.OutOfMemoryError, torch._C._LinAlgError)
try:
    run_sequential(
```

Could we do "Layer Sequential" and "Subgraph Sequential"? "Sequential" being indicative of the data/error propagation, while using "layer" and "subgraph" to differentiate between data structures?


rahul-tuli previously approved these changes Jan 6, 2025

@rahul-tuli (Collaborator) left a comment:

Really like the IntermediatesCache implementation, good job!

@kylesayrs kylesayrs requested review from rahul-tuli and dsikka January 7, 2025 00:22
Collaborator (replying on the same code):

Hm, let me think of other descriptors. I think we just want each of the pipelines beyond the basic pipeline to be a little more verbose in its name.

@mgoin (Member) left a comment:

I think these Traceable model definitions have very opaque changes compared to the reference model definitions. This architecture seems like an intensive blocker to add support for a new model, as it requires a lot of knowledge of tracing limitations. However, I understand the need - I'll look in more detail tomorrow.

@kylesayrs (Collaborator, Author) replied:

@mgoin I think the Tracing Guide will clarify how to make changes to your model so that it is traceable, and why tracing is the best and least invasive solution currently available.

Also note that:

  1. Unlike vllm, custom model definitions are not needed for every model. For the vast majority of text models, custom definitions are not required. Most vision models calibrated with text datasets also do not require custom tracing. Custom definitions are mostly required for vision models calibrated with vision datasets, and even then some models, such as phi3_vision, do not require any changes.
  2. Even if a text model is not traceable, GPTQ falls back to the layer_sequential pipeline, which is equivalent to what is currently on main. Therefore these changes only extend what is possible with llm-compressor now.

@kylesayrs kylesayrs requested review from mgoin and dsikka January 7, 2025 06:35
@dsikka (Collaborator) left a comment:

Great work!

@markurtz markurtz self-requested a review January 8, 2025 16:40
@mgoin (Member) left a comment:

Thanks for responding to comments, let's get to the followup items after this.

@dsikka dsikka merged commit 03e2177 into main Jan 8, 2025
6 of 7 checks passed
@dsikka dsikka deleted the kylesayrs/gptq-partition branch January 8, 2025 22:15