VLM Support via GPTQ Hooks and Data Pipelines #914
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
…s. Requires patching modeling_llava
Signed-off-by: Kyle Sayers <[email protected]>
```python
input_names = state.data.calib.dataset.column_names

unfixable_errors = (torch.OutOfMemoryError, torch._C._LinAlgError)
try:
    run_sequential(
```
Could we do "Layer Sequential" and "Subgraph Sequential"? "Sequential" being indicative of the data/error propagation, while using "layer" and "subgraph" to differentiate between the data structures?
…allbacks
Signed-off-by: Kyle Sayers <[email protected]>
```python
input_names = state.data.calib.dataset.column_names

unfixable_errors = (torch.OutOfMemoryError, torch._C._LinAlgError)
try:
    run_sequential(
```
Hm, let me think of other descriptors.
I think we just want each of the pipelines beyond the basic pipeline to have a slightly more verbose name.
I think these Traceable model definitions have very opaque changes compared to the reference model definitions. This architecture seems like an intensive blocker for adding support for a new model, as it requires a lot of knowledge of tracing limitations. However, I understand the need - I'll look at it in more detail tomorrow.
@mgoin I think the Tracing Guide will clarify how and why to make changes to your model to make it traceable, and why tracing is the best and least invasive solution currently available. Also note that
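For context, a minimal illustration (not code from this PR) of the kind of change a traceable model definition makes: data-dependent Python control flow breaks `torch.fx` symbolic tracing, because the traced `Proxy` values have no concrete data to branch on. One common workaround is to mark the offending helper as a leaf with `torch.fx.wrap`, so the tracer records a call instead of tracing through the branch. The class and function names below are hypothetical.

```python
import torch
import torch.fx


class Untraceable(torch.nn.Module):
    def forward(self, x):
        # Branch depends on runtime data -> symbolic tracing fails here
        if x.sum() > 0:
            return x * 2
        return x


# Fix: wrap the data-dependent helper as a leaf function so the tracer
# records a single call_function node instead of entering the branch
@torch.fx.wrap
def double_if_positive(x):
    return x * 2 if x.sum() > 0 else x


class Traceable(torch.nn.Module):
    def forward(self, x):
        return double_if_positive(x)


# Tracing the fixed model succeeds; the wrapped helper still runs with
# real tensors when the GraphModule is executed
graph_module = torch.fx.symbolic_trace(Traceable())
```

The wrapped function behaves identically at runtime; only the tracer's view of it changes.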
Great work!
Thanks for responding to the comments; let's get to the follow-up items after this.
## Purpose
Llama_3.2-Vision Graphviz
## Related Issues

## Prerequisites

## Changes

### VLM Support
* `examples/multimodal_vision` examples
* `custom_offload_device_map` to support models which are not `XForCausalLM`
* `src/llmcompressor/transformers/utils/data_collator.py`
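As a sketch of what such a collator has to handle (hypothetical code, not the file's actual contents): vision calibration samples mix token ids with image tensors whose shapes can vary between samples, so a VLM collator typically keeps batch size 1 and tensorizes each field as-is rather than padding and stacking.

```python
import torch


def vlm_data_collator(batch):
    """Hypothetical sketch of a VLM calibration data collator.

    Vision inputs (e.g. pixel_values) can differ in shape between
    samples, so calibration runs with batch_size=1 and each field is
    converted to a tensor individually instead of being padded and
    stacked across samples.
    """
    assert len(batch) == 1, "VLM calibration typically uses batch_size=1"
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```

A usage sketch: `vlm_data_collator([{"input_ids": [[1, 2, 3]], "pixel_values": [[[0.0, 1.0]]]}])` returns a dict of tensors ready to feed the model's forward pass.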
### GPTQModifier

* `GPTQModifier`: `gptq_wrapper.py` functions are now implemented in `gptq_quantize.py`
* Added `offload_hessians` parameter in `GPTQModifier`
* `GPTQModifier` attempts the `sequential` pipeline first, then the `layer_sequential` pipeline, then falls back to `offload_hessians`
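That fallback order might be sketched as follows (the pipeline-runner names are illustrative, not the library's actual API); the key detail is that unfixable errors such as OOM or a singular Hessian propagate immediately instead of silently triggering a slower pipeline.

```python
import torch

# Errors no fallback can fix, mirroring the tuple quoted in this PR.
# torch.OutOfMemoryError is only spelled this way on newer torch
# releases; older ones expose it as torch.cuda.OutOfMemoryError.
UNFIXABLE_ERRORS = (
    getattr(torch, "OutOfMemoryError", torch.cuda.OutOfMemoryError),
    torch._C._LinAlgError,
)


def run_with_fallback(*pipelines):
    """Try each pipeline in order (e.g. sequential -> layer_sequential
    -> basic), falling back on recoverable errors such as tracing
    failures. Illustrative sketch, not the library's actual code."""
    last_error = None
    for pipeline in pipelines:
        try:
            return pipeline()
        except UNFIXABLE_ERRORS:
            raise  # no other pipeline can fix OOM or a singular Hessian
        except Exception as exception:
            last_error = exception  # e.g. model not traceable; try next
    raise RuntimeError("all pipelines failed") from last_error
```

Keeping the unfixable-error check first is what makes the `ValueError` → `_LinAlgError` change below matter: only errors outside that tuple are treated as recoverable.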
* Changed `ValueError` to a `_LinAlgError` so it can be ignored by the GPTQ pipeline fallback mechanism

### Data Pipelines
* `LayerCompressor` as a straight-forward data pipeline; uses `IntermediatesCache` to handle activation offloading
* Uses `torch.fx` to trace the graph in order to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are. Each subgraph (`Subgraph`) is compiled as an executable python function with the proper inputs and outputs. Uses `IntermediatesCache` to handle activation offloading
* Implemented `IntermediatesCache`, which automagically handles the offloading and onloading of activations from batches. Supports `Tuple`s and dataclasses such as `BaseModelOutputWithPast`. Tested by `tests/llmcompressor/pipelines/test_cache.py`
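The tracing step can be illustrated on a toy model (a sketch, not the pipeline's actual code): `torch.fx.symbolic_trace` unrolls the forward pass into a flat graph, after which the nodes matching the sequential targets, and the inputs that feed them, can be read off directly.

```python
import torch
import torch.fx
from torch import nn


class ToyModel(nn.Module):
    """Stand-in for a decoder model: an embedding projection followed by
    two 'layers' that play the role of sequential targets."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(4, 8)
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(2)])

    def forward(self, x):
        x = self.embed(x)
        for layer in self.layers:  # unrolled by symbolic tracing
            x = layer(x)
        return x


graph_module = torch.fx.symbolic_trace(ToyModel())

# For each sequential target, record which graph nodes feed it; this is
# the information needed to cut the graph into executable subgraphs.
subgraph_inputs = {
    node.target: [arg.name for arg in node.args]
    for node in graph_module.graph.nodes
    if node.op == "call_module" and node.target.startswith("layers.")
}
```

Here `subgraph_inputs` maps each target (`"layers.0"`, `"layers.1"`) to the names of the nodes producing its inputs, which is exactly the boundary information a `Subgraph` needs to be compiled into a standalone function.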
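A minimal sketch of the cache's contract (the real `IntermediatesCache` also handles dataclasses such as `BaseModelOutputWithPast`; this toy version covers tensors, tuples, and dicts): values are moved to an offload device when stored and moved back when fetched.

```python
import torch


class TinyIntermediatesCache:
    """Toy sketch of an activation cache with offloading.

    Stored values are recursively moved to the offload device (CPU by
    default); fetched values are moved back to the requested device.
    """

    def __init__(self, offload_device="cpu"):
        self.offload_device = torch.device(offload_device)
        self._batches = {}

    def _move(self, value, device):
        # Recursively relocate tensors inside common container types
        if isinstance(value, torch.Tensor):
            return value.to(device)
        if isinstance(value, tuple):
            return tuple(self._move(v, device) for v in value)
        if isinstance(value, dict):
            return {k: self._move(v, device) for k, v in value.items()}
        return value  # non-tensor leaves pass through unchanged

    def update(self, batch_index, values):
        self._batches[batch_index] = self._move(values, self.offload_device)

    def fetch(self, batch_index, device="cpu"):
        return self._move(self._batches[batch_index], torch.device(device))
```

The recursive `_move` helper is the design point: by dispatching on container type, the same cache can hold whatever structure a subgraph happens to return.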
### Tracing

* `# vllm-project: no copyright` was added in similar style to `text_generation.py`

## Future Work / Follow-ups
* `TraceableMistralForCausalLM` #1052

## Winogrande Evaluations
```shell
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1
```
## MMMU Evaluations
Credit to @shubhra

## Testing