Kylesayrs/gptq batched updates #879

Closed · wants to merge 66 commits

Changes from all commits

Commits (66)
98b284b  WIP (kylesayrs, Oct 16, 2024)
e3a98cc  WIP: begin quantize_weight (kylesayrs, Oct 16, 2024)
bc9b3bc  WIP (kylesayrs, Oct 16, 2024)
b77c7bf  WIP (kylesayrs, Oct 16, 2024)
7be5aed  wip (kylesayrs, Oct 16, 2024)
e01094f  compilable (kylesayrs, Oct 16, 2024)
ad9f5a8  compilable (kylesayrs, Oct 16, 2024)
e4ee0af  wip (kylesayrs, Oct 16, 2024)
d9ba539  add example (kylesayrs, Oct 16, 2024)
83a5762  wip (kylesayrs, Oct 16, 2024)
7f49ab4  runnable (kylesayrs, Oct 16, 2024)
ac0d926  batching (kylesayrs, Oct 21, 2024)
6304973  calibration forward context (kylesayrs, Oct 21, 2024)
868a480  fix stuff (kylesayrs, Oct 21, 2024)
86c8a06  wip (kylesayrs, Oct 21, 2024)
1305173  use hooks list (kylesayrs, Oct 21, 2024)
e6adc5a  layer compressor (kylesayrs, Oct 22, 2024)
f65f832  style (kylesayrs, Oct 22, 2024)
1e22569  use layer compressor (kylesayrs, Oct 22, 2024)
9324695  replicate dtypes (kylesayrs, Oct 22, 2024)
eef4fb6  write weight changes (kylesayrs, Oct 22, 2024)
485813a  revert example (kylesayrs, Oct 22, 2024)
6006155  organization (kylesayrs, Oct 22, 2024)
c10d2ee  add create_single_batch_dataloader (kylesayrs, Oct 22, 2024)
6371193  add back empty_cache until I can justify removing it (kylesayrs, Oct 22, 2024)
92315a5  better type hinting, faster mask applying (kylesayrs, Oct 22, 2024)
8903fbf  Merge remote-tracking branch 'origin' into kylesayrs/gptq-hooks (kylesayrs, Oct 22, 2024)
8a25c68  remove breakpoint (kylesayrs, Oct 22, 2024)
6cd0d6c  apply style, add true_sequential docstring (kylesayrs, Oct 22, 2024)
0e0c586  update docstring (kylesayrs, Oct 22, 2024)
d23aabb  use private attrs (kylesayrs, Oct 22, 2024)
355074b  more docstring (kylesayrs, Oct 23, 2024)
bf2184d  docstrings (kylesayrs, Oct 23, 2024)
0b418c7  docstrings (kylesayrs, Oct 23, 2024)
56cceea  docstrings (kylesayrs, Oct 23, 2024)
7c7e3bc  move hooksmixin to separate file (kylesayrs, Oct 23, 2024)
2d52183  docstrings (kylesayrs, Oct 23, 2024)
d6ff46a  Merge branch 'main' into kylesayrs/gptq-hooks (kylesayrs, Oct 23, 2024)
9081f12  fix docstring, better arguments grouping (kylesayrs, Oct 23, 2024)
96e9496  use LayerCompressorMixin (kylesayrs, Oct 24, 2024)
7fbf8b1  docstrings (kylesayrs, Oct 24, 2024)
3d3af2a  add back hessian hook to support bs1 (kylesayrs, Oct 24, 2024)
b3021ab  wip (kylesayrs, Oct 25, 2024)
8508b63  accumulate (kylesayrs, Oct 25, 2024)
3ff271d  virtualize batches for layers (kylesayrs, Oct 25, 2024)
d6c6dc3  maybe works, but padding is wrong (kylesayrs, Oct 25, 2024)
400fa08  WIP (kylesayrs, Oct 29, 2024)
03515f0  remove hessian (kylesayrs, Oct 29, 2024)
6e37f64  allocated original weight (kylesayrs, Oct 29, 2024)
09dae14  proper clone (kylesayrs, Oct 29, 2024)
944601e  remove breakpoint (kylesayrs, Oct 29, 2024)
adbcee8  naive_update option (kylesayrs, Oct 29, 2024)
f4acab2  remove true sequential (kylesayrs, Oct 29, 2024)
151f566  allow update_offload_parameter to not require data (kylesayrs, Oct 29, 2024)
76ebc86  bugfix (kylesayrs, Oct 29, 2024)
3480d6b  ba (kylesayrs, Oct 29, 2024)
7c55fc5  delete parameter (kylesayrs, Oct 29, 2024)
0a8004b  sensible generations for small calibration size (kylesayrs, Oct 30, 2024)
d234b32  remove unnecessary variables (kylesayrs, Oct 30, 2024)
eeb5c83  remove non-naive updating stuff to focus on naive updating (kylesayrs, Oct 30, 2024)
99a2d97  Merge remote-tracking branch 'origin' into kylesayrs/gptq-steps (kylesayrs, Nov 1, 2024)
c7c8d04  use observer to calculate qparams (kylesayrs, Nov 1, 2024)
f137347  complete, more or less (kylesayrs, Nov 5, 2024)
593d4fd  support vision datasets (kylesayrs, Nov 5, 2024)
0bdf98a  use pixtral (kylesayrs, Nov 8, 2024)
9f43b5d  better stopping (kylesayrs, Nov 8, 2024)
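
Several of the commits above ("batching", "accumulate", "add back hessian hook to support bs1") revolve around accumulating the GPTQ Hessian across calibration batches rather than building it in a single pass. As context only, here is a minimal sketch of a running batched Hessian update, assuming (batch, seq_len, hidden) layer inputs; the function name and shapes are illustrative assumptions and are not taken from this PR's code:

import torch

def accumulate_hessian(H: torch.Tensor, num_tokens: int, inp: torch.Tensor):
    # Flatten (batch, seq_len, hidden) activations into rows of tokens.
    inp = inp.reshape(-1, inp.shape[-1]).to(torch.float32)
    batch_tokens = inp.shape[0]
    total = num_tokens + batch_tokens
    # Rescale the previous estimate, then add this batch's contribution,
    # so H stays a running average of 2 * x x^T over all tokens seen so far.
    H *= num_tokens / total
    inp = inp * (2.0 / total) ** 0.5
    H += inp.t() @ inp
    return H, total

# Usage: start from a zero matrix sized to the layer's input features.
hidden = 4096
H = torch.zeros(hidden, hidden)
seen = 0
for _ in range(3):
    x = torch.randn(2, 128, hidden)  # stand-in calibration activations
    H, seen = accumulate_hessian(H, seen, x)

Accumulating this way gives the same Hessian as one full-batch pass while keeping only a single batch of activations resident at a time.
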
6 changes: 3 additions & 3 deletions examples/quantization_w4a16/llama3_example.py
@@ -9,7 +9,7 @@

 model = SparseAutoModelForCausalLM.from_pretrained(
     MODEL_ID,
-    device_map="auto",
+    device_map="cuda:0",
     torch_dtype="auto",
 )
 tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
@@ -20,7 +20,7 @@

 # Select number of samples. 512 samples is a good place to start.
 # Increasing the number of samples can improve accuracy.
-NUM_CALIBRATION_SAMPLES = 512
+NUM_CALIBRATION_SAMPLES = 160 #2048
 MAX_SEQUENCE_LENGTH = 2048

 # Load dataset and preprocess.
@@ -55,7 +55,7 @@ def tokenize(sample):

 # Configure the quantization algorithm to run.
 # * quantize the weights to 4 bit with GPTQ with a group size 128
-recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
+recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], batch_size=-1, dampening_frac=0.5)

 # Apply algorithms.
 oneshot(
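
The batch_size=-1 argument added to the recipe above, read together with the "add create_single_batch_dataloader" commit, suggests a mode where the entire calibration set is collated into a single batch. A rough sketch under that assumption (the helper body below is hypothetical, not the PR's implementation):

from torch.utils.data import DataLoader

def create_single_batch_dataloader(dataset, collate_fn=None):
    # One batch containing the whole calibration set; in practice a collate_fn
    # is still needed to pad variable-length tokenized samples.
    return DataLoader(dataset, batch_size=len(dataset), collate_fn=collate_fn)
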
83 changes: 83 additions & 0 deletions examples/quantization_w4a16/vision2_example.py
@@ -0,0 +1,83 @@
from datasets import load_dataset
from transformers import AutoProcessor, MllamaForConditionalGeneration

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="cuda:0",
    torch_dtype="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 160 #2048
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": processor.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return processor(
        None,
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# * quantize the weights to 4 bit with GPTQ with a group size 128
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], batch_size=1, dampening_frac=0.5)

# Apply algorithms.
oneshot(
    model=model,
    tokenizer=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = processor("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(processor.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
88 changes: 88 additions & 0 deletions examples/quantization_w4a16/vision_example.py
@@ -0,0 +1,88 @@
from datasets import load_dataset
from transformers import AutoProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="cuda:0",
    torch_dtype="auto",
)
breakpoint()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "lmms-lab/flickr30k"
DATASET_SPLIT = "test[:165]"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 165 #2048
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    messages = [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": "What does the image show?"}
                ]
            }
        ],
    ]
    return {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
        ),
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return processor(sample["image"], sample["text"], add_special_tokens=False, return_tensors="pt", max_length=MAX_SEQUENCE_LENGTH)


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# * quantize the weights to 4 bit with GPTQ with a group size 128
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], batch_size=-1, dampening_frac=0.5)

# Apply algorithms.
oneshot(
    model=model,
    tokenizer=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = processor("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(processor.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
92 changes: 92 additions & 0 deletions shubhra.py
@@ -0,0 +1,92 @@
from datasets import load_dataset
from transformers import AutoProcessor, MllamaForConditionalGeneration, LlavaForConditionalGeneration

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot, wrap_hf_model_class
import os

# Load model.
#model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_id = "mgoin/pixtral-12b"
model_class = wrap_hf_model_class(LlavaForConditionalGeneration)
model = model_class.from_pretrained(model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True, _attn_implementation="eager",)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

print("Loading dataset")
DATASET_ID = "lmms-lab/flickr30k"
DATASET_SPLIT = "test[:128]"

NUM_CALIBRATION_SAMPLES = 1#128
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

print("Preprocessing samples")
def preprocess(example):
    messages = [
        [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": "What does the image show?"}
                ]
            }
        ],
    ]
    return {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=True,
        ),
    }

ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return processor(sample["image"], sample["text"], add_special_tokens=False, return_tensors="pt")


ds = ds.map(tokenize, remove_columns=ds.column_names)
print(ds)

print("Setting up quantization params")
# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per channel via ptq
# * quantize the activations to fp8 with dynamic per token
#ignore=["re:.*lm_head", "re:model.vision_embed_tokens.*"]
#ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*", "re:language_model.*cross_attn.*"],
ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"]

recipe = [
    # SmoothQuantModifier(smoothing_strength=0.8, ignore=ignore),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=ignore),
]

save_name = model_id.split("/")[1] + "-W8A8"
save_path = os.path.join("./my_test/", save_name)
print("Starting quantization")
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    output_dir=save_path,
)

#processor.save_pretrained(save_path)

# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")