Offline QuaRot Implementation (#1556)
## Describe your changes
- The previous QuaRot pass is replaced with our own implementation, which
only performs the offline weight rotation (see the sketch below).
- The online Hadamard rotation parts are not relevant to us since they
involve reimplementing the model architectures or updating them
dynamically to add the input/KV rotation functions. Moreover, these are
not compatible with ONNX export.
- This pass does not do any quantization. The rotated output model
should be subsequently quantized using GPTQ and/or QDQ passes.
- All usage of QuaRot in the examples and CLI is removed. New
examples and CLI options will be added once E2E validation of the Rotate ->
GPTQ -> QDQ workflow is complete.
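
For context, a minimal sketch of the offline weight-rotation idea (illustrative only, not Olive's implementation; the helper names are assumptions): an orthogonal matrix Q is folded into adjacent linear layers so the model computes the same function while the stored weights become easier to quantize.

```python
# Illustrative sketch, not the Olive pass: fold an orthogonal rotation Q into a
# producer/consumer pair of nn.Linear layers. The composition is unchanged
# (y @ Q followed by Q^T absorbed into the next weight), so no runtime hooks
# are needed and the rotated model remains ONNX-exportable.
import torch
from torch import nn


def random_orthogonal(n: int) -> torch.Tensor:
    # QR of a Gaussian matrix yields an orthogonal Q; a randomized Hadamard
    # matrix is another common choice.
    q, _ = torch.linalg.qr(torch.randn(n, n, dtype=torch.float64))
    return q


@torch.no_grad()
def rotate_linear_pair(producer: nn.Linear, consumer: nn.Linear, q: torch.Tensor) -> None:
    # Producer computes y = x @ W_p^T + b_p; rotating its output by Q means
    # W_p -> Q^T @ W_p and b_p -> b_p @ Q.
    w_dtype = producer.weight.dtype
    producer.weight.copy_((q.T @ producer.weight.double()).to(w_dtype))
    if producer.bias is not None:
        producer.bias.copy_((producer.bias.double() @ q).to(producer.bias.dtype))
    # Consumer absorbs the inverse rotation on its input: W_c -> W_c @ Q.
    consumer.weight.copy_((consumer.weight.double() @ q).to(consumer.weight.dtype))
```

Rotating every such pair across the transformer is what the pass automates offline; the online (runtime) rotations described in the paper are intentionally skipped.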

## Checklist before requesting a review
- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Lint and apply fixes to your code by running `lintrunner -a`
- [ ] Is this a user-facing change? If yes, give a description of this
change to be included in the release notes.
- [ ] Is this PR including examples changes? If yes, please remember to
update [example
documentation](https://github.com/microsoft/Olive/blob/main/docs/source/examples.md)
in a follow-up PR.

## (Optional) Issue link
jambayk authored Jan 22, 2025
1 parent da3b967 commit ab72d69
Showing 17 changed files with 4,556 additions and 403 deletions.
6 changes: 6 additions & 0 deletions .lintrunner.toml
@@ -66,6 +66,9 @@ code = 'BLACK-ISORT'
include_patterns = [
'**/*.py'
]
exclude_patterns = [
'**/hadamard_utils.py'
]
command = [
'python',
'-m',
@@ -91,6 +94,9 @@ code = 'PYLINT'
include_patterns = [
'**/*.py'
]
exclude_patterns = [
'**/hadamard_utils.py'
]
command = [
'python',
'-m',
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -13,6 +13,7 @@ repos:
hooks:
- id: black
name: Format code
exclude: "hadamard_utils.py"
- repo: https://github.com/pycqa/isort
rev: 5.11.5
hooks:
3 changes: 1 addition & 2 deletions docs/source/how-to/cli/cli-quantize.md
@@ -16,7 +16,6 @@ Some methods require a GPU and/or a calibration dataset.
| ------ | ------------ | ------------ | ------------------ | ------------------ | ------------------- |
| AWQ | Activation-aware Weight Quantization (AWQ) creates 4-bit quantized models and it speeds up models by 3x and reduces memory requirements by 3x compared to FP16. | ✔️ || PyTorch <br> HuggingFace | PyTorch |
| GPTQ | Generative Pre-trained Transformer Quantization (GPTQ) is a one-shot weight quantization method. You can quantize your favorite language model to 8, 4, 3 or even 2 bits. | ✔️ | ✔️ | PyTorch <br> HuggingFace | PyTorch |
| QuaRot | Quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. | ✔️ | ✔️ | HuggingFace | PyTorch |
| bnb4 | Is a MatMul with weight quantized with N bits (e.g., 2, 3, 4, 5, 6, 7). ||| ONNX | ONNX |
| ONNX Dynamic | Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically. ||| ONNX | ONNX |
| INC Dynamic | Intel® Neural Compressor model compression tool. ||| ONNX | ONNX |
@@ -43,7 +42,7 @@ olive quantize \

## Quantization with ONNX Optimizations

As articulated in [Supported quantization techniques](#supported-quantization-techniques), you may wish to take the PyTorch/Hugging Face output of AWQ/GPTQ/QuaRot quantization methods and convert into an optimized ONNX format so that you can inference using the ONNX runtime.
As articulated in [Supported quantization techniques](#supported-quantization-techniques), you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ quantization methods and convert it into an optimized ONNX format so that you can run inference with ONNX Runtime.

You can use Olive's automatic optimizer (`auto-opt`) to create an optimized ONNX model from a quantized model:
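
For illustration only, a command of this shape (the model path is a placeholder):

```bash
# Sketch only: point auto-opt at the output of `olive quantize`
olive auto-opt -m models/quantized_model
```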

@@ -38,19 +38,14 @@ Please refer to [AutoAWQQuantizer](awq_quantizer) for more details about the pas
```

## QuaRot
`QuaRot` is a quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. It is based on the [QuaRot paper](https://arxiv.org/abs/2305.14314).
`QuaRot` is a technique that rotates the weights of a model to make them more amenable to quantization. It is based on the [QuaRot paper](https://arxiv.org/abs/2404.00456) but only performs the offline weight rotation. It can be followed by a pass such as GPTQ to quantize the rotated model weights (a combined sketch follows the example configuration below).

This pass only supports HuggingFace transformer PyTorch models. Please refer to [QuaRot](quarot) for more details on the types of transformers models supported.

### Example Configuration
```json
{
"type": "QuaRot",
"w_rtn": true,
"rotate": true,
"w_bits": 4,
"a_bits": 4,
"k_bits": 4,
"v_bits": 4
"rotate_mode": "hadamard"
}
```
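
A hypothetical sketch of chaining QuaRot with GPTQ in a workflow's `passes` section, as referenced above; the GPTQ options and data config name are illustrative only and assume a matching data config is defined elsewhere in the workflow:

```json
{
    "passes": {
        "rotate": { "type": "QuaRot", "rotate_mode": "hadamard" },
        "gptq": { "type": "GptqQuantizer", "bits": 4, "data_config": "default_data_config" }
    }
}
```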
13 changes: 1 addition & 12 deletions examples/phi3/README.md
@@ -41,7 +41,7 @@ You can use Olive CLI command to export, fine-tune, and optimize the model for a
olive auto-opt -m microsoft/Phi-3-mini-4k-instruct --precision int8
# To quantize the model
olive quantize -m microsoft/Phi-3-mini-4k-instruct --trust_remote_code --precision fp16 --implementation quarot
olive quantize -m microsoft/Phi-3-mini-4k-instruct --implementation gptq
# To tune ONNX session params
olive tune-session-params -m microsoft/Phi-3-mini-4k-instruct --io_bind --enable_cuda_graph
@@ -94,17 +94,6 @@ olive run [--config CONFIGURATION_FILE]
olive run --config phi3_run_mobile_int4.json
```

We also introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end.
Specific details about the algorithm can be found in the linked [paper](https://arxiv.org/pdf/2404.00456).

## Prerequisites
[QuaRot](https://github.com/microsoft/TransformerCompression/tree/quarot-main)

To run the workflow,
```bash
python phi3.py --quarot
```

### Get access to fine-tuning dataset
Get access to the following resources on Hugging Face Hub:
- [nampdn-ai/tiny-codes](https://huggingface.co/nampdn-ai/tiny-codes)
19 changes: 0 additions & 19 deletions examples/phi3/phi3.py
@@ -70,11 +70,6 @@ def get_args(raw_args):
)

quant_group = parser.add_mutually_exclusive_group()
quant_group.add_argument(
"--quarot",
action="store_true",
help="Run QuaRot on a Hugging Face PyTorch model",
)
quant_group.add_argument(
"--awq",
action="store_true",
@@ -159,9 +154,6 @@ def main(raw_args=None):

olive_run(run_config)

if args.quarot:
return

if args.inference:
if not args.chat_template:
args.chat_template = (
@@ -211,17 +203,6 @@ def generate_config(args):

config_prefix = "phi3_run_"

if args.quarot:
template_json = use_passes(template_json, "quarot")
template_json["systems"]["local_system"]["accelerators"] = [
{"device": "GPU", "execution_providers": ["CUDAExecutionProvider"]}
]
new_json_file = f"{config_prefix}quarot.json"
with open(new_json_file, "w") as f:
json.dump(template_json, f, indent=4)

return new_json_file

# use aml instance of model
if args.source == "AzureML":
template_json["input_model"]["model_path"] = AML_MODEL_Path
10 changes: 0 additions & 10 deletions examples/phi3/phi3_template.json
@@ -103,16 +103,6 @@
"execution_providers_list": [ "CUDAExecutionProvider" ],
"opt_level_list": [ 0, 1 ],
"execution_mode_list": [ 0, 1 ]
},
"quarot": {
"type": "QuaRot",
"w_rtn": true,
"rotate": true,
"w_bits": 4,
"a_bits": 4,
"k_bits": 4,
"v_bits": 4,
"calibration_data_config": "wikitext2_train"
}
},
"pass_flows": [ [ "<place_holder>" ] ],
40 changes: 4 additions & 36 deletions olive/cli/quantize.py
@@ -69,9 +69,6 @@ def register_subcommand(parser: ArgumentParser):
action="store_true",
help="Use QDQ encoding in ONNX model for the quantized nodes.",
)
sub_parser.add_argument(
"--quarot_rotate", action="store_true", help="Apply QuaRot/Hadamard rotation to the model."
)

add_dataset_options(sub_parser, required=False, include_train=False, include_eval=False)
add_remote_options(sub_parser)
@@ -96,9 +93,7 @@ def _get_run_config(self, tempdir: str) -> Dict[str, Any]:
if not self.args.precision:
self.args.precision = ALGORITHMS[self.args.algorithm][defaults_key]["precision"]

if self.args.algorithm in ["gptq", "rtn"] and self.args.implementation == "quarot":
self.args.precision = "int16"
elif self.args.algorithm == "rtn" and self.args.precision == "nf4":
if self.args.algorithm == "rtn" and self.args.precision == "nf4":
self.args.implementation = "bnb4"

if self.args.enable_qdq_encoding and self.args.implementation != "matmul4":
@@ -130,10 +125,6 @@ def _get_run_config(self, tempdir: str) -> Dict[str, Any]:
(("passes", "awq", "w_bit"), precision),
(("passes", "gptq", "bits"), precision),
(("passes", "bnb4", "quant_type"), precision),
(("passes", "quarot", "w_bits"), precision),
(("passes", "quarot", "rotate"), self.args.quarot_rotate),
(("passes", "quarot", "w_rtn"), self.args.algorithm == "rtn"),
(("passes", "quarot", "w_gptq"), self.args.algorithm == "gptq"),
(("passes", "nvmo", "precision"), precision),
(("passes", "nvmo", "algorithm"), self.args.algorithm.upper()),
(("passes", "onnx_dynamic", "weight_type"), precision),
@@ -154,9 +145,6 @@ def run(self):
if ("gptq" in self.args.algorithm) and (not self.args.data_name):
raise ValueError("data_name is required to use gptq.")

if ("quarot" in self.args.algorithm) and (not self.args.data_name) and (self.args.quarot_strategy == "gptq"):
raise ValueError("data_name is required to quantize weights using gptq.")

with tempfile.TemporaryDirectory(prefix="olive-cli-tmp-", dir=self.args.output_path) as tempdir:
run_config = self._get_run_config(tempdir)
olive_run(run_config)
@@ -184,14 +172,6 @@ def run(self):
# Pytorch algorithms
"awq": {"type": "AutoAWQQuantizer", "w_bit": 4},
"gptq": {"type": "GptqQuantizer", "bits": 4, "data_config": "default_data_config"},
"quarot": {
"type": "QuaRot",
"w_bits": 16,
"w_rtn": False,
"w_gptq": False,
"rotate": False,
"calibration_data_config": "default_data_config",
},
# Onnx algorithms
"bnb4": {"type": "OnnxBnb4Quantization", "quant_type": "nf4"},
"matmul4": {"type": "OnnxMatMul4Quantizer", "accuracy_level": 4},
@@ -219,14 +199,14 @@
"description": "(HfModel, OnnxModel) WOQ with AWQ.",
},
"gptq": {
"implementations": ["gptq", "quarot", "matmul4", "inc_static", "inc_dynamic"],
"implementations": ["gptq", "matmul4", "inc_static", "inc_dynamic"],
"hf_model_defaults": {"implementation": "gptq", "precision": "int4"},
"onnx_model_defaults": {"implementation": "matmul4", "precision": "int4"},
"description": "(HfModel, OnnxModel) WOQ with GPTQ.",
},
"rtn": {
"implementations": ["quarot", "bnb4", "matmul4"],
"hf_model_defaults": {"implementation": "quarot", "precision": "int16"},
"implementations": ["bnb4", "matmul4"],
"hf_model_defaults": {"implementation": None, "precision": None},
"onnx_model_defaults": {"implementation": "onnx_static", "precision": "int8"},
"description": "(HfModel, OnnxModel) WOQ with RTN.",
},
@@ -275,18 +255,6 @@
"uint16": 16,
},
},
"quarot": {
"name": "QuaRot/Hadamard rotation",
"supported_precisions": [],
"precision_mapping": {
"int4": 4,
"int8": 8,
"int16": 16,
"uint4": 4,
"uint8": 8,
"uint16": 16,
},
},
"bnb4": {
"name": "Bits-n-Bytes",
"supported_precisions": ["fp4", "nf4"],
10 changes: 10 additions & 0 deletions olive/common/utils.py
@@ -3,6 +3,7 @@
# Licensed under the MIT License.
# --------------------------------------------------------------------------
import codecs
import gc
import hashlib
import inspect
import io
@@ -659,3 +660,12 @@ def load_weights(path: Union[str, Path], file_format: Optional[WeightsFileFormat
def unescaped_str(arg_str):
"""Decode strings without escaping."""
return codecs.decode(arg_str, "unicode_escape")


def cleanup_memory():
"""Cleanup memory by running garbage collection and emptying CUDA cache."""
import torch

gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
13 changes: 6 additions & 7 deletions olive/olive_config.json
@@ -285,6 +285,12 @@
"supported_precisions": [ "*" ],
"module_dependencies": [ "pytorch-lightning" ]
},
"QuaRot": {
"module_path": "olive.passes.pytorch.rotate.QuaRot",
"supported_providers": [ "*" ],
"supported_accelerators": [ "*" ],
"supported_precisions": [ "*" ]
},
"SparseGPT": {
"module_path": "olive.passes.pytorch.sparsegpt.SparseGPT",
"supported_providers": [ "*" ],
@@ -297,13 +303,6 @@
"supported_accelerators": [ "*" ],
"supported_precisions": [ "*" ]
},
"QuaRot": {
"module_path": "olive.passes.pytorch.quarot.QuaRot",
"supported_providers": [ "CPUExecutionProvider" ],
"supported_accelerators": [ "cpu" ],
"supported_precisions": [ "int4", "int8", "int16", "uint4", "uint8", "uint16" ],
"extra_dependencies": [ "flash-attn" ]
},
"TorchTRTConversion": {
"module_path": "olive.passes.pytorch.torch_trt_conversion.TorchTRTConversion",
"supported_providers": [ "*" ],