Update docs from ab72d69

olivedevteam committed Jan 22, 2025
1 parent fb7eb56 commit 36d7f60
Showing 9 changed files with 139 additions and 541 deletions.
3 changes: 1 addition & 2 deletions _sources/how-to/cli/cli-quantize.md.txt
@@ -16,7 +16,6 @@ Some methods require a GPU and/or a calibration dataset.
| ------ | ------------ | ------------ | ------------------ | ------------------ | ------------------- |
| AWQ | Activation-aware Weight Quantization (AWQ) creates 4-bit quantized models, speeding up inference by 3x and reducing memory requirements by 3x compared to FP16. | ✔️ | ❌ | PyTorch <br> HuggingFace | PyTorch |
| GPTQ | Generative Pre-trained Transformer Quantization (GPTQ) is a one-shot weight quantization method. You can quantize your favorite language model to 8, 4, 3 or even 2 bits. | ✔️ | ✔️ | PyTorch <br> HuggingFace | PyTorch |
- | QuaRot | Quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. | ✔️ | ✔️ | HuggingFace | PyTorch |
| bnb4 | A MatMul operator whose weights are quantized to N bits (e.g., 2, 3, 4, 5, 6, 7). | ❌ | ❌ | ONNX | ONNX |
| ONNX Dynamic | Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically. | ❌ | ❌ | ONNX | ONNX |
| INC Dynamic | Dynamic quantization using the Intel® Neural Compressor model compression tool. | ❌ | ❌ | ONNX | ONNX |
@@ -43,7 +42,7 @@ olive quantize \

## Quantization with ONNX Optimizations

- As articulated in [Supported quantization techniques](#supported-quantization-techniques), you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ/QuaRot quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.
+ As articulated in [Supported quantization techniques](#supported-quantization-techniques), you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.

You can use Olive's automatic optimizer (`auto-opt`) to create an optimized ONNX model from a quantized model:
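A minimal sketch of this two-step flow follows; the `-m`, `--device`, and `--provider` flags plus the model name and output paths are illustrative assumptions and may differ across Olive releases (check `olive quantize --help` and `olive auto-opt --help`):

```bash
# Step 1: AWQ quantization (outputs a PyTorch model)
olive quantize \
    -m meta-llama/Llama-3.2-1B-Instruct \
    --algorithm awq \
    -o models/llama-awq

# Step 2: convert the quantized PyTorch model into an optimized ONNX model
olive auto-opt \
    -m models/llama-awq \
    -o models/llama-awq-onnx \
    --device cpu \
    --provider CPUExecutionProvider
```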

@@ -38,19 +38,14 @@ Please refer to [AutoAWQQuantizer](awq_quantizer) for more details about the pass
```

## QuaRot
- `QuaRot` is a quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. It is based on the [QuaRot paper](https://arxiv.org/abs/2404.00456).
+ `QuaRot` is a technique that rotates the weights of a model to make them more conducive to quantization. It is based on the [QuaRot paper](https://arxiv.org/abs/2404.00456), but it only performs offline weight rotation. It can be followed by a pass such as GPTQ to quantize the rotated weights.

This pass only supports HuggingFace transformer PyTorch models. Please refer to [QuaRot](quarot) for more details on the supported transformer models.

### Example Configuration
```json
{
"type": "QuaRot",
"w_rtn": true,
"rotate": true,
"w_bits": 4,
"a_bits": 4,
"k_bits": 4,
"v_bits": 4
"rotate_mode": "hadamard"
}
```
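Since `QuaRot` only rotates the weights, it is typically paired with a weight-quantization pass in a workflow config. The sketch below assumes the GPTQ pass is registered under the type name `GptqQuantizer`; check the pass reference for the exact name and options:

```json
{
    "passes": {
        "rotate": { "type": "QuaRot", "rotate_mode": "hadamard" },
        "quantize": { "type": "GptqQuantizer" }
    }
}
```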
293 changes: 61 additions & 232 deletions genindex.html

Large diffs are not rendered by default.

17 changes: 5 additions & 12 deletions how-to/cli/cli-quantize.html
@@ -468,35 +468,28 @@ <h2>Supported quantization techniques<a class="headerlink" href="#supported-quan
<td><p>PyTorch <br> HuggingFace</p></td>
<td><p>PyTorch</p></td>
</tr>
<tr class="row-even"><td><p>QuaRot</p></td>
<td><p>Quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model.</p></td>
<td><p>✔️</p></td>
<td><p>✔️</p></td>
<td><p>HuggingFace</p></td>
<td><p>PyTorch</p></td>
</tr>
<tr class="row-odd"><td><p>bnb4</p></td>
<tr class="row-even"><td><p>bnb4</p></td>
<td><p>A MatMul operator whose weights are quantized to N bits (e.g., 2, 3, 4, 5, 6, 7).</p></td>
<td><p></p></td>
<td><p></p></td>
<td><p>ONNX</p></td>
<td><p>ONNX</p></td>
</tr>
<tr class="row-even"><td><p>ONNX Dynamic</p></td>
<tr class="row-odd"><td><p>ONNX Dynamic</p></td>
<td><p>Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically.</p></td>
<td><p></p></td>
<td><p></p></td>
<td><p>ONNX</p></td>
<td><p>ONNX</p></td>
</tr>
<tr class="row-odd"><td><p>INC Dynamic</p></td>
<tr class="row-even"><td><p>INC Dynamic</p></td>
<td><p>Dynamic quantization using the Intel® Neural Compressor model compression tool.</p></td>
<td><p></p></td>
<td><p></p></td>
<td><p>ONNX</p></td>
<td><p>ONNX</p></td>
</tr>
<tr class="row-even"><td><p>NVMO</p></td>
<tr class="row-odd"><td><p>NVMO</p></td>
<td><p>NVIDIA TensorRT Model Optimizer is a library comprising state-of-the-art model optimization techniques including quantization, sparsity, distillation, and pruning to compress models.</p></td>
<td><p></p></td>
<td><p></p></td>
@@ -527,7 +520,7 @@ <h2><svg version="1.1" width="1.0em" height="1.0em" class="sd-octicon sd-octicon
</section>
<section id="quantization-with-onnx-optimizations">
<h2>Quantization with ONNX Optimizations<a class="headerlink" href="#quantization-with-onnx-optimizations" title="Link to this heading">#</a></h2>
- <p>As articulated in <a class="reference internal" href="#supported-quantization-techniques"><span class="xref myst">Supported quantization techniques</span></a>, you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ/QuaRot quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.</p>
+ <p>As articulated in <a class="reference internal" href="#supported-quantization-techniques"><span class="xref myst">Supported quantization techniques</span></a>, you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.</p>
<p>You can use Olive’s automatic optimizer (<code class="docutils literal notranslate"><span class="pre">auto-opt</span></code>) to create an optimized ONNX model from a quantized model:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Step 1: AWQ (will output a PyTorch model)</span>
olive<span class="w"> </span>quantize<span class="w"> </span><span class="se">\</span>
9 changes: 2 additions & 7 deletions how-to/configure-workflows/pass/quantization-pytorch.html
@@ -473,18 +473,13 @@ <h3>Example Configuration<a class="headerlink" href="#id1" title="Link to this h
</section>
<section id="quarot">
<h2>QuaRot<a class="headerlink" href="#quarot" title="Link to this heading">#</a></h2>
<p><code class="docutils literal notranslate"><span class="pre">QuaRot</span></code> is a quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. It is based on the <a class="reference external" href="https://arxiv.org/abs/2305.14314">QuaRot paper</a>.</p>
<p><code class="docutils literal notranslate"><span class="pre">QuaRot</span></code> is a technique that rotates the weights of a model to make them more conducive to quantization. It is based on the <a class="reference external" href="https://arxiv.org/abs/2305.14314">QuaRot paper</a> but only performs offline weight rotation. Can be followed by a pass such as GPTQ to quantize the rotated model weights.</p>
<p>This pass only supports HuggingFace transformer PyTorch models. Please refer to <a class="reference internal" href="../../../reference/pass.html#quarot"><span class="std std-ref">QuaRot</span></a> for more details on the types of transformers models supported.</p>
<section id="id2">
<h3>Example Configuration<a class="headerlink" href="#id2" title="Link to this heading">#</a></h3>
<div class="highlight-json notranslate"><div class="highlight"><pre><span></span><span class="p">{</span>
<span class="w"> </span><span class="nt">&quot;type&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;QuaRot&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;w_rtn&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;rotate&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;w_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;a_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;k_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;v_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span>
<span class="w"> </span><span class="nt">&quot;rotate_mode&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;hadamard&quot;</span>
<span class="p">}</span>
</pre></div>
</div>
Binary file modified objects.inv
13 changes: 4 additions & 9 deletions reference/cli.html
@@ -773,10 +773,9 @@ <h1>Quantization<a class="headerlink" href="#quantization" title="Link to this h
<span class="p">[</span><span class="o">--</span><span class="n">is_generative_model</span> <span class="n">IS_GENERATIVE_MODEL</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="n">o</span> <span class="n">OUTPUT_PATH</span><span class="p">]</span> <span class="o">--</span><span class="n">algorithm</span> <span class="p">{</span><span class="n">awq</span><span class="p">,</span><span class="n">dynamic</span><span class="p">,</span><span class="n">gptq</span><span class="p">,</span><span class="n">hqq</span><span class="p">,</span><span class="n">rtn</span><span class="p">}</span>
<span class="p">[</span><span class="o">--</span><span class="n">precision</span> <span class="p">{</span><span class="n">int4</span><span class="p">,</span><span class="n">int8</span><span class="p">,</span><span class="n">int16</span><span class="p">,</span><span class="n">uint4</span><span class="p">,</span><span class="n">uint8</span><span class="p">,</span><span class="n">uint16</span><span class="p">,</span><span class="n">fp4</span><span class="p">,</span><span class="n">fp8</span><span class="p">,</span><span class="n">fp16</span><span class="p">,</span><span class="n">nf4</span><span class="p">}]</span>
<span class="p">[</span><span class="o">--</span><span class="n">implementation</span> <span class="p">{</span><span class="n">awq</span><span class="p">,</span><span class="n">bnb4</span><span class="p">,</span><span class="n">gptq</span><span class="p">,</span><span class="n">inc_dynamic</span><span class="p">,</span><span class="n">matmul4</span><span class="p">,</span><span class="n">mnb_to_qdq</span><span class="p">,</span><span class="n">nvmo</span><span class="p">,</span><span class="n">onnx_dynamic</span><span class="p">,</span><span class="n">quarot</span><span class="p">}]</span>
<span class="p">[</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">qdq</span><span class="o">-</span><span class="n">encoding</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">quarot_rotate</span><span class="p">]</span> <span class="p">[</span><span class="o">-</span><span class="n">d</span> <span class="n">DATA_NAME</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">subset</span> <span class="n">SUBSET</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">split</span> <span class="n">SPLIT</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">data_files</span> <span class="n">DATA_FILES</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">implementation</span> <span class="p">{</span><span class="n">awq</span><span class="p">,</span><span class="n">bnb4</span><span class="p">,</span><span class="n">gptq</span><span class="p">,</span><span class="n">inc_dynamic</span><span class="p">,</span><span class="n">matmul4</span><span class="p">,</span><span class="n">mnb_to_qdq</span><span class="p">,</span><span class="n">nvmo</span><span class="p">,</span><span class="n">onnx_dynamic</span><span class="p">}]</span>
<span class="p">[</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">qdq</span><span class="o">-</span><span class="n">encoding</span><span class="p">]</span> <span class="p">[</span><span class="o">-</span><span class="n">d</span> <span class="n">DATA_NAME</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">subset</span> <span class="n">SUBSET</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">split</span> <span class="n">SPLIT</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">data_files</span> <span class="n">DATA_FILES</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">text_field</span> <span class="n">TEXT_FIELD</span> <span class="o">|</span> <span class="o">--</span><span class="n">text_template</span> <span class="n">TEXT_TEMPLATE</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">max_seq_len</span> <span class="n">MAX_SEQ_LEN</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">add_special_tokens</span> <span class="n">ADD_SPECIAL_TOKENS</span><span class="p">]</span>
@@ -830,17 +829,13 @@ <h2>Named Arguments<a class="headerlink" href="#named-arguments" title="Link to
<p>Default: “int4”</p>
</dd>
<dt><kbd>--implementation</kbd></dt>
- <dd><p>Possible choices: awq, bnb4, gptq, inc_dynamic, matmul4, mnb_to_qdq, nvmo, onnx_dynamic, quarot</p>
+ <dd><p>Possible choices: awq, bnb4, gptq, inc_dynamic, matmul4, mnb_to_qdq, nvmo, onnx_dynamic</p>
<p>The specific implementation of quantization algorithms to use.</p>
</dd>
<dt><kbd>--enable-qdq-encoding</kbd></dt>
<dd><p>Use QDQ encoding in ONNX model for the quantized nodes.</p>
<p>Default: False</p>
</dd>
- <dt><kbd>--quarot_rotate</kbd></dt>
- <dd><p>Apply QuaRot/Hadamard rotation to the model.</p>
- <p>Default: False</p>
- </dd>
<dt><kbd>-d, --data_name</kbd></dt>
<dd><p>The dataset name.</p>
</dd>