Update docs from ab72d69

olivedevteam committed Jan 22, 2025
1 parent fb7eb56 commit 36d7f60
Showing 9 changed files with 139 additions and 541 deletions.
3 changes: 1 addition & 2 deletions _sources/how-to/cli/cli-quantize.md.txt
@@ -16,7 +16,6 @@ Some methods require a GPU and/or a calibration dataset.
| ------ | ------------ | ------------ | ------------------ | ------------------ | ------------------- |
| AWQ | Activation-aware Weight Quantization (AWQ) creates 4-bit quantized models, speeding up inference by 3x and reducing memory requirements by 3x compared to FP16. | ✔️ | ❌ | PyTorch <br> HuggingFace | PyTorch |
| GPTQ | Generative Pre-trained Transformer Quantization (GPTQ) is a one-shot weight quantization method. You can quantize your favorite language model to 8, 4, 3 or even 2 bits. | ✔️ | ✔️ | PyTorch <br> HuggingFace | PyTorch |
- | QuaRot | Quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. | ✔️ | ✔️ | HuggingFace | PyTorch |
| bnb4 | A MatMul operator whose weights are quantized to N bits (e.g., 2, 3, 4, 5, 6, 7). | ❌ | ❌ | ONNX | ONNX |
| ONNX Dynamic | Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically. | ❌ | ❌ | ONNX | ONNX |
| INC Dynamic | Dynamic quantization using the Intel® Neural Compressor model compression tool. | ❌ | ❌ | ONNX | ONNX |
@@ -43,7 +42,7 @@ olive quantize \

## Quantization with ONNX Optimizations

- As articulated in [Supported quantization techniques](#supported-quantization-techniques), you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ/QuaRot quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.
+ As articulated in [Supported quantization techniques](#supported-quantization-techniques), you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.

You can use Olive's automatic optimizer (`auto-opt`) to create an optimized ONNX model from a quantized model:
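A minimal sketch of this two-step flow follows; the `-m`, `--device`, and `--provider` flags plus the model name and output paths are illustrative assumptions and may differ across Olive releases (check `olive quantize --help` and `olive auto-opt --help`):

```bash
# Step 1: AWQ quantization (outputs a PyTorch model)
olive quantize \
    -m meta-llama/Llama-3.2-1B-Instruct \
    --algorithm awq \
    -o models/llama-awq

# Step 2: convert the quantized PyTorch model into an optimized ONNX model
olive auto-opt \
    -m models/llama-awq \
    -o models/llama-awq-onnx \
    --device cpu \
    --provider CPUExecutionProvider
```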

@@ -38,19 +38,14 @@ Please refer to [AutoAWQQuantizer](awq_quantizer) for more details about the pass
```

## QuaRot
- `QuaRot` is a quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. It is based on the [QuaRot paper](https://arxiv.org/abs/2404.00456).
+ `QuaRot` is a technique that rotates the weights of a model to make them more conducive to quantization. It is based on the [QuaRot paper](https://arxiv.org/abs/2404.00456), but it only performs offline weight rotation. It can be followed by a pass such as GPTQ to quantize the rotated weights.

This pass only supports HuggingFace transformer PyTorch models. Please refer to [QuaRot](quarot) for more details on the supported transformer models.

### Example Configuration
```json
{
"type": "QuaRot",
"w_rtn": true,
"rotate": true,
"w_bits": 4,
"a_bits": 4,
"k_bits": 4,
"v_bits": 4
"rotate_mode": "hadamard"
}
```
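Since `QuaRot` only rotates the weights, it is typically paired with a weight-quantization pass in a workflow config. The sketch below assumes the GPTQ pass is registered under the type name `GptqQuantizer`; check the pass reference for the exact name and options:

```json
{
    "passes": {
        "rotate": { "type": "QuaRot", "rotate_mode": "hadamard" },
        "quantize": { "type": "GptqQuantizer" }
    }
}
```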
293 changes: 61 additions & 232 deletions genindex.html

Large diffs are not rendered by default.

17 changes: 5 additions & 12 deletions how-to/cli/cli-quantize.html
@@ -468,35 +468,28 @@ <h2>Supported quantization techniques<a class="headerlink" href="#supported-quan
<td><p>PyTorch <br> HuggingFace</p></td>
<td><p>PyTorch</p></td>
</tr>
<tr class="row-even"><td><p>QuaRot</p></td>
<td><p>Quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model.</p></td>
<td><p>✔️</p></td>
<td><p>✔️</p></td>
<td><p>HuggingFace</p></td>
<td><p>PyTorch</p></td>
</tr>
<tr class="row-odd"><td><p>bnb4</p></td>
<tr class="row-even"><td><p>bnb4</p></td>
<td><p>A MatMul operator whose weights are quantized to N bits (e.g., 2, 3, 4, 5, 6, 7).</p></td>
<td><p></p></td>
<td><p></p></td>
<td><p>ONNX</p></td>
<td><p>ONNX</p></td>
</tr>
<tr class="row-even"><td><p>ONNX Dynamic</p></td>
<tr class="row-odd"><td><p>ONNX Dynamic</p></td>
<td><p>Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically.</p></td>
<td><p></p></td>
<td><p></p></td>
<td><p>ONNX</p></td>
<td><p>ONNX</p></td>
</tr>
<tr class="row-odd"><td><p>INC Dynamic</p></td>
<tr class="row-even"><td><p>INC Dynamic</p></td>
<td><p>Dynamic quantization using the Intel® Neural Compressor model compression tool.</p></td>
<td><p></p></td>
<td><p></p></td>
<td><p>ONNX</p></td>
<td><p>ONNX</p></td>
</tr>
<tr class="row-even"><td><p>NVMO</p></td>
<tr class="row-odd"><td><p>NVMO</p></td>
<td><p>NVIDIA TensorRT Model Optimizer is a library comprising state-of-the-art model optimization techniques including quantization, sparsity, distillation, and pruning to compress models.</p></td>
<td><p></p></td>
<td><p></p></td>
@@ -527,7 +520,7 @@ <h2><svg version="1.1" width="1.0em" height="1.0em" class="sd-octicon sd-octicon
</section>
<section id="quantization-with-onnx-optimizations">
<h2>Quantization with ONNX Optimizations<a class="headerlink" href="#quantization-with-onnx-optimizations" title="Link to this heading">#</a></h2>
- <p>As articulated in <a class="reference internal" href="#supported-quantization-techniques"><span class="xref myst">Supported quantization techniques</span></a>, you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ/QuaRot quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.</p>
+ <p>As articulated in <a class="reference internal" href="#supported-quantization-techniques"><span class="xref myst">Supported quantization techniques</span></a>, you may wish to take the PyTorch/Hugging Face output of the AWQ/GPTQ quantization methods and convert it into an optimized ONNX format so that you can run inference using the ONNX runtime.</p>
<p>You can use Olive’s automatic optimizer (<code class="docutils literal notranslate"><span class="pre">auto-opt</span></code>) to create an optimized ONNX model from a quantized model:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Step 1: AWQ (will output a PyTorch model)</span>
olive<span class="w"> </span>quantize<span class="w"> </span><span class="se">\</span>
9 changes: 2 additions & 7 deletions how-to/configure-workflows/pass/quantization-pytorch.html
@@ -473,18 +473,13 @@ <h3>Example Configuration<a class="headerlink" href="#id1" title="Link to this h
</section>
<section id="quarot">
<h2>QuaRot<a class="headerlink" href="#quarot" title="Link to this heading">#</a></h2>
<p><code class="docutils literal notranslate"><span class="pre">QuaRot</span></code> is a quantization technique that combines quantization and rotation to reduce the number of bits required to represent the weights of a model. It is based on the <a class="reference external" href="https://arxiv.org/abs/2305.14314">QuaRot paper</a>.</p>
<p><code class="docutils literal notranslate"><span class="pre">QuaRot</span></code> is a technique that rotates the weights of a model to make them more conducive to quantization. It is based on the <a class="reference external" href="https://arxiv.org/abs/2305.14314">QuaRot paper</a> but only performs offline weight rotation. Can be followed by a pass such as GPTQ to quantize the rotated model weights.</p>
<p>This pass only supports HuggingFace transformer PyTorch models. Please refer to <a class="reference internal" href="../../../reference/pass.html#quarot"><span class="std std-ref">QuaRot</span></a> for more details on the types of transformers models supported.</p>
<section id="id2">
<h3>Example Configuration<a class="headerlink" href="#id2" title="Link to this heading">#</a></h3>
<div class="highlight-json notranslate"><div class="highlight"><pre><span></span><span class="p">{</span>
<span class="w"> </span><span class="nt">&quot;type&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;QuaRot&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;w_rtn&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;rotate&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;w_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;a_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;k_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;v_bits&quot;</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span>
<span class="w"> </span><span class="nt">&quot;rotate_mode&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;hadamard&quot;</span>
<span class="p">}</span>
</pre></div>
</div>
Binary file modified objects.inv
13 changes: 4 additions & 9 deletions reference/cli.html
@@ -773,10 +773,9 @@ <h1>Quantization<a class="headerlink" href="#quantization" title="Link to this h
<span class="p">[</span><span class="o">--</span><span class="n">is_generative_model</span> <span class="n">IS_GENERATIVE_MODEL</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="n">o</span> <span class="n">OUTPUT_PATH</span><span class="p">]</span> <span class="o">--</span><span class="n">algorithm</span> <span class="p">{</span><span class="n">awq</span><span class="p">,</span><span class="n">dynamic</span><span class="p">,</span><span class="n">gptq</span><span class="p">,</span><span class="n">hqq</span><span class="p">,</span><span class="n">rtn</span><span class="p">}</span>
<span class="p">[</span><span class="o">--</span><span class="n">precision</span> <span class="p">{</span><span class="n">int4</span><span class="p">,</span><span class="n">int8</span><span class="p">,</span><span class="n">int16</span><span class="p">,</span><span class="n">uint4</span><span class="p">,</span><span class="n">uint8</span><span class="p">,</span><span class="n">uint16</span><span class="p">,</span><span class="n">fp4</span><span class="p">,</span><span class="n">fp8</span><span class="p">,</span><span class="n">fp16</span><span class="p">,</span><span class="n">nf4</span><span class="p">}]</span>
<span class="p">[</span><span class="o">--</span><span class="n">implementation</span> <span class="p">{</span><span class="n">awq</span><span class="p">,</span><span class="n">bnb4</span><span class="p">,</span><span class="n">gptq</span><span class="p">,</span><span class="n">inc_dynamic</span><span class="p">,</span><span class="n">matmul4</span><span class="p">,</span><span class="n">mnb_to_qdq</span><span class="p">,</span><span class="n">nvmo</span><span class="p">,</span><span class="n">onnx_dynamic</span><span class="p">,</span><span class="n">quarot</span><span class="p">}]</span>
<span class="p">[</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">qdq</span><span class="o">-</span><span class="n">encoding</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">quarot_rotate</span><span class="p">]</span> <span class="p">[</span><span class="o">-</span><span class="n">d</span> <span class="n">DATA_NAME</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">subset</span> <span class="n">SUBSET</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">split</span> <span class="n">SPLIT</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">data_files</span> <span class="n">DATA_FILES</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">implementation</span> <span class="p">{</span><span class="n">awq</span><span class="p">,</span><span class="n">bnb4</span><span class="p">,</span><span class="n">gptq</span><span class="p">,</span><span class="n">inc_dynamic</span><span class="p">,</span><span class="n">matmul4</span><span class="p">,</span><span class="n">mnb_to_qdq</span><span class="p">,</span><span class="n">nvmo</span><span class="p">,</span><span class="n">onnx_dynamic</span><span class="p">}]</span>
<span class="p">[</span><span class="o">--</span><span class="n">enable</span><span class="o">-</span><span class="n">qdq</span><span class="o">-</span><span class="n">encoding</span><span class="p">]</span> <span class="p">[</span><span class="o">-</span><span class="n">d</span> <span class="n">DATA_NAME</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">subset</span> <span class="n">SUBSET</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">split</span> <span class="n">SPLIT</span><span class="p">]</span> <span class="p">[</span><span class="o">--</span><span class="n">data_files</span> <span class="n">DATA_FILES</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">text_field</span> <span class="n">TEXT_FIELD</span> <span class="o">|</span> <span class="o">--</span><span class="n">text_template</span> <span class="n">TEXT_TEMPLATE</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">max_seq_len</span> <span class="n">MAX_SEQ_LEN</span><span class="p">]</span>
<span class="p">[</span><span class="o">--</span><span class="n">add_special_tokens</span> <span class="n">ADD_SPECIAL_TOKENS</span><span class="p">]</span>
@@ -830,17 +829,13 @@ <h2>Named Arguments<a class="headerlink" href="#named-arguments" title="Link to
<p>Default: “int4”</p>
</dd>
<dt><kbd>--implementation</kbd></dt>
- <dd><p>Possible choices: awq, bnb4, gptq, inc_dynamic, matmul4, mnb_to_qdq, nvmo, onnx_dynamic, quarot</p>
+ <dd><p>Possible choices: awq, bnb4, gptq, inc_dynamic, matmul4, mnb_to_qdq, nvmo, onnx_dynamic</p>
<p>The specific implementation of quantization algorithms to use.</p>
</dd>
<dt><kbd>--enable-qdq-encoding</kbd></dt>
<dd><p>Use QDQ encoding in ONNX model for the quantized nodes.</p>
<p>Default: False</p>
</dd>
- <dt><kbd>--quarot_rotate</kbd></dt>
- <dd><p>Apply QuaRot/Hadamard rotation to the model.</p>
- <p>Default: False</p>
- </dd>
<dt><kbd>-d, --data_name</kbd></dt>
<dd><p>The dataset name.</p>
</dd>