
Support for Quark quantization. Update lemonade getting started. #290

Merged · 6 commits · Feb 10, 2025
52 changes: 52 additions & 0 deletions .github/workflows/test_quark.yml
@@ -0,0 +1,52 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Test Lemonade with Quark Quantization

on:
  push:
    branches: ["main"]
  pull_request:
    branches: ["main"]

permissions:
  contents: read

jobs:
  make-quark-lemonade:
    env:
      LEMONADE_CI_MODE: "True"
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Miniconda with 64-bit Python
        uses: conda-incubator/setup-miniconda@v2
        with:
          miniconda-version: "latest"
          activate-environment: lemon
          python-version: "3.10"
          run-post: "false"
      - name: Install dependencies
        shell: bash -el {0}
        run: |
          python -m pip install --upgrade pip
          conda install pylint
          python -m pip check
          pip install -e .[llm-oga-cpu]
          lemonade-install --quark 0.6.0
      - name: Lint with Black
        uses: psf/black@stable
        with:
          options: "--check --verbose"
          src: "./src"
      - name: Lint with PyLint
        shell: bash -el {0}
        run: |
          pylint src/lemonade/tools/quark --rcfile .pylintrc --disable E0401
      - name: Run lemonade tests
        shell: bash -el {0}
        env:
          HF_TOKEN: "${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}" # Required by OGA model_builder in OGA 0.4.0 but not future versions
        run: |
          python test/lemonade/quark_api.py
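
For local reproduction, the core CI steps above boil down to a handful of commands. A minimal sketch, assuming a fresh conda environment on Windows and that you run from the repo root:

```bash
# Approximate local equivalent of the CI job above
conda create -n lemon python=3.10
conda activate lemon
pip install -e .[llm-oga-cpu]
lemonade-install --quark 0.6.0
python test/lemonade/quark_api.py
```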

219 changes: 152 additions & 67 deletions docs/lemonade/getting_started.md
@@ -1,54 +1,125 @@
# Lemonade

Welcome to the project page for `lemonade`, the Turnkey LLM Aide!
Contents:

1. [Install](#install)
1. [CLI Commands](#cli-commands)
- [Syntax](#syntax)
- [Chatting](#chatting)
- [Accuracy](#accuracy)
- [Benchmarking](#benchmarking)
- [Memory Usage](#memory-usage)
- [Serving](#serving)
1. [API Overview](#api)
1. [Code Organization](#code-organization)
1. [Contributing](#contributing)

# Install
You can quickly get started with `lemonade` by installing the `turnkeyml` [PyPI package](#from-pypi) with the appropriate extras for your backend, or you can [install from source](#from-source-code) by cloning and installing this repository.

## From PyPI

To install `lemonade` from PyPI:

1. Create and activate a [miniconda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe) environment.
```bash
conda create -n lemon python=3.10
conda activate lemon
```

1. Install lemonade for your backend of choice:
- [OnnxRuntime GenAI with CPU backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
```bash
pip install turnkeyml[llm-oga-cpu]
```
- [OnnxRuntime GenAI with Integrated GPU (iGPU, DirectML) backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
> Note: Requires Windows and a DirectML-compatible iGPU.
```bash
pip install turnkeyml[llm-oga-igpu]
```
- OnnxRuntime GenAI with Ryzen AI Hybrid (NPU + iGPU) backend:
> Note: Ryzen AI Hybrid requires a Windows 11 PC with an AMD Ryzen™ AI 9 HX 375, Ryzen AI 9 HX 370, or Ryzen AI 9 365 processor.
> - Install the [Ryzen AI driver >= 32.0.203.237](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers) (you can check your driver version under Device Manager > Neural Processors).
> - Visit the [AMD Hugging Face page](https://huggingface.co/collections/amd/quark-awq-g128-int4-asym-fp16-onnx-hybrid-13-674b307d2ffa21dd68fa41d5) for supported checkpoints.
```bash
pip install turnkeyml[llm-oga-hybrid]
lemonade-install --ryzenai hybrid
```
- Hugging Face (PyTorch) LLMs for CPU backend:
```bash
pip install turnkeyml[llm]
```
- llama.cpp: see [instructions](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/llamacpp.md).

1. Use `lemonade -h` to explore the LLM tools, and see the [command](#cli-commands) and [API](#api) examples below.
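
After installation, a quick end-to-end check is to prompt a small model. A minimal smoke test, assuming the Hugging Face (`llm`) dependencies are included in your chosen extra (this is the same chat command shown in [Chatting](#chatting) below):

```bash
lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"
```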


## From Source Code

To install `lemonade` from source code:

1. Clone: `git clone https://github.com/onnx/turnkeyml.git`
1. `cd turnkeyml` (where `turnkeyml` is the repo root of your clone)
    - Note: be sure to run these installation instructions from the repo root.
1. Follow the same instructions as in the [PyPI installation](#from-pypi), except replace the `turnkeyml` package name with `-e .` to install your local clone in editable mode.
    - For example: `pip install -e .[llm-oga-igpu]`

# CLI Commands

The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.

## Syntax

You can read each command out loud to understand what it is doing. For example, a command like this:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

Can be read like this:

> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.

The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.

The `lemonade` CLI uses the same style of syntax as `turnkey`, but with a new set of LLM-specific tools. You can read about that syntax [here](https://github.com/onnx/turnkeyml#how-it-works).
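
Because each Tool passes its state to the next, several Tools can be chained in a single command. For example, this command (composed here from the tools documented below, as an illustration) loads a model once, then runs an MMLU test and a benchmark on it in the same invocation:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management oga-bench
```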

## Chatting

To chat with your LLM, try:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"
```

The LLM will run with your provided prompt, and its response will be printed to the screen. You can replace `"Hello, my thoughts are"` with any prompt you like.

You can also replace `facebook/opt-125m` with any Hugging Face checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc.

You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device.

Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about those tools.
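
As a variation, `oga-load` also targets CPU, so the same chat can run without an iGPU. A sketch, assuming int4 is supported on your CPU backend:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```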

## Accuracy

To measure the accuracy of an LLM using MMLU, try this:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
```

That command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`.

@@ -58,18 +129,34 @@ You can run the full suite of MMLU subjects by omitting the `--test` argument.
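
To cover more than one subject in a single run, you can pass additional test names. A sketch, assuming `--tests` accepts a space-separated list as its plural name suggests (confirm with `lemonade accuracy-mmlu -h`):

```bash
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management philosophy
```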

To measure the time-to-first-token and tokens/second of an LLM, try this:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load huggingface-bench
```

That command will run a few warmup iterations, then a few generation iterations where performance data is collected.

The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` or `lemonade huggingface-bench -h`.
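
As an illustration of tuning those parameters, an invocation might look like the sketch below. The flag names here are assumptions for illustration only; run `lemonade oga-bench -h` for the actual options:

```bash
# Flag names are hypothetical; check `lemonade oga-bench -h`
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench --iterations 10 --output-tokens 64
```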

## Memory Usage

The peak memory used by the `lemonade` build is captured in the build output. To capture more granular
memory usage information, use the `--memory` flag. For example:

OGA iGPU:
```bash
lemonade --memory -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:
```bash
lemonade --memory -i facebook/opt-125m huggingface-load huggingface-bench
```

In this case, a `memory_usage.png` file will be generated and stored in the build folder. This file
contains a figure plotting the memory usage over the build time. Learn more by running `lemonade -h`.
@@ -78,70 +165,66 @@ contains a figure plotting the memory usage over the build time. Learn more by

You can launch a WebSocket server for your LLM with:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 serve
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load serve
```
Once the server has launched, you can connect to it from your own application, or interact directly by following the on-screen instructions to open a basic web app.
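
To script against the server instead of using the web app, you can connect over a WebSocket from Python. A minimal client sketch, assuming the `websockets` package and a local endpoint such as `ws://localhost:8000` (the address and message format are assumptions; use whatever the on-screen instructions report):

```python
# Minimal client sketch; the endpoint and message format are assumptions.
import asyncio
import websockets

async def main():
    async with websockets.connect("ws://localhost:8000") as ws:
        await ws.send("Hello, my thoughts are")  # send a prompt
        print(await ws.recv())  # print the generated text

asyncio.run(main())
```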

# API

Lemonade is also available via API.

## LEAP APIs

The lemonade enablement platform (LEAP) API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid). This makes it easy to integrate lemonade LLMs into Python applications.

OGA iGPU:
```python
from lemonade import leap

model, tokenizer = leap.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="oga-igpu")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
```

You can learn more about the LEAP APIs [here](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade).

## Low-Level API

The low-level API is useful for designing custom experiments, such as sweeping over specific checkpoints, devices, and/or tools.

Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:

```python
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

state = State(cache_dir="cache", build_name="test")

state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
state = cl.Prompt().run(state, prompt="hi", max_new_tokens=15)

print("Response:", state.response)
```
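
For instance, a sweep over checkpoints (the use case mentioned above) can reuse those same tools in a loop. A minimal sketch, assuming the listed checkpoints are available to you:

```python
# Sweep sketch: prompt several checkpoints with the same low-level tools.
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

for checkpoint in ["facebook/opt-125m", "facebook/opt-350m"]:
    # Keep one build per checkpoint in the cache
    state = State(cache_dir="cache", build_name=checkpoint.replace("/", "_"))
    state = tl.HuggingfaceLoad().run(state, input=checkpoint)
    state = cl.Prompt().run(state, prompt="hi", max_new_tokens=15)
    print(checkpoint, "->", state.response)
```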

# Contributing

Contributions are welcome! If you decide to contribute, please:

- Do so via a pull request.
- Write your code in keeping with the same style as the rest of this repo's code.
- Add a test under `test/lemonade` that provides coverage of your new feature.

The best way to contribute is to add new tools to cover more devices and usage scenarios.

@@ -150,3 +233,5 @@ To add a new tool:
1. (Optional) Create a new `.py` file under `src/lemonade/tools` (or use an existing file if your tool fits into a pre-existing family of tools).
1. Define a new class that inherits the `Tool` class from `TurnkeyML` (see the sketch after this list).
1. Register the class by adding it to the list of `tools` near the top of `src/lemonade/cli.py`.
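
A skeleton of such a class might look like the sketch below. This is illustrative only: the import path, `unique_name` attribute, and `run` signature are assumptions modeled on the low-level API shown above, so confirm the real `Tool` interface in TurnkeyML before using it:

```python
# Hypothetical skeleton; confirm the actual Tool interface in TurnkeyML.
from turnkeyml.tools import Tool  # import path is an assumption

class MyNewTool(Tool):
    unique_name = "my-new-tool"  # name the tool would take on the CLI

    def run(self, state):
        # Transform state here, then return it for the next Tool in the chain
        return state
```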

You can learn more about contributing on the repository's [contribution guide](https://github.com/onnx/turnkeyml/blob/main/docs/contribute.md).