
Support for Quark quantization. Update lemonade getting started. #290

Merged · 6 commits · Feb 10, 2025
52 changes: 52 additions & 0 deletions .github/workflows/test_quark.yml
@@ -0,0 +1,52 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: Test Lemonade with Quark Quantization

on:
  push:
    branches: ["main"]
  pull_request:
    branches: ["main"]

permissions:
  contents: read

jobs:
  make-quark-lemonade:
    env:
      LEMONADE_CI_MODE: "True"
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Miniconda with 64-bit Python
        uses: conda-incubator/setup-miniconda@v2
        with:
          miniconda-version: "latest"
          activate-environment: lemon
          python-version: "3.10"
          run-post: "false"
      - name: Install dependencies
        shell: bash -el {0}
        run: |
          python -m pip install --upgrade pip
          conda install pylint
          python -m pip check
          pip install -e .[llm-oga-cpu]
          lemonade-install --quark 0.6.0
      - name: Lint with Black
        uses: psf/black@stable
        with:
          options: "--check --verbose"
          src: "./src"
      - name: Lint with PyLint
        shell: bash -el {0}
        run: |
          pylint src/lemonade/tools/quark --rcfile .pylintrc --disable E0401
      - name: Run lemonade tests
        shell: bash -el {0}
        env:
          HF_TOKEN: "${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}" # Required by OGA model_builder in OGA 0.4.0 but not future versions
        run: |
          python test/lemonade/quark_api.py
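
For local reproduction, the core CI steps above boil down to a handful of commands. A minimal sketch, assuming a fresh conda environment on Windows and that you run from the repo root:

```bash
# Approximate local equivalent of the CI job above
conda create -n lemon python=3.10
conda activate lemon
pip install -e .[llm-oga-cpu]
lemonade-install --quark 0.6.0
python test/lemonade/quark_api.py
```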

219 changes: 152 additions & 67 deletions docs/lemonade/getting_started.md
@@ -1,54 +1,125 @@
# Lemonade

Welcome to the project page for `lemonade`, the Turnkey LLM Aide!
Contents:

1. [Install](#install)
1. [CLI Commands](#cli-commands)
- [Syntax](#syntax)
- [Chatting](#chatting)
- [Accuracy](#accuracy)
- [Benchmarking](#benchmarking)
- [Memory Usage](#memory-usage)
- [Serving](#serving)
1. [API Overview](#api)
1. [Code Organization](#code-organization)
1. [Contributing](#contributing)

# Install
You can quickly get started with `lemonade` by installing the `turnkeyml` [PyPI package](#from-pypi) with the appropriate extras for your backend, or you can [install from source](#from-source-code) by cloning and installing this repository.

## From PyPI

To install `lemonade` from PyPI:

1. Create and activate a [miniconda](https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe) environment.
```bash
conda create -n lemon python=3.10
conda activate lemon
```

1. Install lemonade for your backend of choice:
- [OnnxRuntime GenAI with CPU backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
```bash
pip install turnkeyml[llm-oga-cpu]
```
- [OnnxRuntime GenAI with Integrated GPU (iGPU, DirectML) backend](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/ort_genai_igpu.md):
> Note: Requires Windows and a DirectML-compatible iGPU.
```bash
pip install turnkeyml[llm-oga-igpu]
```
- OnnxRuntime GenAI with Ryzen AI Hybrid (NPU + iGPU) backend:
> Note: Ryzen AI Hybrid requires a Windows 11 PC with an AMD Ryzen™ AI 9 HX 375, Ryzen AI 9 HX 370, or Ryzen AI 9 365 processor.
> - Install the [Ryzen AI driver >= 32.0.203.237](https://ryzenai.docs.amd.com/en/latest/inst.html#install-npu-drivers) (you can check your driver version under Device Manager > Neural Processors).
> - Visit the [AMD Hugging Face page](https://huggingface.co/collections/amd/quark-awq-g128-int4-asym-fp16-onnx-hybrid-13-674b307d2ffa21dd68fa41d5) for supported checkpoints.
```bash
pip install turnkeyml[llm-oga-hybrid]
lemonade-install --ryzenai hybrid
```
- Hugging Face (PyTorch) LLMs for CPU backend:
```bash
pip install turnkeyml[llm]
```
- llama.cpp: see [instructions](https://github.com/onnx/turnkeyml/blob/main/docs/lemonade/llamacpp.md).

1. Use `lemonade -h` to explore the LLM tools, and see the [command](#cli-commands) and [API](#api) examples below.
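
After installation, a quick end-to-end check is to prompt a small model. A minimal smoke test, assuming the Hugging Face (`llm`) dependencies are included in your chosen extra (this is the same chat command shown in [Chatting](#chatting) below):

```bash
lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"
```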


## From Source Code

To install `lemonade` from source code:

1. Clone: `git clone https://github.com/onnx/turnkeyml.git`
1. `cd turnkeyml` (where `turnkeyml` is the repo root of your clone)
    - Note: be sure to run these installation instructions from the repo root.
1. Follow the same instructions as in the [PyPI installation](#from-pypi), except replace the `turnkeyml` package name with `-e .` to install your local clone in editable mode.
    - For example: `pip install -e .[llm-oga-igpu]`

# CLI Commands

The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.

## Syntax

You can read each command out loud to understand what it is doing. For example, a command like this:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

Can be read like this:

> Run `lemonade` on the input `(-i)` checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`) in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.

The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.

The `lemonade` CLI uses the same style of syntax as `turnkey`, but with a new set of LLM-specific tools. You can read about that syntax [here](https://github.com/onnx/turnkeyml#how-it-works).
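
Because each Tool passes its state to the next, several Tools can be chained in a single command. For example, this command (composed here from the tools documented below, as an illustration) loads a model once, then runs an MMLU test and a benchmark on it in the same invocation:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management oga-bench
```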

## Chatting

To chat with your LLM, try:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"
```

The LLM will run with your provided prompt, and its response will be printed to the screen. You can replace `"Hello, my thoughts are"` with any prompt you like.

You can also replace `facebook/opt-125m` with any Hugging Face checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc.

You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device.

Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about those tools.
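
As a variation, `oga-load` also targets CPU, so the same chat can run without an iGPU. A sketch, assuming int4 is supported on your CPU backend:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```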

## Accuracy

To measure the accuracy of an LLM using MMLU, try this:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
```

That command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`.

@@ -58,18 +129,34 @@ You can run the full suite of MMLU subjects by omitting the `--test` argument.
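
To cover more than one subject in a single run, you can pass additional test names. A sketch, assuming `--tests` accepts a space-separated list as its plural name suggests (confirm with `lemonade accuracy-mmlu -h`):

```bash
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management philosophy
```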

To measure the time-to-first-token and tokens/second of an LLM, try this:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load huggingface-bench
```

That command will run a few warmup iterations, then a few generation iterations where performance data is collected.

The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` or `lemonade huggingface-bench -h`.
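
As an illustration of tuning those parameters, an invocation might look like the sketch below. The flag names here are assumptions for illustration only; run `lemonade oga-bench -h` for the actual options:

```bash
# Flag names are hypothetical; check `lemonade oga-bench -h`
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench --iterations 10 --output-tokens 64
```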

## Memory Usage

The peak memory used by the `lemonade` build is captured in the build output. To capture more granular
memory usage information, use the `--memory` flag. For example:

OGA iGPU:
```bash
lemonade --memory -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:
```bash
lemonade --memory -i facebook/opt-125m huggingface-load huggingface-bench
```

In this case, a `memory_usage.png` file will be generated and stored in the build folder. This file
contains a figure plotting the memory usage over the build time. Learn more by running `lemonade -h`.
@@ -78,70 +165,66 @@ contains a figure plotting the memory usage over the build time. Learn more by

You can launch a WebSocket server for your LLM with:

OGA iGPU:
```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 serve
```

Hugging Face:
```bash
lemonade -i facebook/opt-125m huggingface-load serve
```
Once the server has launched, you can connect to it from your own application, or interact directly by following the on-screen instructions to open a basic web app.
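
To script against the server instead of using the web app, you can connect over a WebSocket from Python. A minimal client sketch, assuming the `websockets` package and a local endpoint such as `ws://localhost:8000` (the address and message format are assumptions; use whatever the on-screen instructions report):

```python
# Minimal client sketch; the endpoint and message format are assumptions.
import asyncio
import websockets

async def main():
    async with websockets.connect("ws://localhost:8000") as ws:
        await ws.send("Hello, my thoughts are")  # send a prompt
        print(await ws.recv())  # print the generated text

asyncio.run(main())
```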

# API

Lemonade is also available via API.

## LEAP APIs

The lemonade enablement platform (LEAP) API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid). This makes it easy to integrate lemonade LLMs into Python applications.

OGA iGPU:
```python
from lemonade import leap

model, tokenizer = leap.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="oga-igpu")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
```

You can learn more about the LEAP APIs [here](https://github.com/onnx/turnkeyml/tree/main/examples/lemonade).

## Low-Level API

The low-level API is useful for designing custom experiments, such as sweeping over specific checkpoints, devices, and/or tools.

Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:

```python
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

state = State(cache_dir="cache", build_name="test")

state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
state = cl.Prompt().run(state, prompt="hi", max_new_tokens=15)

print("Response:", state.response)
```
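
For instance, a sweep over checkpoints (the use case mentioned above) can reuse those same tools in a loop. A minimal sketch, assuming the listed checkpoints are available to you:

```python
# Sweep sketch: prompt several checkpoints with the same low-level tools.
import lemonade.tools.torch_llm as tl
import lemonade.tools.chat as cl
from turnkeyml.state import State

for checkpoint in ["facebook/opt-125m", "facebook/opt-350m"]:
    # Keep one build per checkpoint in the cache
    state = State(cache_dir="cache", build_name=checkpoint.replace("/", "_"))
    state = tl.HuggingfaceLoad().run(state, input=checkpoint)
    state = cl.Prompt().run(state, prompt="hi", max_new_tokens=15)
    print(checkpoint, "->", state.response)
```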

# Contributing

Contributions are welcome! If you decide to contribute, please:

- Do so via a pull request.
- Write your code in keeping with the same style as the rest of this repo's code.
- Add a test under `test/lemonade` that provides coverage of your new feature.

The best way to contribute is to add new tools to cover more devices and usage scenarios.

@@ -150,3 +233,5 @@ To add a new tool:
1. (Optional) Create a new `.py` file under `src/lemonade/tools` (or use an existing file if your tool fits into a pre-existing family of tools).
1. Define a new class that inherits the `Tool` class from `TurnkeyML` (see the sketch after this list).
1. Register the class by adding it to the list of `tools` near the top of `src/lemonade/cli.py`.
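
A skeleton of such a class might look like the sketch below. This is illustrative only: the import path, `unique_name` attribute, and `run` signature are assumptions modeled on the low-level API shown above, so confirm the real `Tool` interface in TurnkeyML before using it:

```python
# Hypothetical skeleton; confirm the actual Tool interface in TurnkeyML.
from turnkeyml.tools import Tool  # import path is an assumption

class MyNewTool(Tool):
    unique_name = "my-new-tool"  # name the tool would take on the CLI

    def run(self, state):
        # Transform state here, then return it for the next Tool in the chain
        return state
```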

You can learn more about contributing on the repository's [contribution guide](https://github.com/onnx/turnkeyml/blob/main/docs/contribute.md).