Commit 1e5a58f (parent 99d4468)
Docs: Update training README
Signed-off-by: Kelly Brown <[email protected]>
kelbrown20 committed Sep 26, 2024
Showing 1 changed file: README.md (134 additions, 92 deletions)

![Release](https://img.shields.io/github/v/release/instructlab/training)
![License](https://img.shields.io/github/license/instructlab/training)

- [Installing the library](#installing-the-library)
  - [Additional NVIDIA packages](#additional-nvidia-packages)
- [Using the library](#using-the-library)
- [Learning about training arguments](#learning-about-training-arguments)
  - [`TrainingArgs`](#trainingargs)
  - [`DeepSpeedOptions`](#deepspeedoptions)
  - [`FSDPOptions`](#fsdpoptions)
  - [`LoraOptions`](#loraoptions)
- [Learning about `TorchrunArgs` arguments](#learning-about-torchrunargs-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), this library provides a simple training interface.

## Installing the library

To get started with the library, install it via `pip`.

Install the library:

```bash
pip install instructlab-training
```

For development, clone this repository and install it in editable mode so that local changes take effect immediately:

```bash
pip install -e ./training
```

### Additional NVIDIA packages

This library uses the `flash-attn` package, along with other packages that rely on NVIDIA-specific CUDA tooling.
If you are using NVIDIA hardware with CUDA, install the following additional dependencies.

Basic install:

```bash
pip install .[cuda]
```

Editable install (development):

```bash
pip install -e .[cuda]
```

## Using the library

You can use this training library by importing the necessary items:

```py
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
    DeepSpeedOptions,
)
```

You can then define various training arguments. They will serve as the parameters for your training runs. See:

- [Learning about training arguments](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Learning about training arguments

The `TrainingArgs` class provides most of the customization options
for training jobs. There are a number of options you can specify, such as setting
`DeepSpeed` config values or running a `LoRA` training job instead of a full fine-tune.

### `TrainingArgs`

Here is a breakdown of the general options:

| Field | Description |
| --- | --- |
| mock_data_len | Max length of a single mock data sample. Equivalent to `max_seq_len` but for mock data. |
| deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
| lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |
| chat_tmpl_path | Specifies the chat template / special tokens for training. |
| checkpoint_at_epoch | Whether or not we should save a checkpoint at the end of each epoch. |
| fsdp_options | The settings for controlling FSDP when it's selected as the distributed backend. |
| distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
| disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |
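
For instance, a configuration that opts into the FSDP backend and per-epoch checkpointing might look like the sketch below. It reuses the values from the full example at the end of this README and simply adds a few of the fields from the table above; treat the added values as illustrative, not as recommendations.

```py
from instructlab.training import TrainingArgs

training_args = TrainingArgs(
    # paths for the model, dataset, and outputs
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # core training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,

    # options documented in the table above (illustrative values)
    checkpoint_at_epoch = True,     # also save a checkpoint at the end of each epoch
    distributed_backend = "fsdp",   # or "deepspeed"
    disable_flash_attn = False,     # set to True to train on devices without flash-attn
)
```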

### `DeepSpeedOptions`

This library currently supports only a few options in `DeepSpeedOptions`.
The default is to run with DeepSpeed, so these options currently only
allow you to customize aspects of the ZeRO stage 2 optimizer.

| Field | Description |
| --- | --- |
| cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
| cpu_offload_optimizer_ratio | Floating point between 0 & 1. Specifies the ratio of parameters updating (i.e. optimizer step) on CPU side. |
| cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
| save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |

For more information about DeepSpeed, see [deepspeed.ai](https://www.deepspeed.ai/).
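
As a rough sketch of how these options plug into a training run (assuming `DeepSpeedOptions` is imported from `instructlab.training` as shown earlier, and that the field names match the table above), enabling CPU offloading for the stage 2 optimizer could look like this:

```py
from instructlab.training import DeepSpeedOptions

# illustrative values only; see the table above for what each field controls
deepspeed_options = DeepSpeedOptions(
    cpu_offload_optimizer = True,             # run optimizer updates on the CPU
    cpu_offload_optimizer_ratio = 1.0,        # fraction of optimizer updates performed CPU-side
    cpu_offload_optimizer_pin_memory = True,  # page-locked CPU memory: faster transfers, more memory
    save_samples = 250000,                    # samples seen between DeepSpeed checkpoints
)
```

You would then pass this object as the `deepspeed_options` field of your `TrainingArgs`.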

### `FSDPOptions`

As with DeepSpeed, we expose only a handful of FSDP parameters for you to modify.
They are listed below:

| Field | Description |
| --- | --- |
| cpu_offload_params | When set to true, offload parameters from the accelerator onto the CPU. This is an all-or-nothing option. |
| sharding_strategy | Specifies the model sharding strategy that FSDP should use. Valid options are: `FULL_SHARD` (ZeRO-3), `HYBRID_SHARD` (ZeRO-3*), `SHARD_GRAD_OP` (ZeRO-2), and `NO_SHARD`. |

> [!NOTE]
> For `sharding_strategy` - Only `SHARD_GRAD_OP` has been extensively tested and is actively supported by this library.
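
A comparable sketch for FSDP is shown below. It assumes the options are exposed as an `FSDPOptions` class with the field names from the table above, and that `sharding_strategy` accepts the strategy name as a string; check the library's config definitions if your version differs.

```py
from instructlab.training import FSDPOptions  # assumed export name

# illustrative values only; SHARD_GRAD_OP is the strategy this library actively supports
fsdp_options = FSDPOptions(
    cpu_offload_params = False,           # keep parameters on the accelerator
    sharding_strategy = "SHARD_GRAD_OP",  # ZeRO-2 style sharding
)
```

As with DeepSpeed, you would pass this object through `TrainingArgs` (via its `fsdp_options` field), together with `distributed_backend = "fsdp"`.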

### `LoraOptions`

LoRA options currently supported:

| Field | Description |
| --- | --- |
| rank | The rank parameter for LoRA training. |
| alpha | The alpha parameter for LoRA training. |
| dropout | The dropout rate for LoRA training. |
| target_modules | The list of target modules for LoRA training. |
| quantize_data_type | The data type for quantization in LoRA training. Valid options are `None` and `"nf4"` |

#### Example run with LoRA options

If you'd like to do a LoRA train, you can specify a LoRA
option to `TrainingArgs` via the `LoraOptions` object.
```py
from instructlab.training import LoraOptions, TrainingArgs

training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # ... other options you'd like to specify
)
```
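
If you also want to quantize the base model during a LoRA run, the table above suggests a configuration along these lines; the target module names here are illustrative and depend on the model architecture, and `"nf4"` is the only quantization data type listed above.

```py
from instructlab.training import LoraOptions

# illustrative sketch: adjust target_modules to match your model's attention layers
qlora_options = LoraOptions(
    rank = 4,
    alpha = 32,
    dropout = 0.1,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    quantize_data_type = "nf4",
)
```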

## Learning about `TorchrunArgs` arguments

When running the training script, we always invoke `torchrun`.

If you are running a single-GPU system or something that doesn't
otherwise require distributed training configuration, you can create a default object:

```python
run_training(
    torchrun_args=TorchrunArgs(),
    training_args=training_args, # the TrainingArgs object from the example below
)
```

However, if you want to specify a more complex configuration,
the library currently supports all of the options that [torchrun accepts
today](https://pytorch.org/docs/stable/elastic/run.html#definitions).

> [!NOTE]
> For more information about the `torchrun` arguments, please consult the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html#definitions).

### Example training run with `TorchrunArgs` arguments

For example, in an 8-GPU, 2-machine system, we would
specify the following torchrun config:

```python
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine: 0 on the first node, 1 on the second
    rdzv_id = 123,
    rdzv_endpoint = '<node0-host>:12345' # address of the rank-0 (main) node
)

run_training(
    torchrun_args=torchrun_args,
    train_args=training_args
)
```

## Example training run with arguments

Define the training arguments which will serve as the
parameters for our training run:

```py
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
```

We'll also need to define the settings for running a multi-process job
via `torchrun`. To do this, create a `TorchrunArgs` object.

> [!TIP]
> For single-GPU jobs, you can simply set `nnodes = 1` and `nproc_per_node = 1`.

```py
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
```

Finally, you can just call `run_training` and this library will handle
the rest 🙂.

```py
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```