Commit 1e5a58f (parent 99d4468)
Docs: Update training README
Signed-off-by: Kelly Brown <[email protected]>
kelbrown20 committed Sep 26, 2024
Showing 1 changed file: README.md (134 additions, 92 deletions)

![Release](https://img.shields.io/github/v/release/instructlab/training)
![License](https://img.shields.io/github/license/instructlab/training)

- [Installing the library](#installing-the-library)
  - [Additional NVIDIA packages](#additional-nvidia-packages)
- [Using the library](#using-the-library)
- [Learning about training arguments](#learning-about-training-arguments)
  - [`TrainingArgs`](#trainingargs)
  - [`DeepSpeedOptions`](#deepspeedoptions)
  - [`FSDPOptions`](#fsdpoptions)
  - [`LoraOptions`](#loraoptions)
- [Learning about `TorchrunArgs` arguments](#learning-about-torchrunargs-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

To simplify the process of fine-tuning models with the [LAB
method](https://arxiv.org/abs/2403.01081), this library provides a simple training interface.

## Installing the library

To get started with the library, install it via `pip`.

Install the library:

```bash
pip install instructlab-training
```

For development, clone this repository and install it in editable mode so that local changes take effect immediately:

```bash
pip install -e ./training
```

### Additional NVIDIA packages

This library uses the `flash-attn` package, along with other packages that rely on NVIDIA-specific CUDA tooling.
If you are using NVIDIA hardware with CUDA, install the following additional dependencies.

Basic install:

```bash
pip install .[cuda]
```

Editable install (development):

```bash
pip install -e .[cuda]
```

## Using the library

You can use this training library by importing the necessary items:

```py
from instructlab.training import (
    run_training,
    TorchrunArgs,
    TrainingArgs,
    DeepSpeedOptions,
)
```

You can then define various training arguments. They will serve as the parameters for your training runs. See:

- [Learning about training arguments](#learning-about-training-arguments)
- [Example training run with arguments](#example-training-run-with-arguments)

## Learning about training arguments

The `TrainingArgs` class provides most of the customization options
for training jobs. There are a number of options you can specify, such as setting
`DeepSpeed` config values or running a `LoRA` training job instead of a full fine-tune.

### `TrainingArgs`

Here is a breakdown of the general options:

| Field | Description |
| --- | --- |
| mock_data_len | Max length of a single mock data sample. Equivalent to `max_seq_len` but for mock data. |
| deepspeed_options | Config options to specify for the DeepSpeed optimizer. |
| lora | Options to specify if you intend to perform a LoRA train instead of a full fine-tune. |
| chat_tmpl_path | Specifies the chat template / special tokens for training. |
| checkpoint_at_epoch | Whether or not we should save a checkpoint at the end of each epoch. |
| fsdp_options | The settings for controlling FSDP when it's selected as the distributed backend. |
| distributed_backend | Specifies which distributed training backend to use. Supported options are "fsdp" and "deepspeed". |
| disable_flash_attn | Disables flash attention when set to true. This allows for training on older devices. |
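
For instance, a configuration that opts into the FSDP backend and per-epoch checkpointing might look like the sketch below. It reuses the values from the full example at the end of this README and simply adds a few of the fields from the table above; treat the added values as illustrative, not as recommendations.

```py
from instructlab.training import TrainingArgs

training_args = TrainingArgs(
    # paths for the model, dataset, and outputs
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # core training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,

    # options documented in the table above (illustrative values)
    checkpoint_at_epoch = True,     # also save a checkpoint at the end of each epoch
    distributed_backend = "fsdp",   # or "deepspeed"
    disable_flash_attn = False,     # set to True to train on devices without flash-attn
)
```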

### `DeepSpeedOptions`

This library currently supports only a few options in `DeepSpeedOptions`.
The default is to run with DeepSpeed, so these options currently only
allow you to customize aspects of the ZeRO stage 2 optimizer.

| Field | Description |
| --- | --- |
| cpu_offload_optimizer | Whether or not to do CPU offloading in DeepSpeed stage 2. |
| cpu_offload_optimizer_ratio | Floating point between 0 & 1. Specifies the ratio of parameters updating (i.e. optimizer step) on CPU side. |
| cpu_offload_optimizer_pin_memory | If true, offload to page-locked CPU memory. This could boost throughput at the cost of extra memory overhead. |
| save_samples | The number of samples to see before saving a DeepSpeed checkpoint. |

For more information about DeepSpeed, see [deepspeed.ai](https://www.deepspeed.ai/).
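
As a rough sketch of how these options plug into a training run (assuming `DeepSpeedOptions` is imported from `instructlab.training` as shown earlier, and that the field names match the table above), enabling CPU offloading for the stage 2 optimizer could look like this:

```py
from instructlab.training import DeepSpeedOptions

# illustrative values only; see the table above for what each field controls
deepspeed_options = DeepSpeedOptions(
    cpu_offload_optimizer = True,             # run optimizer updates on the CPU
    cpu_offload_optimizer_ratio = 1.0,        # fraction of optimizer updates performed CPU-side
    cpu_offload_optimizer_pin_memory = True,  # page-locked CPU memory: faster transfers, more memory
    save_samples = 250000,                    # samples seen between DeepSpeed checkpoints
)
```

You would then pass this object as the `deepspeed_options` field of your `TrainingArgs`.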

### `FSDPOptions`

As with DeepSpeed, we expose only a handful of FSDP parameters for you to modify.
They are listed below:

| Field | Description |
| --- | --- |
| cpu_offload_params | When set to true, offload parameters from the accelerator onto the CPU. This is an all-or-nothing option. |
| sharding_strategy | Specifies the model sharding strategy that FSDP should use. Valid options are: `FULL_SHARD` (ZeRO-3), `HYBRID_SHARD` (ZeRO-3*), `SHARD_GRAD_OP` (ZeRO-2), and `NO_SHARD`. |

> [!NOTE]
> For `sharding_strategy` - Only `SHARD_GRAD_OP` has been extensively tested and is actively supported by this library.
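
A comparable sketch for FSDP is shown below. It assumes the options are exposed as an `FSDPOptions` class with the field names from the table above, and that `sharding_strategy` accepts the strategy name as a string; check the library's config definitions if your version differs.

```py
from instructlab.training import FSDPOptions  # assumed export name

# illustrative values only; SHARD_GRAD_OP is the strategy this library actively supports
fsdp_options = FSDPOptions(
    cpu_offload_params = False,           # keep parameters on the accelerator
    sharding_strategy = "SHARD_GRAD_OP",  # ZeRO-2 style sharding
)
```

As with DeepSpeed, you would pass this object through `TrainingArgs` (via its `fsdp_options` field), together with `distributed_backend = "fsdp"`.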

### `LoraOptions`

LoRA options currently supported:

| Field | Description |
| --- | --- |
| rank | The rank parameter for LoRA training. |
| alpha | The alpha parameter for LoRA training. |
| dropout | The dropout rate for LoRA training. |
| target_modules | The list of target modules for LoRA training. |
| quantize_data_type | The data type for quantization in LoRA training. Valid options are `None` and `"nf4"` |

#### Example run with LoRA options

If you'd like to do a LoRA train, you can specify a LoRA
option to `TrainingArgs` via the `LoraOptions` object.
```py
from instructlab.training import LoraOptions, TrainingArgs

training_args = TrainingArgs(
    lora = LoraOptions(
        rank = 4,
        alpha = 32,
        dropout = 0.1,
    ),
    # ... other options you'd like to specify
)
```
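
If you also want to quantize the base model during a LoRA run, the table above suggests a configuration along these lines; the target module names here are illustrative and depend on the model architecture, and `"nf4"` is the only quantization data type listed above.

```py
from instructlab.training import LoraOptions

# illustrative sketch: adjust target_modules to match your model's attention layers
qlora_options = LoraOptions(
    rank = 4,
    alpha = 32,
    dropout = 0.1,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    quantize_data_type = "nf4",
)
```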

## Learning about `TorchrunArgs` arguments

When running the training script, we always invoke `torchrun`.

If you are running a single-GPU system or something that doesn't
otherwise require distributed training configuration, you can create a default object:

```python
run_training(
    torchrun_args=TorchrunArgs(),
    training_args=training_args, # the TrainingArgs object from the example below
)
```

However, if you want to specify a more complex configuration,
the library currently supports all of the options that [torchrun accepts
today](https://pytorch.org/docs/stable/elastic/run.html#definitions).

> [!NOTE]
> For more information about the `torchrun` arguments, please consult the [torchrun documentation](https://pytorch.org/docs/stable/elastic/run.html#definitions).

### Example training run with `TorchrunArgs` arguments

For example, in an 8-GPU, 2-machine system, we would
specify the following torchrun config:

```python
torchrun_args = TorchrunArgs(
    nnodes = 2, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine: 0 on the first node, 1 on the second
    rdzv_id = 123,
    rdzv_endpoint = '<node0-host>:12345' # address of the rank-0 (main) node
)

run_training(
    torchrun_args=torchrun_args,
    train_args=training_args
)
```

## Example training run with arguments

Define the training arguments which will serve as the
parameters for our training run:

```py
# define training-specific arguments
training_args = TrainingArgs(
    # define data-specific arguments
    model_path = "ibm-granite/granite-7b-base",
    data_path = "path/to/dataset.jsonl",
    ckpt_output_dir = "data/saved_checkpoints",
    data_output_dir = "data/outputs",

    # define model-training parameters
    max_seq_len = 4096,
    max_batch_len = 60000,
    num_epochs = 10,
    effective_batch_size = 3840,
    save_samples = 250000,
    learning_rate = 2e-6,
    warmup_steps = 800,
    is_padding_free = True, # set this to true when using Granite-based models
    random_seed = 42,
)
```

We'll also need to define the settings for running a multi-process job
via `torchrun`. To do this, create a `TorchrunArgs` object.

> [!TIP]
> For single-GPU jobs, you can simply set `nnodes = 1` and `nproc_per_node = 1`.

```py
torchrun_args = TorchrunArgs(
    nnodes = 1, # number of machines
    nproc_per_node = 8, # num GPUs per machine
    node_rank = 0, # node rank for this machine
    rdzv_id = 123,
    rdzv_endpoint = '127.0.0.1:12345'
)
```

Finally, you can just call `run_training` and this library will handle
the rest 🙂.

```py
run_training(
    torchrun_args=torchrun_args,
    training_args=training_args,
)
```