Merge pull request #33 from rjojjr/STAGING
Release v1.5.0
rjojjr authored Aug 12, 2024
2 parents 7a96a7d + 1b3637c commit 1f50b23
Showing 7 changed files with 100 additions and 83 deletions.
106 changes: 56 additions & 50 deletions README.md
@@ -1,53 +1,25 @@
# Torch Tuner CLI README

The torch-tuner project currently serves as a simple convenient CLI wrapper for fine-tuning(and serving)
Llama based LLM models(and others in the near future) on Nvidia CUDA enabled GPUs(CPU support coming soon) with simple text samples(or JSON Lines files) using [LoRA](https://github.com/microsoft/LoRA), [Transformers](https://huggingface.co/docs/transformers/en/index) and [Torch](https://en.wikipedia.org/wiki/Torch_(machine_learning)).
The torch-tuner project currently serves as a simple convenient CLI wrapper for supervised fine-tuning(and serving)
Llama based LLM models(and others in the near future) on Nvidia CUDA enabled GPUs(CPU support coming soon)
with simple text samples(or JSON Lines files) using [LoRA](https://github.com/microsoft/LoRA), [Transformers](https://huggingface.co/docs/transformers/en/index) and [Torch](https://en.wikipedia.org/wiki/Torch_(machine_learning)).

Use torch-tuner's CLI to perform Supervised Fine-Tuning(SFT)(with LoRA) of
a suitable(Llama only ATM) base model that exists locally or on [Huggingface](https://huggingface.co) with simple text/JSONL and CUDA.
You can also use this CLI to deploy your model(or any model)
as a REST API that mimics commonly used OpenAI endpoints.
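
As a rough sketch(not a definitive recipe), a minimal LoRA SFT run might look like the example below. The training-data flags come from the dataset notes later in this README; `--new-model` and the example model repo are assumptions here, so confirm the exact flag names with `--help`.

```shell
# Hedged example of a basic SFT run. --base-model, --training-data-dir and
# --training-data-file are documented in this README; --new-model is assumed to
# mirror the CLI's internal "new_model" argument - verify with --help.
python src/main/main.py \
  --base-model meta-llama/Meta-Llama-3-8B \
  --new-model llama-tuned \
  --training-data-dir /path/to/data \
  --training-data-file samples.jsonl
```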

Ideally, in the future, the torch-tuner project will support more complex training data structures,
non-llama LLM types, CPU based tuning and fine-tuning vision and speech models.

## Serve Mode(EXPERIMENTAL)
## Running the Torch Tuner CLI

You can run the torch-tuner CLI in the new experimental serve mode to serve your model as a REST API that mimics the [OpenAI](https://openai.com/)
completions(`/v1/completions` & `/v1/chat/completions`) endpoints.
The torch-tuner CLI will fine-tune, merge, push(to Huggingface) and/or serve your new fine-tuned model depending
on the arguments you run it with.

```shell
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# When the Torch Tuner CLI is installed
torch-tuner \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# Use dynamic quantization
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080 \
--use-4bit true
```

The OpenAI-like REST endpoints ignore the model provided in the request body and always
evaluate requests against the model provided by the `--serve-model` argument.

**WARNING** - Serve mode is currently in an experimental state and should NEVER be used in a production environment.

## Running Torch Tuner

The tuner CLI will fine-tune, merge, push(to Huggingface) and/or serve your new model depending on the arguments
you run it with.

### Using Torch Tuner
### Using the Torch Tuner CLI

The torch-tuner CLI can be installed as a system-wide application, or run from source with python.
I typically wrap/configure my tuner CLI commands with bash scripts for convenience. You could also
use aliases to help keep your most commonly used CLI commands handy and easily accessible.
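
As a non-authoritative sketch, a wrapper script or alias can reuse the serve-mode flags shown later in this README:

```shell
#!/usr/bin/env bash
# serve-tuned.sh - hypothetical wrapper around the installed CLI; the flags are
# copied from the serve-mode examples elsewhere in this README.
set -euo pipefail

MODEL_NAME="${1:-llama-tuned}"   # example model name, adjust to your own
PORT="${2:-8080}"

torch-tuner \
  --serve true \
  --serve-model "$MODEL_NAME" \
  --serve-port "$PORT"
```

An equivalent alias would be something like `alias serve-tuned='torch-tuner --serve true --serve-model llama-tuned --serve-port 8080'`.
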
You might want to install the tuner CLI(using the instructions from the "Install Torch-Tuner CLI" section below) for
@@ -57,7 +29,7 @@ I currently use this CLI across several different debian based OSes(across multi
work on any OS. The torch-tuner CLI requires that you have the proper CUDA software/drivers(as well as python 3)
installed on the host. I would like to add CPU based tuning in the near future.
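
A quick, non-exhaustive sanity check of a host before a tuning run is to confirm that the Nvidia driver and Python 3 are both visible:

```shell
# If either command fails, fix the CUDA driver install or the Python 3 setup
# before running the tuner.
nvidia-smi
python3 --version
```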

#### Install Torch-Tuner CLI
#### Install the Torch Tuner CLI

You can install the torch-tuner CLI as a system-wide application on any Linux OS(and Mac OS) with [this script](scripts/install-torch-tuner.sh)
if you don't want to have to mess with python or the repository in general. Windows support is coming soon, although this will probably work on WSL(Windows Subsystem for Linux), which you should probably be using anyway. After installation,
@@ -103,27 +75,26 @@ to your command.
You can use a dataset from Huggingface by setting the `--training-data-dir` argument
to the desired HF repo name and excluding the `--training-data-file` argument.
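
For example(reusing the hypothetical flags from the sketch near the top of this README, with an illustrative public dataset repo), that might look like:

```shell
# Example only: the dataset repo name is illustrative and --new-model is an
# assumed flag name - confirm both with --help.
python src/main/main.py \
  --base-model meta-llama/Meta-Llama-3-8B \
  --new-model llama-tuned \
  --training-data-dir yahma/alpaca-cleaned
```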

#### JSONL
#### Simple Text

The tuner will load plain-text data from a text-based file. It expects each training sample
to consume exactly one line. This might be useful for older models, as well as for tuning a model with
large amounts of plain/unstructured text.

#### JSON Lines(JSONL)

Torch Tuner accepts JSONL training data in addition to plain text.
Torch Tuner accepts [JSONL](https://jsonlines.org/) training data in addition to plain text.

Accepted JSONL Formats:

```json lines
{"messages": [{"role": "system", "content": "You are helpful"}]}
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "Hi!"}]}

OR

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<context & prompt>", "completion": "<ideal AI response>"}
```

#### Simple Text

The tuner will load plain-text data from a text-based file. It expects each training sample
to consume exactly one line. Currently, this app requires that both the path to the directory
where the training data is stored and the training-data file name be supplied as separate
arguments.

### Install Dependencies

If you choose to not install the torch-tuner CLI, and run it from
@@ -178,9 +149,39 @@ To List Available Torch Tuner CLI Arguments:
python src/main/main.py --help
```

### Serve Mode(EXPERIMENTAL)

You can run the torch-tuner CLI in the new experimental serve mode to serve your model as a REST API that mimics the [OpenAI](https://openai.com/)
completions(`/v1/completions` & `/v1/chat/completions`) endpoints.

```shell
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# When the Torch Tuner CLI is installed
torch-tuner \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# Use dynamic quantization
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080 \
--use-4bit true
```

The OpenAI-like REST endpoints ignore the model provided in the request body and always
evaluate requests against the model provided by the `--serve-model` argument.

**WARNING** - Serve mode is currently in an experimental state and should NEVER be used in a production environment.
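
As a rough sketch, once serve mode is listening on port 8080(as in the examples above), a request against the chat completions endpoint could look like the following. The exact set of supported request properties isn't documented here, so this assumes the standard OpenAI request shape:

```shell
# Hypothetical request; the "model" value is ignored by the server, which
# always answers with the model passed via --serve-model.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ignored",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```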

### Useful Notes

Most of the default CLI arguments are configured to use the least amount of VRAM possible.
Most of the default CLI arguments are configured to consume the least amount of VRAM possible.

In theory, the base-model(`--base-model`) torch-tuner CLI argument will
accept a path to a locally saved model instead of a Huggingface repository
@@ -195,6 +196,11 @@ if you don't want to run the CLI(`torch-tuner --help`) to find them.
To request a feature or modification(or report a bug), please
submit a GitHub Issue. I gladly welcome and encourage any and all feedback.

### Roadmap

To view current plans for future work and features, please take a look at
the [roadmap](ROADMAP.md).

## LICENSE

This project is [licensed](LICENSE.txt) under MIT.
11 changes: 8 additions & 3 deletions ROADMAP.md
@@ -20,7 +20,7 @@ I plan to add a public [Trello](https://trello.com/) board for this project at s
but in the meantime I will track work/needs/bugs/requests here.

- Add production wrapper to LLM REST server
- Add ability to provide special tokens
- Add ability to provide special/regular tokens to model vocabulary
- Add Windows OS support
- Add support for non-llama models
- Mistral
@@ -35,5 +35,10 @@ but in the meantime I will track work/needs/bugs/requests here.
- Add ability to request specific adapters from completions endpoints
- Probably leveraging the model argument that is currently ignored
- Add CPU based SFT
- Add ability to configure training evaluations
- Add ability to prepare/configure/load more advanced datasets
- Add ability to configure/add advanced tuning evaluations
- Add ability to prepare/configure/load more advanced datasets
- Add ability to set max concurrent requests for LLM serve mode
- Add queue for waiting requests
- Add multi-gpu support
- Add support for ignored OpenAI request properties
- Add embeddings endpoint to serve mode
4 changes: 3 additions & 1 deletion src/main/arguments/arguments.py
@@ -98,7 +98,8 @@ def __init__(self,
use_agent_tokens: bool = False,
lr_scheduler_type: str = 'linear',
target_modules: list | None = None,
torch_empty_cache_steps: int | None = 1):
torch_empty_cache_steps: int | None = 1,
warmup_ratio: float = 0.03):
super(TuneArguments, self).__init__(new_model, is_fp16, is_bf16, use_4bit, use_8bit, fp32_cpu_offload, is_chat_model, padding_side, use_agent_tokens)
self.r = r
self.alpha = alpha
@@ -126,6 +127,7 @@ def __init__(self,
self.lr_scheduler_type = lr_scheduler_type
self.target_modules = target_modules
self.torch_empty_cache_steps = torch_empty_cache_steps
self.warmup_ratio = warmup_ratio


def validate(self) -> None:
19 changes: 12 additions & 7 deletions src/main/base/llm_base_module.py
@@ -12,7 +12,7 @@


def _add_agent_tokens(tokenizer, model):
agent_tokens = ["\nThought:", "\nAction:", "\nAction Input:", "\nObservation:"]
agent_tokens = ["\nThought:", "\nAction:", "\nAction Input:", "\nObservation:", "\nFinal Answer:"]
agent_tokens = set(agent_tokens) - set(tokenizer.vocab.keys())
tokenizer.add_tokens(list(agent_tokens))
if model is not None:
@@ -28,7 +28,7 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:
print(f"Starting fine-tuning of base model {arguments.base_model} for {arguments.new_model}")
print('')
output_dir = f"{arguments.output_directory}/checkpoints/{arguments.new_model}"
lora_dir = f"{arguments.output_directory}/checkpoints/{arguments.new_model}/adapter"
lora_dir = f"{arguments.output_directory}/adapters/{arguments.new_model}"
if not arguments.no_checkpoint:
print(f'Checkpointing to {output_dir}')
print('')
@@ -43,6 +43,11 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:
else:
target_modules = arguments.target_modules

if arguments.use_agent_tokens or arguments.is_chat_model:
target_modules.append("embed_tokens")
target_modules.append("lm_head")
target_modules = list(set(target_modules))

modules_to_save=["embed_tokens"] if arguments.save_embeddings else []


@@ -85,12 +90,12 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:
bf16=arguments.is_bf16,
max_grad_norm=arguments.max_gradient_norm,
max_steps=-1,
warmup_ratio=0.03,
warmup_ratio=arguments.warmup_ratio,
group_by_length=True,
lr_scheduler_type=arguments.lr_scheduler_type,
report_to="tensorboard",
do_eval=arguments.do_eval,
# TODO - add this as tuning arg
# TODO - is this ignored by SFTTrainer?
max_seq_length=4096,
dataset_text_field="text"
# TODO - investigate for instruction training
Expand All @@ -99,7 +104,7 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:

train = SFTTrainer(
model=model,
train_dataset=ds['train'] if arguments.train_file is not None else ds,
train_dataset=ds['train'],
tokenizer=tokenizer,
args=train_params
)
@@ -132,8 +137,8 @@ def merge_base(arguments: MergeArguments, tokenizer, base_model, bnb_config) ->
base_model, tokenizer = setup_chat_format(base_model, tokenizer)
if arguments.use_agent_tokens:
_add_agent_tokens(tokenizer, base_model)
lora_dir = f"{arguments.output_dir}/checkpoints/{arguments.new_model}/adapter"
model_dir = f'{arguments.output_dir}/{arguments.new_model}'
lora_dir = f"{arguments.output_dir}/adapters/{arguments.new_model}"
model_dir = f'{arguments.output_dir}/merged-models/{arguments.new_model}'
print(f"merging {arguments.base_model} with LoRA into {arguments.new_model}")

if arguments.use_agent_tokens:
3 changes: 1 addition & 2 deletions src/main/llama/functions.py
@@ -7,6 +7,7 @@


def merge(arguments: MergeArguments) -> None:
lora_dir = f"{arguments.output_dir}/adapters/{arguments.new_model}"
bnb_config, dtype = get_bnb_config_and_dtype(arguments)

base_model = LlamaForCausalLM.from_pretrained(
@@ -16,8 +17,6 @@ def merge(arguments: MergeArguments) -> None:
torch_dtype=dtype
)

lora_dir = f"{arguments.output_dir}/checkpoints/{arguments.new_model}/adapter"

tokenizer = AutoTokenizer.from_pretrained(lora_dir)
if arguments.padding_side is not None:
tokenizer.pad_token = tokenizer.eos_token
12 changes: 5 additions & 7 deletions src/main/main.py
@@ -9,11 +9,11 @@
import os

# TODO - Automate this
version = '1.4.5'
version = '1.5.0'

# TODO - Change this once support for more LLMs is added
title = f'Llama AI LLM LoRA Torch Text Fine-Tuner v{version}'
description = 'Fine-Tune Llama LLM models with simple text on Nvidia GPUs using Torch and LoRA.'
title = f'Llama AI LLM LoRA Torch Fine-Tuner v{version}'
description = 'CLI to Fine-Tune Llama AI LLMs with simple text and jsonl on Nvidia GPUs using Torch, Transformers and LoRA.'

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.8,expandable_segments:True"

@@ -41,16 +41,15 @@ def main() -> None:
factory = llm_executor_factory(LlmExecutorFactoryArguments(model=args.serve_model, use_4bit=args.use_4bit, use_8bit=args.use_8bit, is_fp16=args.use_fp_16, is_bf16=args.use_bf_16, padding_side=args.padding_side))
server = OpenAiLlmServer(factory())
server.start_server(ServerArguments(port=args.serve_port, debug=args.debug))
# TODO - cleaner exit
exit(0)
return

# Do all validations before printing configuration values
do_initial_arg_validation(args)

tuner = tuner_factory()

lora_scale = round(args.lora_alpha / args.lora_r, 1)
model_dir = f'{args.output_directory}/{args.new_model}'
model_dir = f'{args.output_directory}/merged-models/{args.new_model}'

tune_arguments = build_and_validate_tune_args(args)
merge_arguments = build_and_validate_merge_args(args)
@@ -131,7 +130,6 @@ def main() -> None:
print('')
print('---------------------------------------------')
print(f'{title} COMPLETED')
exit(0)


main_exception_handler(main, title, False)