Merge pull request #33 from rjojjr/STAGING
Release v1.5.0
rjojjr authored Aug 12, 2024
2 parents 7a96a7d + 1b3637c commit 1f50b23
Showing 7 changed files with 100 additions and 83 deletions.
106 changes: 56 additions & 50 deletions README.md
@@ -1,53 +1,25 @@
# Torch Tuner CLI README

The torch-tuner project currently serves as a simple convenient CLI wrapper for fine-tuning(and serving)
Llama based LLM models(and others in the near future) on Nvidia CUDA enabled GPUs(CPU support coming soon) with simple text samples(or JSON Lines files) using [LoRA](https://github.com/microsoft/LoRA), [Transformers](https://huggingface.co/docs/transformers/en/index) and [Torch](https://en.wikipedia.org/wiki/Torch_(machine_learning)).
The torch-tuner project currently serves as a simple convenient CLI wrapper for supervised fine-tuning(and serving)
Llama based LLM models(and others in the near future) on Nvidia CUDA enabled GPUs(CPU support coming soon)
with simple text samples(or JSON Lines files) using [LoRA](https://github.com/microsoft/LoRA), [Transformers](https://huggingface.co/docs/transformers/en/index) and [Torch](https://en.wikipedia.org/wiki/Torch_(machine_learning)).

Use torch-tuner's CLI to perform Supervised Fine-Tuning(SFT)(with LoRA) of
a suitable(Llama only ATM) base model that exists locally or on [Huggingface](https://huggingface.co) with simple text/JSONL and CUDA.
You can also use this CLI to deploy your model(or any model)
as a REST API that mimics commonly used OpenAI endpoints.
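
As a rough sketch(not a definitive recipe), a minimal LoRA SFT run might look like the example below. The training-data flags come from the dataset notes later in this README; `--new-model` and the example model repo are assumptions here, so confirm the exact flag names with `--help`.

```shell
# Hedged example of a basic SFT run. --base-model, --training-data-dir and
# --training-data-file are documented in this README; --new-model is assumed to
# mirror the CLI's internal "new_model" argument - verify with --help.
python src/main/main.py \
  --base-model meta-llama/Meta-Llama-3-8B \
  --new-model llama-tuned \
  --training-data-dir /path/to/data \
  --training-data-file samples.jsonl
```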

Ideally, in the future, the torch-tuner project will support more complex training data structures,
non-llama LLM types, CPU based tuning and fine-tuning vision and speech models.

## Serve Mode(EXPERIMENTAL)
## Running the Torch Tuner CLI

You can run the torch-tuner CLI in the new experimental serve mode to serve your model as a REST API that mimics the [OpenAI](https://openai.com/)
completions(`/v1/completions` & `/v1/chat/completions`) endpoints.
The torch-tuner CLI will fine-tune, merge, push(to Huggingface) and/or serve your new fine-tuned model depending
on the arguments you run it with.

```shell
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# When the Torch Tuner CLI is installed
torch-tuner \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# Use dynamic quantization
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080 \
--use-4bit true
```

The OpenAI-like REST endpoints ignore the model provided in the request body and always
evaluate requests against the model provided by the `--serve-model` argument.

**WARNING** - Serve mode is currently in an experimental state and should NEVER be used in a production environment.

## Running Torch Tuner

The tuner CLI will fine-tune, merge, push(to Huggingface) and/or serve your new model depending on the arguments
you run it with.

### Using Torch Tuner
### Using the Torch Tuner CLI

The torch-tuner CLI can be installed as a system-wide application, or run from source with python.
I typically wrap/configure my tuner CLI commands with bash scripts for convenience. You could also
use aliases to help keep your most commonly used CLI commands handy and easily accessible.
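
As a non-authoritative sketch, a wrapper script or alias can reuse the serve-mode flags shown later in this README:

```shell
#!/usr/bin/env bash
# serve-tuned.sh - hypothetical wrapper around the installed CLI; the flags are
# copied from the serve-mode examples elsewhere in this README.
set -euo pipefail

MODEL_NAME="${1:-llama-tuned}"   # example model name, adjust to your own
PORT="${2:-8080}"

torch-tuner \
  --serve true \
  --serve-model "$MODEL_NAME" \
  --serve-port "$PORT"
```

An equivalent alias would be something like `alias serve-tuned='torch-tuner --serve true --serve-model llama-tuned --serve-port 8080'`.
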
You might want to install the tuner CLI(using the instructions from the "Install Torch-Tuner CLI" section below) for
@@ -57,7 +29,7 @@ I currently use this CLI across several different debian based OSes(across multi
work on any OS. The torch-tuner CLI requires that you have the proper CUDA software/drivers(as well as python 3)
installed on the host. I would like to add CPU based tuning in the near future.
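
A quick, non-exhaustive sanity check of a host before a tuning run is to confirm that the Nvidia driver and Python 3 are both visible:

```shell
# If either command fails, fix the CUDA driver install or the Python 3 setup
# before running the tuner.
nvidia-smi
python3 --version
```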

#### Install Torch-Tuner CLI
#### Install the Torch Tuner CLI

You can install the torch-tuner CLI as a system-wide application on any Linux OS(and Mac OS) with [this script](scripts/install-torch-tuner.sh)
if you don't want to have to mess with python or the repository in general. Windows support is coming soon, although this will probably work on WSL(Windows Subsystem for Linux), which you should probably be using anyway. After installation,
@@ -103,27 +75,26 @@ to your command.
You can use a dataset from Huggingface by setting the `--training-data-dir` argument
to the desired HF repo name and excluding the `--training-data-file` argument.
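
For example(reusing the hypothetical flags from the sketch near the top of this README, with an illustrative public dataset repo), that might look like:

```shell
# Example only: the dataset repo name is illustrative and --new-model is an
# assumed flag name - confirm both with --help.
python src/main/main.py \
  --base-model meta-llama/Meta-Llama-3-8B \
  --new-model llama-tuned \
  --training-data-dir yahma/alpaca-cleaned
```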

#### JSONL
#### Simple Text

The tuner will load plain-text data from a text-based file. It expects each training sample
to consume exactly one line. This might be useful for older models, as well as for tuning a model with
large amounts of plain/unstructured text.

#### JSON Lines(JSONL)

Torch Tuner accepts JSONL training data in addition to plain text.
Torch Tuner accepts [JSONL](https://jsonlines.org/) training data in addition to plain text.

Accepted JSONL Formats:

```json lines
{"messages": [{"role": "system", "content": "You are helpful"}]}
{"messages": [{"role": "system", "content": "You are helpful"}, {"role": "user", "content": "Hi!"}]}

OR

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<context & prompt>", "completion": "<ideal AI response>"}
```

#### Simple Text

The tuner will load plain-text data from a text-based file. It expects each training sample
to consume exactly one line. Currently, this app requires that both the path to the directory
where the training data is stored and the training-data file name be supplied as separate
arguments.

### Install Dependencies

If you choose to not install the torch-tuner CLI, and run it from
@@ -178,9 +149,39 @@ To List Available Torch Tuner CLI Arguments:
python src/main/main.py --help
```

### Serve Mode(EXPERIMENTAL)

You can run the torch-tuner CLI in the new experimental serve mode to serve your model as a REST API that mimics the [OpenAI](https://openai.com/)
completions(`/v1/completions` & `/v1/chat/completions`) endpoints.

```shell
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# When the Torch Tuner CLI is installed
torch-tuner \
--serve true \
--serve-model llama-tuned \
--serve-port 8080

# Use dynamic quantization
python src/main/main.py \
--serve true \
--serve-model llama-tuned \
--serve-port 8080 \
--use-4bit true
```

The OpenAI-like REST endpoints ignore the model provided in the request body and always
evaluate requests against the model provided by the `--serve-model` argument.

**WARNING** - Serve mode is currently in an experimental state and should NEVER be used in a production environment.
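
As a rough sketch, once serve mode is listening on port 8080(as in the examples above), a request against the chat completions endpoint could look like the following. The exact set of supported request properties isn't documented here, so this assumes the standard OpenAI request shape:

```shell
# Hypothetical request; the "model" value is ignored by the server, which
# always answers with the model passed via --serve-model.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ignored",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```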

### Useful Notes

Most of the default CLI arguments are configured to use the least amount of VRAM possible.
Most of the default CLI arguments are configured to consume the least amount of VRAM possible.

In theory, the base-model(`--base-model`) torch-tuner CLI argument will
accept a path to a locally saved model instead of a Huggingface repository
@@ -195,6 +196,11 @@ if you don't want to run the CLI(`torch-tuner --help`) to find them.
To request a feature or modification(or report a bug), please
submit a GitHub Issue. I gladly welcome and encourage any and all feedback.

### Roadmap

To view current plans for future work and features, please take a look at
the [roadmap](ROADMAP.md).

## LICENSE

This project is [licensed](LICENSE.txt) under MIT.
11 changes: 8 additions & 3 deletions ROADMAP.md
@@ -20,7 +20,7 @@ I plan to add a public [Trello](https://trello.com/) board for this project at s
but in the meantime I will track work/needs/bugs/requests here.

- Add production wrapper to LLM REST server
- Add ability to provide special tokens
- Add ability to provide special/regular tokens to model vocabulary
- Add Windows OS support
- Add support for non-llama models
- Mistral
@@ -35,5 +35,10 @@ but in the meantime I will track work/needs/bugs/requests here.
- Add ability to request specific adapters from completions endpoints
- Probably leveraging the model argument that is currently ignored
- Add CPU based SFT
- Add ability to configure training evaluations
- Add ability to prepare/configure/load more advanced datasets
- Add ability to configure/add advanced tuning evaluations
- Add ability to prepare/configure/load more advanced datasets
- Add ability to set max concurrent requests for LLM serve mode
- Add queue for waiting requests
- Add multi-gpu support
- Add support for ignored OpenAI request properties
- Add embeddings endpoint to serve mode
4 changes: 3 additions & 1 deletion src/main/arguments/arguments.py
@@ -98,7 +98,8 @@ def __init__(self,
use_agent_tokens: bool = False,
lr_scheduler_type: str = 'linear',
target_modules: list | None = None,
torch_empty_cache_steps: int | None = 1):
torch_empty_cache_steps: int | None = 1,
warmup_ratio: float = 0.03):
super(TuneArguments, self).__init__(new_model, is_fp16, is_bf16, use_4bit, use_8bit, fp32_cpu_offload, is_chat_model, padding_side, use_agent_tokens)
self.r = r
self.alpha = alpha
@@ -126,6 +127,7 @@ def __init__(self,
self.lr_scheduler_type = lr_scheduler_type
self.target_modules = target_modules
self.torch_empty_cache_steps = torch_empty_cache_steps
self.warmup_ratio = warmup_ratio


def validate(self) -> None:
19 changes: 12 additions & 7 deletions src/main/base/llm_base_module.py
@@ -12,7 +12,7 @@


def _add_agent_tokens(tokenizer, model):
agent_tokens = ["\nThought:", "\nAction:", "\nAction Input:", "\nObservation:"]
agent_tokens = ["\nThought:", "\nAction:", "\nAction Input:", "\nObservation:", "\nFinal Answer:"]
agent_tokens = set(agent_tokens) - set(tokenizer.vocab.keys())
tokenizer.add_tokens(list(agent_tokens))
if model is not None:
@@ -28,7 +28,7 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:
print(f"Starting fine-tuning of base model {arguments.base_model} for {arguments.new_model}")
print('')
output_dir = f"{arguments.output_directory}/checkpoints/{arguments.new_model}"
lora_dir = f"{arguments.output_directory}/checkpoints/{arguments.new_model}/adapter"
lora_dir = f"{arguments.output_directory}/adapters/{arguments.new_model}"
if not arguments.no_checkpoint:
print(f'Checkpointing to {output_dir}')
print('')
@@ -43,6 +43,11 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:
else:
target_modules = arguments.target_modules

if arguments.use_agent_tokens or arguments.is_chat_model:
target_modules.append("embed_tokens")
target_modules.append("lm_head")
target_modules = list(set(target_modules))

modules_to_save=["embed_tokens"] if arguments.save_embeddings else []


@@ -85,12 +90,12 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:
bf16=arguments.is_bf16,
max_grad_norm=arguments.max_gradient_norm,
max_steps=-1,
warmup_ratio=0.03,
warmup_ratio=arguments.warmup_ratio,
group_by_length=True,
lr_scheduler_type=arguments.lr_scheduler_type,
report_to="tensorboard",
do_eval=arguments.do_eval,
# TODO - add this as tuning arg
# TODO - is this ignored by SFTTrainer?
max_seq_length=4096,
dataset_text_field="text"
# TODO - investigate for instruction training
Expand All @@ -99,7 +104,7 @@ def fine_tune_base(arguments: TuneArguments, tokenizer, base_model) -> None:

train = SFTTrainer(
model=model,
train_dataset=ds['train'] if arguments.train_file is not None else ds,
train_dataset=ds['train'],
tokenizer=tokenizer,
args=train_params
)
@@ -132,8 +137,8 @@ def merge_base(arguments: MergeArguments, tokenizer, base_model, bnb_config) ->
base_model, tokenizer = setup_chat_format(base_model, tokenizer)
if arguments.use_agent_tokens:
_add_agent_tokens(tokenizer, base_model)
lora_dir = f"{arguments.output_dir}/checkpoints/{arguments.new_model}/adapter"
model_dir = f'{arguments.output_dir}/{arguments.new_model}'
lora_dir = f"{arguments.output_dir}/adapters/{arguments.new_model}"
model_dir = f'{arguments.output_dir}/merged-models/{arguments.new_model}'
print(f"merging {arguments.base_model} with LoRA into {arguments.new_model}")

if arguments.use_agent_tokens:
3 changes: 1 addition & 2 deletions src/main/llama/functions.py
@@ -7,6 +7,7 @@


def merge(arguments: MergeArguments) -> None:
lora_dir = f"{arguments.output_dir}/adapters/{arguments.new_model}"
bnb_config, dtype = get_bnb_config_and_dtype(arguments)

base_model = LlamaForCausalLM.from_pretrained(
@@ -16,8 +17,6 @@ def merge(arguments: MergeArguments) -> None:
torch_dtype=dtype
)

lora_dir = f"{arguments.output_dir}/checkpoints/{arguments.new_model}/adapter"

tokenizer = AutoTokenizer.from_pretrained(lora_dir)
if arguments.padding_side is not None:
tokenizer.pad_token = tokenizer.eos_token
12 changes: 5 additions & 7 deletions src/main/main.py
@@ -9,11 +9,11 @@
import os

# TODO - Automate this
version = '1.4.5'
version = '1.5.0'

# TODO - Change this once support for more LLMs is added
title = f'Llama AI LLM LoRA Torch Text Fine-Tuner v{version}'
description = 'Fine-Tune Llama LLM models with simple text on Nvidia GPUs using Torch and LoRA.'
title = f'Llama AI LLM LoRA Torch Fine-Tuner v{version}'
description = 'CLI to Fine-Tune Llama AI LLMs with simple text and jsonl on Nvidia GPUs using Torch, Transformers and LoRA.'

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.8,expandable_segments:True"

@@ -41,16 +41,15 @@ def main() -> None:
factory = llm_executor_factory(LlmExecutorFactoryArguments(model=args.serve_model, use_4bit=args.use_4bit, use_8bit=args.use_8bit, is_fp16=args.use_fp_16, is_bf16=args.use_bf_16, padding_side=args.padding_side))
server = OpenAiLlmServer(factory())
server.start_server(ServerArguments(port=args.serve_port, debug=args.debug))
# TODO - cleaner exit
exit(0)
return

# Do all validations before printing configuration values
do_initial_arg_validation(args)

tuner = tuner_factory()

lora_scale = round(args.lora_alpha / args.lora_r, 1)
model_dir = f'{args.output_directory}/{args.new_model}'
model_dir = f'{args.output_directory}/merged-models/{args.new_model}'

tune_arguments = build_and_validate_tune_args(args)
merge_arguments = build_and_validate_merge_args(args)
@@ -131,7 +130,6 @@ def main() -> None:
print('')
print('---------------------------------------------')
print(f'{title} COMPLETED')
exit(0)


main_exception_handler(main, title, False)