
adds basic smoketests for main_ds and data_process CLI args #280

Merged: 1 commit into instructlab:main on Oct 25, 2024

Conversation

@JamesKunstle (Contributor) commented Oct 17, 2024

During development it's convenient to be able to run full distributed training, even on a smaller dataset, just to make sure that nothing obviously fails. This will also capture support for flash attention on the machine that it's run on, and for granite models.

Also solves an inconvenience that's been annoying: testing for a given platform involves:

  1. grab image with versions of dependencies for cards
  2. run container
  3. download and install this repo
  4. write dumb versions of these scripts to run training and see if it breaks.

MODEL_NAME="instructlab/granite-7b-lab"
# gets directory of current file.
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
CORRECT_WORKING_DIR="$SCRIPT_DIR/../src/instructlab/training/"
Member:

@JamesKunstle Could you please use "${SCRIPT_DIR}" as the format for referencing shell variables? This will be really useful to have as a consistent format and reduces potential bugs introduced through incorrect delimiting.
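A quick sketch of the delimiting bug that braces prevent (variable names here are hypothetical, not from the PR):

```shell
# Without braces, bash reads the longest possible variable name, so
# appending a suffix silently references a different, likely-unset
# variable instead of the one you meant.
SCRIPT_DIR="/opt/tests"
unbraced="$SCRIPT_DIR_backup"     # looks up SCRIPT_DIR_backup (unset) -> empty
braced="${SCRIPT_DIR}_backup"     # unambiguous -> /opt/tests_backup
echo "unbraced: '${unbraced}'  braced: '${braced}'"
```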

--checkpoint_at_epoch \
--accelerate_full_state_at_epoch \
--distributed_training_framework="$DISTRIB_FRAMEWORK" \
--max_batch_len="$MAX_BATCH_LEN" \
Member:

This will cause it to break if we keep the trailing `\`:

Suggested change
--max_batch_len="$MAX_BATCH_LEN" \
--max_batch_len="$MAX_BATCH_LEN"
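The failure mode can be reproduced in isolation; a minimal sketch (not the actual test script):

```shell
# A trailing backslash joins the next line onto the command, so the
# function's closing brace becomes just another argument and the
# function body is never terminated -> bash reports a syntax error.
if bash -c 'f() { echo done \
}
f' 2>/dev/null; then
  echo "parsed (unexpected)"
else
  echo "syntax error, as the reviewer warned"
fi
```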

Contributor Author:

👍

}

function prepare_data () {
# preprocesses .jsonl messages data so that it's a valid
Member:

For documenting shell functions, the recommended convention would be:

################################################################
# Preprocesses .jsonl messages data so that it's a valid
# input to the model (inputs tokenized, formatted with mask,
# etc.)
# then, data is trimmed to a determined length to make training
# go faster.
# Inputs:
#     None
# Globals:
#     - SAMPLE_DATA_PATH
#     - DATA_DIR
#     - MODEL_NAME
#     - SAMPLES_TRAINED_ON
#     - COMPUTED_DATA_PATH
# Returns:
#     None
################################################################

Member:

^ We should adopt the above convention so that it's easier for people to drop in and build on top of these tests in the future. Otherwise they will need to spend a lot more effort to understand why all of these things exist and what their purpose is.

Contributor Author:

👍

Member:

Thank you for doing this, it looks great now!

}

# ############### Setup and tests ###############
setup_tmpdir
Member:

We should probably move this into a main function:

Suggested change
setup_tmpdir
function main() {
setup_tmpdir
trap "rm -rf $TMP_DIR" EXIT
#NOTE (jkunstle): script is run as though it's
# in the same source dir as main_ds and data_process.
cd "$CORRECT_WORKING_DIR"
echo "CURRENT WORKING DIRECTORY: $(pwd)"
prepare_data
test_standard_loop_noflashattention_nogranite
_cleanup_saved_checkpoints
test_standard_loop_nongranite
_cleanup_saved_checkpoints
test_standard_loop
}
main

Contributor Author:

👍


1. No Flash Attention or Granite
2. No Granite but Flash Attention enabled
3. Granite and Flash Attention enabled
Member:

We should consider making this a checkmark list so we can check things off over time:

- [ ] No Flash Attention or Granite
- [ ] No Granite but Flash Attention enabled
- [ ] Granite and Flash Attention enabled

Contributor Author:

These are done

Member:

Even better

tests/README.md Outdated

The testing script can be run without parameters as `./smoketest.sh`. By default, this will run all tests with `FSDP` as the distributed training backend. To change the distributed training backend to the other available option, one can run the script as `./smoketest.sh deepspeed`.

> NOTE: You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.
Member:

GitHub actually has a built-in syntax for this:

Suggested change
> NOTE: You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.
> [!NOTE]
> You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.

It renders like this:

Note

You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.

Contributor Author:

👍

@@ -0,0 +1,145 @@
#!/usr/bin/env bash
set -eux
Member:

I would also add `-o pipefail` here, as this will cause the script to stop if any command within a pipeline fails:

Suggested change
set -eux
set -eux -o pipefail
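A short, standalone demonstration of the difference:

```shell
# Without pipefail, a pipeline's exit status is the last command's,
# so `set -e` misses the failing producer:
bash -c 'set -e; false | cat; echo "failure went unnoticed"'

# With pipefail, the pipeline reports the failure and `set -e` aborts
# before the next command runs:
if ! bash -c 'set -e -o pipefail; false | cat; echo unreachable'; then
  echo "pipeline failure caught"
fi
```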

Contributor Author:

👍

@RobotSail (Member) left a comment:

Looks good, some minor comments. Once they are addressed I will LGTM

@RobotSail (Member):

The main thing is that, to make it easier for new developers to come on board and build on top of what we have, we should standardize on the Google Shell Style Guide. This will make it much easier for newcomers to understand how things work, why they're organized the way they are, and make changes with confidence.


# ############### User-modifiable parameters ###############
# Change these as needed
NUM_GPUS=8
Member:

Should we go ahead and make this a param?

Contributor Author:

👍


1. No Flash Attention or Granite
2. No Granite but Flash Attention enabled
3. Granite and Flash Attention enabled
Contributor:

Do we want a fourth Dolomite path? I guess this might make more sense after #257

Contributor Author:

Yeah that'd be smart, we can add that in another iteration

Contributor Author:

created an issue for this

The testing script can be run without parameters as `./smoketest.sh`. By default, this will run all tests with `FSDP` as the distributed training backend. To change the distributed training backend to the other available option, one can run the script as `./smoketest.sh deepspeed`.

The second positional argument sets the number of GPUs, e.g. `./smoketest.sh fsdp 8` runs the test on 8 GPUs with FSDP as the distributed backend.

Contributor:

This is a little confusing. We have FSDP as the default, so I can use `smoketest.sh` as-is. But if I want to set the number of GPUs, do I also have to specify FSDP, since it is the "second positional argument"? Will it break if I just do `smoketest.sh 8`? We may want to instead have flags like `--backend`/`-b` and `--num-gpus`/`-n`.
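One possible shape for that interface, sketched with `getopts` (flag names follow the suggestion above and are not part of the merged script):

```shell
#!/usr/bin/env bash
# Hypothetical argument parsing for smoketest.sh; short flags only,
# since getopts does not handle long options like --backend natively.
BACKEND="fsdp"  # default distributed training backend
NUM_GPUS=8      # default GPU count

while getopts "b:n:" opt; do
  case "$opt" in
    b) BACKEND="$OPTARG" ;;
    n) NUM_GPUS="$OPTARG" ;;
    *) echo "usage: $0 [-b backend] [-n num_gpus]" >&2; exit 2 ;;
  esac
done

echo "backend=${BACKEND} gpus=${NUM_GPUS}"
```

With this, `./smoketest.sh -n 8` sets the GPU count without forcing the caller to restate the default backend.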

Contributor Author:

Yeah it's a little annoying because a bit more boilerplate is required to have that behavior. I would prefer for this to be an issue + feature rather than adding it to this PR just because I want this test to be available asap for some testing stuff elsewhere. Would that be chill or do you want the behavior now?

Contributor Author:

created an issue for this

test_standard_loop
}

main
Contributor:

missing newline at EOF

Contributor Author:

++ will fix

Contributor Author:

done

During development it's convenient to be able to run full distributed
training, even on a smaller dataset, just to make sure that nothing
obviously fails. This will also capture support for flash attention on
the machine that it's run on, and for granite models.

Signed-off-by: James Kunstle <[email protected]>
set -eux -o pipefail

# ############### Read-only parameters ###############
MODEL_NAME="instructlab/granite-7b-lab"
Member:

Just FYI - the Bash convention is to use `'` for constant-value strings and `"` for strings which you interpolate variables into. Not a big deal here, just something to be aware of.
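The difference in a couple of lines (values here are just for illustration):

```shell
MODEL_NAME='instructlab/granite-7b-lab'  # constant value: single quotes
echo 'model: $MODEL_NAME'   # single quotes suppress expansion
echo "model: $MODEL_NAME"   # double quotes interpolate the variable
# prints:
#   model: $MODEL_NAME
#   model: instructlab/granite-7b-lab
```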

Contributor Author:

++

# None
#######################################
function setup_tmpdir () {
mkdir "$CHECKPOINTS_DIR"
Member:

It may be wise here to use `mkdir -p` in case the parent directories don't exist for whatever reason.

Contributor Author:

That particular dir will always have extant parents because it's directly descended from the temp dir, so this would appropriately break if something doesn't work

Member:

That makes sense, let's keep it as is then.
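The trade-off discussed above, in a standalone sketch:

```shell
# Plain mkdir fails loudly if the parent chain is missing, which is the
# behavior the author wants here: a missing parent means setup is broken.
# mkdir -p would silently paper over that by creating the whole chain.
tmp="$(mktemp -d)"
if ! mkdir "$tmp/missing/child" 2>/dev/null; then
  echo "plain mkdir refused: parent does not exist"
fi
mkdir -p "$tmp/missing/child" && echo "mkdir -p created the whole chain"
rm -rf "$tmp"
```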

--output_dir="$CHECKPOINTS_DIR" \
--num_epochs=1 \
--effective_batch_size=128 \
--save_samples=0 \
Member:

Would it make sense to have this set to a non-zero value so the relevant code-path is tested? I can imagine our spaghetti saving logic potentially breaking and this not picking up on it.

Contributor Author:

That's a good point, we can consider that in another test- this is data-dependent so we might want to tie it to the number of samples.

Member:

Yes, definitely something we can do in the future.

@RobotSail (Member) left a comment:

LGTM, a few comments that should be looked at but nothing blocking. Great work!

@mergify mergify bot removed the one-approval label Oct 25, 2024
@JamesKunstle JamesKunstle removed the request for review from nathan-weinberg October 25, 2024 20:06
@mergify mergify bot merged commit 466474a into instructlab:main Oct 25, 2024
6 checks passed
Labels: testing (Relates to testing)
5 participants