adds basic smoketests for main_ds and data_process CLI args #280
Conversation
Force-pushed from 6115704 to b637955
tests/smoketest.sh
Outdated
MODEL_NAME="instructlab/granite-7b-lab"
# gets directory of current file.
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
CORRECT_WORKING_DIR="$SCRIPT_DIR/../src/instructlab/training/"
@JamesKunstle Could you please use `"${SCRIPT_DIR}"` as the format for referencing shell variables? A consistent format is really useful to have, and it reduces potential bugs introduced through incorrect delimiting.
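A tiny illustration of the delimiting bug braces prevent (the `_old` suffix here is made up for the example):

```shell
#!/usr/bin/env bash
# Hypothetical example: unbraced expansion can silently reference the
# wrong variable when the name runs into adjacent text.
SCRIPT_DIR="/opt/tests"

# bash parses this as the (unset) variable SCRIPT_DIR_old, not SCRIPT_DIR:
echo "$SCRIPT_DIR_old"       # expands to an empty string

# braces delimit the variable name unambiguously:
echo "${SCRIPT_DIR}_old"     # expands to /opt/tests_old
```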
tests/smoketest.sh
Outdated
--checkpoint_at_epoch \
--accelerate_full_state_at_epoch \
--distributed_training_framework="$DISTRIB_FRAMEWORK" \
--max_batch_len="$MAX_BATCH_LEN" \
This will cause it to break if we keep the trailing `\`:
--max_batch_len="$MAX_BATCH_LEN"
👍
}

function prepare_data () {
    # preprocesses .jsonl messages data so that it's a valid
For documenting shell functions, the recommendation would be:
################################################################
# Preprocesses .jsonl messages data so that it's a valid
# input to the model (inputs tokenized, formatted with mask,
# etc.)
# then, data is trimmed to a determined length to make training
# go faster.
# Inputs:
# None
# Globals:
# - SAMPLE_DATA_PATH
# - DATA_DIR
# - MODEL_NAME
# - SAMPLES_TRAINED_ON
# - COMPUTED_DATA_PATH
# Returns:
# None
################################################################
^ We should adopt the above convention so that it's easier for people to drop in and build on top of these tests in the future. Otherwise they will need to spend a lot more effort to understand why all of these things exist and what their purpose is.
👍
Thank you for doing this, it looks great now!
tests/smoketest.sh
Outdated
}

# ############### Setup and tests ###############
setup_tmpdir
We should probably move this into a `main` function:
function main() {
    setup_tmpdir
    trap "rm -rf $TMP_DIR" EXIT

    # NOTE (jkunstle): script is run as though it's
    # in the same source dir as main_ds and data_process.
    cd "$CORRECT_WORKING_DIR"
    echo "CURRENT WORKING DIRECTORY: $(pwd)"

    prepare_data
    test_standard_loop_noflashattention_nogranite
    _cleanup_saved_checkpoints
    test_standard_loop_nongranite
    _cleanup_saved_checkpoints
    test_standard_loop
}

main
👍
1. No Flash Attention or Granite
2. No Granite but Flash Attention enabled
3. Granite and Flash Attention enabled
We should consider making this a checkmark list so we can check things off over time:
- [ ] No Flash Attention or Granite
- [ ] No Granite but Flash Attention enabled
- [ ] Granite and Flash Attention enabled
These are done
Even better
tests/README.md
Outdated
The testing script can be run without parameters as `./smoketest.sh`. By default, this will run all tests with `FSDP` as the distributed training backend. To change the distributed training backend to the other available option, one can run the script as `./smoketest.sh deepspeed`.

> NOTE: You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.
GitHub actually has built-in syntax for this:
> [!NOTE]
> You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.
It renders like this:
Note
You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.
👍
tests/smoketest.sh
Outdated
@@ -0,0 +1,145 @@
#!/usr/bin/env bash
set -eux
I would also add `-o pipefail` here, as this will cause it to stop if any command within a pipeline fails:
set -eux -o pipefail
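A minimal sketch of what `-o pipefail` changes (run without `set -e` so both cases print):

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's exit status is that of the LAST command,
# so a failure earlier in the pipe is masked:
false | true
echo "without pipefail: $?"   # prints: without pipefail: 0

# With pipefail, the pipeline fails if ANY command in it fails:
set -o pipefail
false | true
echo "with pipefail: $?"      # prints: with pipefail: 1
```

Combined with `set -e`, that non-zero status aborts the script instead of being silently ignored.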
👍
Looks good, some minor comments. Once they are addressed I will LGTM
The main thing is that, to make it easier for new developers to come on board and build on top of what we have, we should standardize on the Google Shell Style Guide. This will make it much easier for newcomers to understand how things work, why they're organized how they are, and make changes with confidence.
tests/smoketest.sh
Outdated
# ############### User-modifiable parameters ###############
# Change these as needed
NUM_GPUS=8
Should we go ahead and make this a param?
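If it does become a parameter, one minimal way would be a defaulted positional (a sketch only; the `$2` position assumes the backend stays the first positional argument):

```shell
#!/usr/bin/env bash
# Sketch: take the GPU count from the second positional argument,
# falling back to 8 when it isn't provided.
NUM_GPUS="${2:-8}"
echo "NUM_GPUS=${NUM_GPUS}"
```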
👍
Force-pushed from b637955 to 2b0ef02
1. No Flash Attention or Granite
2. No Granite but Flash Attention enabled
3. Granite and Flash Attention enabled
Do we want a fourth Dolomite path? I guess this might make more sense after #257
Yeah that'd be smart, we can add that in another iteration
created an issue for this
The testing script can be run without parameters as `./smoketest.sh`. By default, this will run all tests with `FSDP` as the distributed training backend. To change the distributed training backend to the other available option, one can run the script as `./smoketest.sh deepspeed`.

The second positional argument is the number of GPUs, e.g. `./smoketest.sh fsdp 8`. This will run the test with 8 GPUs, with fsdp as the distributed backend.
This is a little confusing. We have FSDP as the default, so I can use `smoketest.sh` as-is. But if I want to set the number of GPUs, do I also have to specify FSDP, since it is the "second positional argument"? Will it break if I just do `smoketest.sh 8`? We may want to instead just have `--backend/-b` and `--num-gpus/-n`.
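A sketch of what that flag interface could look like (option names from the comment above; the defaults are assumptions for illustration):

```shell
#!/usr/bin/env bash
# Sketch of the proposed flag-based interface so either option can be
# set independently of the other.
BACKEND="fsdp"
NUM_GPUS=8

while [ "$#" -gt 0 ]; do
  case "$1" in
    -b|--backend)  BACKEND="$2";  shift 2 ;;
    -n|--num-gpus) NUM_GPUS="$2"; shift 2 ;;
    *) echo "unknown option: $1" >&2; exit 1 ;;
  esac
done

echo "backend=${BACKEND} num_gpus=${NUM_GPUS}"
```

With this shape, `./smoketest.sh -n 8` works without also naming the backend.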
Yeah it's a little annoying because a bit more boilerplate is required to have that behavior. I would prefer for this to be an issue + feature rather than adding it to this PR just because I want this test to be available asap for some testing stuff elsewhere. Would that be chill or do you want the behavior now?
created an issue for this
tests/smoketest.sh
Outdated
    test_standard_loop
}

main
missing newline at EOF
++ will fix
done
Force-pushed from c54a0b9 to 5d6bc76
During development it's convenient to be able to run full distributed training, even on a smaller dataset, just to make sure that nothing obviously fails. This will also capture support for flash attention on the machine that it's run on, and for granite models. Signed-off-by: James Kunstle <[email protected]>
Force-pushed from 5d6bc76 to c0f3b44
set -eux -o pipefail

# ############### Read-only parameters ###############
MODEL_NAME="instructlab/granite-7b-lab"
Just FYI - the bash convention is to use `'` for constant-value strings and `"` for strings which you interpolate variables into. Not a big deal here, just something to be aware of.
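For instance (the `TMP_DIR` path here is illustrative, not the script's actual value):

```shell
#!/usr/bin/env bash
# Constant value: single quotes signal that no expansion happens here.
MODEL_NAME='instructlab/granite-7b-lab'

# Interpolated value: double quotes allow variable expansion.
TMP_DIR='/tmp/smoketest'
DATA_PATH="${TMP_DIR}/data.jsonl"
echo "${DATA_PATH}"   # prints /tmp/smoketest/data.jsonl
```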
++
# None
#######################################
function setup_tmpdir () {
    mkdir "$CHECKPOINTS_DIR"
It may be wise here to use `mkdir -p` in case the parent directories don't exist for whatever reason.
That particular dir will always have extant parents because it's directly descended from the temp dir, so this would appropriately break if something doesn't work
That makes sense, let's keep it as is then.
--output_dir="$CHECKPOINTS_DIR" \
--num_epochs=1 \
--effective_batch_size=128 \
--save_samples=0 \
Would it make sense to have this set to a non-zero value so the relevant code-path is tested? I can imagine our spaghetti saving logic potentially breaking and this not picking up on it.
That's a good point; we can consider that in another test. This is data-dependent, so we might want to tie it to the number of samples.
Yes, definitely something we can do in the future.
LGTM, a few comments that should be looked at but nothing blocking. Great work!
During development it's convenient to be able to run full distributed training, even on a smaller dataset, just to make sure that nothing obviously fails. This will also capture support for flash attention on the machine that it's run on, and for granite models.
Also solves an inconvenience that's been annoying: testing for a given platform involves: