
adds basic smoketests for main_ds and data_process CLI args #280

Merged: 1 commit into instructlab:main on Oct 25, 2024

Conversation

@JamesKunstle (Contributor) commented Oct 17, 2024

During development it's convenient to be able to run full distributed training, even on a smaller dataset, just to make sure that nothing obviously fails. This will also capture support for flash attention on the machine that it's run on, and for granite models.

Also solves an inconvenience that's been annoying: testing for a given platform involves:

  1. grab image with versions of dependencies for cards
  2. run container
  3. download and install this repo
  4. write dumb versions of these scripts to run training and see if it breaks.

MODEL_NAME="instructlab/granite-7b-lab"
# gets directory of current file.
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
CORRECT_WORKING_DIR="$SCRIPT_DIR/../src/instructlab/training/"
Member:

@JamesKunstle Could you please use "${SCRIPT_DIR}" as the format for referencing shell variables? This will be really useful to have as a consistent format and reduces potential bugs introduced through incorrect delimiting.
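A quick sketch of the delimiting bug that braces prevent (variable names here are hypothetical, not from the PR):

```shell
# Without braces, bash reads the longest possible variable name, so
# appending a suffix silently references a different, likely-unset
# variable instead of the one you meant.
SCRIPT_DIR="/opt/tests"
unbraced="$SCRIPT_DIR_backup"     # looks up SCRIPT_DIR_backup (unset) -> empty
braced="${SCRIPT_DIR}_backup"     # unambiguous -> /opt/tests_backup
echo "unbraced: '${unbraced}'  braced: '${braced}'"
```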

--checkpoint_at_epoch \
--accelerate_full_state_at_epoch \
--distributed_training_framework="$DISTRIB_FRAMEWORK" \
--max_batch_len="$MAX_BATCH_LEN" \
Member:

This will cause it to break if we keep the trailing `\`:

Suggested change
--max_batch_len="$MAX_BATCH_LEN" \
--max_batch_len="$MAX_BATCH_LEN"
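The failure mode can be reproduced in isolation; a minimal sketch (not the actual test script):

```shell
# A trailing backslash joins the next line onto the command, so the
# function's closing brace becomes just another argument and the
# function body is never terminated -> bash reports a syntax error.
if bash -c 'f() { echo done \
}
f' 2>/dev/null; then
  echo "parsed (unexpected)"
else
  echo "syntax error, as the reviewer warned"
fi
```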

Contributor Author:

👍

}

function prepare_data () {
# preprocesses .jsonl messages data so that it's a valid
Member:

For documenting shell functions, the recommended convention would be:

################################################################
# Preprocesses .jsonl messages data so that it's a valid
# input to the model (inputs tokenized, formatted with mask,
# etc.)
# then, data is trimmed to a determined length to make training
# go faster.
# Inputs:
#     None
# Globals:
#     - SAMPLE_DATA_PATH
#     - DATA_DIR
#     - MODEL_NAME
#     - SAMPLES_TRAINED_ON
#     - COMPUTED_DATA_PATH
# Returns:
#     None
################################################################

Member:

^ We should adopt the above convention so that it's easier for people to drop in and build on top of these tests in the future. Otherwise they will need to spend a lot more effort to understand why all of these things exist and what their purpose is.

Contributor Author:

👍

Member:

Thank you for doing this, it looks great now!

}

# ############### Setup and tests ###############
setup_tmpdir
Member:

We should probably move this into a main function:

Suggested change
setup_tmpdir
function main() {
setup_tmpdir
trap "rm -rf $TMP_DIR" EXIT
#NOTE (jkunstle): script is run as though it's
# in the same source dir as main_ds and data_process.
cd "$CORRECT_WORKING_DIR"
echo "CURRENT WORKING DIRECTORY: $(pwd)"
prepare_data
test_standard_loop_noflashattention_nogranite
_cleanup_saved_checkpoints
test_standard_loop_nongranite
_cleanup_saved_checkpoints
test_standard_loop
}
main

Contributor Author:

👍


1. No Flash Attention or Granite
2. No Granite but Flash Attention enabled
3. Granite and Flash Attention enabled
Member:

We should consider making this a checkmark list so we can check things off over time:

- [ ] No Flash Attention or Granite
- [ ] No Granite but Flash Attention enabled
- [ ] Granite and Flash Attention enabled

Contributor Author:

These are done

Member:

Even better

tests/README.md Outdated

The testing script can be run without parameters as `./smoketest.sh`. By default, this will run all tests with `FSDP` as the distributed training backend. To change the distributed training backend to the other available option, one can run the script as `./smoketest.sh deepspeed`.

> NOTE: You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.
Member:

GitHub actually has a built-in syntax for this:

Suggested change
> NOTE: You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.
> [!NOTE]
> You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.

It renders like this:

Note

You'll need to install the training library to run the test. Inside a virtual environment and inside the repo, please run `pip3 install -e .` to install the package in editable mode.

Contributor Author:

👍

@@ -0,0 +1,145 @@
#!/usr/bin/env bash
set -eux
Member:

I would also add `-o pipefail` here, as this will cause the script to stop if any command within a pipeline fails:

Suggested change
set -eux
set -eux -o pipefail
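A short, standalone demonstration of the difference:

```shell
# Without pipefail, a pipeline's exit status is the last command's,
# so `set -e` misses the failing producer:
bash -c 'set -e; false | cat; echo "failure went unnoticed"'

# With pipefail, the pipeline reports the failure and `set -e` aborts
# before the next command runs:
if ! bash -c 'set -e -o pipefail; false | cat; echo unreachable'; then
  echo "pipeline failure caught"
fi
```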

Contributor Author:

👍

@RobotSail (Member) left a comment:

Looks good, some minor comments. Once they are addressed I will LGTM

@RobotSail (Member):

The main thing is that, to make it easier for new developers to come on board and build on top of what we have, we should standardize on the Google Shell Style Guide. This will make it much easier for newcomers to understand how things work, why they're organized the way they are, and make changes with confidence.


# ############### User-modifiable parameters ###############
# Change these as needed
NUM_GPUS=8
Member:

Should we go ahead and make this a param?

Contributor Author:

👍


1. No Flash Attention or Granite
2. No Granite but Flash Attention enabled
3. Granite and Flash Attention enabled
Contributor:

Do we want a fourth Dolomite path? I guess this might make more sense after #257

Contributor Author:

Yeah that'd be smart, we can add that in another iteration

Contributor Author:

created an issue for this

The testing script can be run without parameters as `./smoketest.sh`. By default, this will run all tests with `FSDP` as the distributed training backend. To change the distributed training backend to the other available option, one can run the script as `./smoketest.sh deepspeed`.

The second positional argument sets the number of GPUs, e.g. `./smoketest.sh fsdp 8` runs the test on 8 GPUs with FSDP as the distributed backend.

Contributor:

This is a little confusing. We have FSDP as the default, so I can use `smoketest.sh` as-is. But if I want to set the number of GPUs, do I also have to specify FSDP, since it is the "second positional argument"? Will it break if I just do `smoketest.sh 8`? We may want to instead have flags like `--backend`/`-b` and `--num-gpus`/`-n`.
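One possible shape for that interface, sketched with `getopts` (flag names follow the suggestion above and are not part of the merged script):

```shell
#!/usr/bin/env bash
# Hypothetical argument parsing for smoketest.sh; short flags only,
# since getopts does not handle long options like --backend natively.
BACKEND="fsdp"  # default distributed training backend
NUM_GPUS=8      # default GPU count

while getopts "b:n:" opt; do
  case "$opt" in
    b) BACKEND="$OPTARG" ;;
    n) NUM_GPUS="$OPTARG" ;;
    *) echo "usage: $0 [-b backend] [-n num_gpus]" >&2; exit 2 ;;
  esac
done

echo "backend=${BACKEND} gpus=${NUM_GPUS}"
```

With this, `./smoketest.sh -n 8` sets the GPU count without forcing the caller to restate the default backend.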

Contributor Author:

Yeah it's a little annoying because a bit more boilerplate is required to have that behavior. I would prefer for this to be an issue + feature rather than adding it to this PR just because I want this test to be available asap for some testing stuff elsewhere. Would that be chill or do you want the behavior now?

Contributor Author:

created an issue for this

test_standard_loop
}

main
Contributor:

missing newline at EOF

Contributor Author:

++ will fix

Contributor Author:

done

During development it's convenient to be able to run full distributed
training, even on a smaller dataset, just to make sure that nothing
obviously fails. This will also capture support for flash attention on
the machine that it's run on, and for granite models.

Signed-off-by: James Kunstle <[email protected]>
set -eux -o pipefail

# ############### Read-only parameters ###############
MODEL_NAME="instructlab/granite-7b-lab"
Member:

Just FYI - the Bash convention is to use `'` for constant-value strings and `"` for strings which you interpolate variables into. Not a big deal here, just something to be aware of.
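The difference in a couple of lines (values here are just for illustration):

```shell
MODEL_NAME='instructlab/granite-7b-lab'  # constant value: single quotes
echo 'model: $MODEL_NAME'   # single quotes suppress expansion
echo "model: $MODEL_NAME"   # double quotes interpolate the variable
# prints:
#   model: $MODEL_NAME
#   model: instructlab/granite-7b-lab
```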

Contributor Author:

++

# None
#######################################
function setup_tmpdir () {
mkdir "$CHECKPOINTS_DIR"
Member:

It may be wise here to use `mkdir -p` in case the parent directories don't exist for whatever reason.

Contributor Author:

That particular dir will always have extant parents because it's directly descended from the temp dir, so this would appropriately break if something doesn't work

Member:

That makes sense, let's keep it as is then.
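The trade-off discussed above, in a standalone sketch:

```shell
# Plain mkdir fails loudly if the parent chain is missing, which is the
# behavior the author wants here: a missing parent means setup is broken.
# mkdir -p would silently paper over that by creating the whole chain.
tmp="$(mktemp -d)"
if ! mkdir "$tmp/missing/child" 2>/dev/null; then
  echo "plain mkdir refused: parent does not exist"
fi
mkdir -p "$tmp/missing/child" && echo "mkdir -p created the whole chain"
rm -rf "$tmp"
```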

--output_dir="$CHECKPOINTS_DIR" \
--num_epochs=1 \
--effective_batch_size=128 \
--save_samples=0 \
Member:

Would it make sense to have this set to a non-zero value so the relevant code-path is tested? I can imagine our spaghetti saving logic potentially breaking and this not picking up on it.

Contributor Author:

That's a good point, we can consider that in another test- this is data-dependent so we might want to tie it to the number of samples.

Member:

Yes, definitely something we can do in the future.

@RobotSail (Member) left a comment:

LGTM, a few comments that should be looked at but nothing blocking. Great work!

@mergify mergify bot removed the one-approval label Oct 25, 2024
@JamesKunstle JamesKunstle removed the request for review from nathan-weinberg October 25, 2024 20:06
@mergify mergify bot merged commit 466474a into instructlab:main Oct 25, 2024
6 checks passed
Labels: testing (Relates to testing)
5 participants