Make data processing optional in run_training() #220
Conversation
Force-pushed from 9c1e00c to 9cd2ea6.
good addition, we can probably merge over the pylint stuff and fix those things in another PR since they're for code you didn't touch.
Thanks for the review @JamesKunstle! Would it be helpful to open another PR to update the pylint config and make it a little less strict? Looks like it mainly comes down to this value being too low (line 283 in b37c8ce).
Thanks for adding this @MichaelClifford! I would just rebase this on main (might fix some linting issues for you for free), and then for the rest you can check via
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 9cd2ea6 to 736f8cf.
Force-pushed from 736f8cf to d6cb293.
Thanks for the review @Maxusmusti and @JamesKunstle! Sorry for the delay on the rebase. Should be good now, and looks like all the tests are passing :)
@MichaelClifford Looks good! Just a couple of quick comments, but nothing blocking really
README.md (outdated):
If the machine's above have shared storage, users can preprocess the training dataset a single time so that it can then distributed to each machine with the following update:
machine's -> machines
then distributed -> then be distributed
Thanks 😄 done.
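For reference, a minimal sketch of the workflow that README line describes: preprocess the dataset once on shared storage, then launch training on each machine with in-loop processing disabled. Only `data_process.main()`, `run_training()`, and the `process_data` keyword are confirmed by this thread; the import locations and argument fields below are assumptions for illustration.

```python
# Sketch only: import locations and field names below are assumptions,
# not taken verbatim from this PR.
from instructlab.training import (
    DataProcessArgs,
    TorchrunArgs,
    TrainingArgs,
    data_process,
    run_training,
)

# Step 1: run once, on any node that can write to the shared storage.
data_process.main(
    DataProcessArgs(
        data_path="my_dataset.jsonl",     # raw training data
        data_output_path="/shared/data",  # visible to every machine
        model_path="my/base-model",       # tokenizer / chat template source
        max_seq_len=4096,
    )
)

# Step 2: on every machine, point training at the preprocessed data and
# skip the in-loop processing step via the flag added by this PR.
run_training(
    torch_args=TorchrunArgs(
        nnodes=2,
        nproc_per_node=8,
        node_rank=0,                  # varies per machine
        rdzv_id=123,
        rdzv_endpoint="node0:12345",
    ),
    train_args=TrainingArgs(
        model_path="my/base-model",
        data_path="/shared/data/data.jsonl",  # the preprocessed output
        ckpt_output_dir="checkpoints",
        max_seq_len=4096,
        num_epochs=1,
        effective_batch_size=128,
        learning_rate=2e-5,
        warmup_steps=25,
    ),
    process_data=False,  # the new keyword from this PR's diff
)
```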
Force-pushed from d6cb293 to d4cca46.
@MichaelClifford Awesome, thanks! Oleg is going to give this a spin, and then we'll get this merged today ✅
src/instructlab/training/__init__.py (outdated):
```diff
@@ -28,9 +28,13 @@
 # defer import of main_ds
-def run_training(torch_args: TorchrunArgs, train_args: TrainingArgs) -> None:
+def run_training(
+    torch_args: TorchrunArgs, train_args: TrainingArgs, process_data: bool = True
```
I would move the `process_data` arg out to live under `train_args`, unless there's a compelling reason to not do this. Keeping this function simple allows other consuming libraries to have a straightforward interface into our main training loop.
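To make that concrete, a hypothetical version of the same call with the flag living on `TrainingArgs` instead (the `process_data` field name is assumed, mirroring the keyword in the diff; imports as in the sketch earlier in this thread):

```python
# Hypothetical: the flag rides along on TrainingArgs rather than
# widening run_training()'s signature.
train_args = TrainingArgs(
    model_path="my/base-model",
    data_path="/shared/data/data.jsonl",
    process_data=False,  # assumed field name
    # ...remaining fields as before...
)

# run_training() keeps its original two-argument interface.
run_training(torch_args=torch_args, train_args=train_args)
```

This keeps the entry point stable for consumers like the CLI while still exposing the option.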
I tested locally and it seems to work. Please address the comment about `process_data` as an arg - we should keep this either as something that's a part of `train_args`, or we need to make a really good case for why it shouldn't be. Presently we are trying to keep our interface very simple to make it easy for the CLI & other tools to consume.
Co-authored-by: Michael Clifford <[email protected]>
Co-authored-by: Shreyanand <[email protected]>
Signed-off-by: Michael Clifford <[email protected]>
Signed-off-by: Michael Clifford <[email protected]>
Force-pushed from d4cca46 to 9a7986a.
Signed-off-by: Michael Clifford <[email protected]>
Force-pushed from 9a7986a to f97dca3.
LGTM!
This PR makes running `data_process.main()` optional in `run_training()`. This change is needed since it is not always desirable to process the data inside the training function, particularly in distributed training cases where it is beneficial to process the data once prior to training and then distribute the processed data, along with the training function, to each node.

The changes here have been made so that data processing inside of `run_training()` is still the default behavior.

I've updated the README.md to reflect how to run data processing independently of `run_training()`.
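In code, the two modes described above look roughly like this (a sketch assuming `torch_args`, `train_args`, and `data_process_args` are constructed as in the README; the `process_data` keyword is taken from this PR's diff):

```python
# Default behavior, unchanged: run_training() still invokes
# data_process.main() internally before training.
run_training(torch_args=torch_args, train_args=train_args)

# Distributed-friendly alternative: process the data a single time up
# front, then have every node skip the in-loop processing step.
data_process.main(data_process_args)
run_training(torch_args=torch_args, train_args=train_args, process_data=False)
```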