Added Ray Train & Pytorch Lightning demo #559
Conversation
# Data
transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
data_dir = os.path.join(tempfile.gettempdir(), "data")
train_data = FashionMNIST(
Are we sure the data is shared across workers?
Looked into this and I would say probably not, after finding out that the DistributedSampler exists.
I will update this script and the llama2 one to make use of the DistributedSampler 👍
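For reference, a minimal sketch of how DistributedSampler shards a dataset across workers. This uses a toy in-memory dataset rather than FashionMNIST, and passes num_replicas/rank explicitly instead of reading them from an initialized process group (which is what would happen inside a real Ray Train worker):

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy stand-in for FashionMNIST: 10 samples, one feature each.
dataset = TensorDataset(torch.arange(10).float())

# Simulate a 2-worker job. In a real distributed run, num_replicas/rank are
# read from the initialized process group; Ray Train's prepare_data_loader
# can also wrap the DataLoader with a DistributedSampler for you.
shards = []
for rank in range(2):
    sampler = DistributedSampler(dataset, num_replicas=2, rank=rank, shuffle=False)
    loader = DataLoader(dataset, batch_size=5, sampler=sampler)
    shards.append(list(sampler))

# Each simulated worker sees a disjoint half of the indices;
# together they cover every sample exactly once.
print(sorted(shards[0] + shards[1]))
```

Without a sampler like this, every worker would iterate the full dataset and the epochs would duplicate work rather than partition it.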
Everything seems to work fine AFAICT. I ran into some issues around storage just because I usually use the team's S3 bucket under my own path, and I don't want to muck it up by accidentally polluting the root path.
# Based on https://docs.ray.io/en/latest/train/getting-started-pytorch-lightning.html

"""
Note: This example requires an S3 compatible storage bucket for distributed training. Please visit our documentation for more information -> https://github.com/project-codeflare/codeflare-sdk/blob/main/docs/s3-compatible-storage.md
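For context, the core wiring in the Ray getting-started guide that the header cites looks roughly like this (a sketch, not the demo's exact code; the LightningModule and DataLoader are elided):

```python
import pytorch_lightning as pl
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)

def train_func(config):
    model = ...         # your pl.LightningModule
    train_loader = ...  # your per-worker DataLoader
    trainer = pl.Trainer(
        devices="auto",
        accelerator="auto",
        strategy=RayDDPStrategy(),             # DDP over Ray's process group
        plugins=[RayLightningEnvironment()],
        callbacks=[RayTrainReportCallback()],  # reports metrics/checkpoints to Ray
        enable_checkpointing=False,
    )
    trainer = prepare_trainer(trainer)
    trainer.fit(model, train_dataloaders=train_loader)

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
# result = trainer.fit()
```

The distributed checkpointing done by RayTrainReportCallback is what makes the S3-compatible bucket a requirement: each worker needs a shared location to write to.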
How do I configure what path to actually use within the bucket for the distributed training?
I created my own bucket with its own path via AWS and gathered the URI using the UI.
s3://mark-bucket/data/
I was not aware we had a shared bucket but you could create a new folder within it and then copy the URI from there.
I made a couple of changes locally to be able to run the content in the notebooks locally, but I'm sure it's me messing something up in the setup 😅
Going to /lgtm and /approve this from my end since it works overall and my workarounds are unrelated! :))
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: varshaprasad96. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment
@varshaprasad96 The only changes needed would have been the addition of S3 or minio storage. Is that what you had to change?
That, and I'm not exactly sure of the right steps to be able to run these notebooks. I had to create a separate venv, install all the deps, and change references to import and run this. Is there something I was missing while configuring to be able to reproduce the demos?
On RHOAI in your workbench you should be able to clone the repo and this PR branch via a terminal.
I see! I had been using a ROSA cluster, manually installing the components (not through the OpenShift AI operator) and trying to run the examples. This seems similar to what you mentioned. Will check it out again!
Issue link
RHOAIENG-7805
What changes have been made
Added a demo notebook and python script based on the Ray Train & Pytorch Lightning example provided by Ray.
Verification steps
Setup
Notebook server ODH/RHOAI/Local
git clone https://github.com/project-codeflare/codeflare-sdk.git
pip install codeflare-sdk
Testing
Run through the entire demo notebook.
Test the minio and S3 persistent storage examples separately by following the comments in pytorch_lightning.py
A few things to note:
Checks