
Added Ray Train & PyTorch Lightning demo #559

Open · wants to merge 1 commit into main from lightning-example
Conversation

@Bobbins228 Bobbins228 commented Jun 10, 2024

Issue link

RHOAIENG-7805

What changes have been made

Added a demo notebook and Python script based on the Ray Train & PyTorch Lightning example provided by Ray.

Verification steps

Setup

Notebook server ODH/RHOAI/Local

  • Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
  • Checkout this PR's branch
  • Run pip install codeflare-sdk
  • Restart your notebook kernel

Testing

Run through the entire demo notebook.
Test the MinIO and S3 persistent storage examples separately by following the comments in pytorch_lightning.py.
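For the MinIO variant, the S3 client needs to be pointed at the MinIO endpoint as well as given credentials. A minimal sketch of one common way to do that, via environment variables; the endpoint URL and key values here are hypothetical placeholders, not values from this PR:

```shell
# Hypothetical MinIO credentials and endpoint -- replace with your own.
export AWS_ACCESS_KEY_ID="minio-access-key"
export AWS_SECRET_ACCESS_KEY="minio-secret-key"
export AWS_ENDPOINT_URL="https://minio.example.com:9000"
```

Depending on the pyarrow/AWS SDK version in use, the endpoint may instead need to be passed explicitly via `pyarrow.fs.S3FileSystem(endpoint_override=...)`, as described in the codeflare-sdk s3-compatible-storage documentation linked from the script.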


Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci bot requested review from astefanutti and dimakis June 10, 2024 14:11
@Bobbins228 Bobbins228 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 10, 2024
# Data
transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
data_dir = os.path.join(tempfile.gettempdir(), "data")
train_data = FashionMNIST(
    root=data_dir, train=True, download=True, transform=transform
)

Are we sure the data is shared across workers?

Contributor Author

I looked into this and I would say probably not, after finding out that DistributedSampler exists.

I will update this script and the llama2 one to make use of the DistributedSampler 👍
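For context on why the sampler matters here: torch.utils.data.DistributedSampler gives each worker a disjoint shard of the dataset indices, so workers do not all train on the same data. A minimal pure-Python sketch of the sharding logic (the real sampler also reshuffles per epoch; this only mirrors the round-robin split):

```python
import math

def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> list[int]:
    """Mimic DistributedSampler's round-robin sharding for one worker (rank)."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by recycling the first indices so every rank gets an equal share.
    indices += indices[: total_size - dataset_len]
    return indices[rank:total_size:num_replicas]

# With 10 samples and 3 workers, each worker sees 4 indices, and together
# the three shards cover the whole dataset.
shards = [shard_indices(10, 3, rank) for rank in range(3)]
```

In the training script this amounts to passing `sampler=DistributedSampler(dataset)` to the DataLoader; Ray Train's `ray.train.torch.prepare_data_loader` can also inject a DistributedSampler automatically.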

@Bobbins228 Bobbins228 force-pushed the lightning-example branch 2 times, most recently from b29c031 to 705e0cf Compare June 18, 2024 10:16
Collaborator

@KPostOffice KPostOffice left a comment

Everything seems to work fine, AFAICT. I ran into some issues around storage, just because I usually use the team's S3 bucket under my own path, and I don't want to muck it up by accidentally polluting the root path.

# Based on https://docs.ray.io/en/latest/train/getting-started-pytorch-lightning.html

"""
Note: This example requires an S3 compatible storage bucket for distributed training. Please visit our documentation for more information -> https://github.com/project-codeflare/codeflare-sdk/blob/main/docs/s3-compatible-storage.md
Collaborator

@KPostOffice KPostOffice Jun 24, 2024

How do I configure what path to actually use within the bucket for the distributed training?

Contributor Author

I created my own bucket with its own path via AWS and gathered the URI using the UI.
s3://mark-bucket/data/

I was not aware we had a shared bucket but you could create a new folder within it and then copy the URI from there.
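To scope runs to a folder inside a shared bucket, the prefix simply becomes part of the `storage_path` URI that Ray Train's `RunConfig` accepts. A small hypothetical helper to illustrate; the function and bucket names here are illustrative, not part of the PR:

```python
def s3_storage_path(bucket: str, *prefix: str) -> str:
    """Build an s3:// URI for a sub-folder of a (possibly shared) bucket."""
    parts = [p.strip("/") for p in (bucket, *prefix) if p.strip("/")]
    return "s3://" + "/".join(parts)

# e.g. keep experiments under your own prefix in a shared team bucket:
path = s3_storage_path("team-bucket", "users/mark", "lightning-demo")
```

The result would then be passed as `RunConfig(storage_path=path, name=...)`, so each user's checkpoints land under their own prefix rather than the bucket root.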

Contributor

@varshaprasad96 varshaprasad96 left a comment

I made a couple of changes locally to be able to run the content in the notebooks, but I'm sure it's me messing something up in the setup 😅

Going to /lgtm and /approve this from my end since it works overall and my workarounds are unrelated! :))

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 8, 2024

openshift-ci bot commented Jul 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varshaprasad96

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 8, 2024
@Bobbins228
Contributor Author

@varshaprasad96 The only changes needed would have been the addition of S3 or MinIO storage. Is that what you had to change?

@varshaprasad96
Contributor

The only changes needed would have been the addition of S3 or MinIO storage. Is that what you had to change?

That, and I'm not exactly sure of the right steps to be able to run these notebooks. I had to create a separate venv, install all the dependencies, and change import references to run this. Is there something I was missing while configuring the environment to reproduce the demos?

@Bobbins228
Contributor Author

On RHOAI, in your workbench, you should be able to clone the repo and this PR's branch via a terminal.
You can then install the latest version of the SDK, as no SDK changes were made, just notebooks.
Then you would have been able to run the demo within the workbench after restarting the notebook kernel.
What steps did you follow?

@varshaprasad96
Contributor

varshaprasad96 commented Jul 9, 2024

I see! I had been using a ROSA cluster, manually installing the components (not through the OpenShift AI operator), and trying to run the examples. This seems similar to what you mentioned. Will check it out again!
