
Added Ray Train & PyTorch Lightning demo #559

Open · wants to merge 1 commit into main from lightning-example
Conversation

@Bobbins228 Bobbins228 commented Jun 10, 2024

Issue link

RHOAIENG-7805

What changes have been made

Added a demo notebook and Python script based on the Ray Train & PyTorch Lightning example provided by Ray.

Verification steps

Setup

Notebook server ODH/RHOAI/Local

  • Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
  • Checkout this PR's branch
  • Run pip install codeflare-sdk
  • Restart your notebook kernel

Testing

Run through the entire demo notebook.
Test the MinIO and S3 persistent storage examples separately by following the comments in pytorch_lightning.py.
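For the MinIO variant, the S3 client needs to be pointed at the MinIO endpoint as well as given credentials. A minimal sketch of one common way to do that, via environment variables; the endpoint URL and key values here are hypothetical placeholders, not values from this PR:

```shell
# Hypothetical MinIO credentials and endpoint -- replace with your own.
export AWS_ACCESS_KEY_ID="minio-access-key"
export AWS_SECRET_ACCESS_KEY="minio-secret-key"
export AWS_ENDPOINT_URL="https://minio.example.com:9000"
```

Depending on the pyarrow/AWS SDK version in use, the endpoint may instead need to be passed explicitly via `pyarrow.fs.S3FileSystem(endpoint_override=...)`, as described in the codeflare-sdk s3-compatible-storage documentation linked from the script.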


Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci bot requested review from astefanutti and dimakis June 10, 2024 14:11
@Bobbins228 Bobbins228 added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 10, 2024
# Data
transform = Compose([ToTensor(), Normalize((0.5,), (0.5,))])
data_dir = os.path.join(tempfile.gettempdir(), "data")
train_data = FashionMNIST(
    root=data_dir, train=True, download=True, transform=transform
)

Are we sure the data is shared across workers?

Contributor Author

I looked into this and I would say probably not, after finding out that DistributedSampler exists.

I will update this script and the llama2 one to make use of the DistributedSampler 👍
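For context on why the sampler matters here: torch.utils.data.DistributedSampler gives each worker a disjoint shard of the dataset indices, so workers do not all train on the same data. A minimal pure-Python sketch of the sharding logic (the real sampler also reshuffles per epoch; this only mirrors the round-robin split):

```python
import math

def shard_indices(dataset_len: int, num_replicas: int, rank: int) -> list[int]:
    """Mimic DistributedSampler's round-robin sharding for one worker (rank)."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    # Pad by recycling the first indices so every rank gets an equal share.
    indices += indices[: total_size - dataset_len]
    return indices[rank:total_size:num_replicas]

# With 10 samples and 3 workers, each worker sees 4 indices, and together
# the three shards cover the whole dataset.
shards = [shard_indices(10, 3, rank) for rank in range(3)]
```

In the training script this amounts to passing `sampler=DistributedSampler(dataset)` to the DataLoader; Ray Train's `ray.train.torch.prepare_data_loader` can also inject a DistributedSampler automatically.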

@Bobbins228 Bobbins228 force-pushed the lightning-example branch 2 times, most recently from b29c031 to 705e0cf Compare June 18, 2024 10:16
Collaborator

@KPostOffice KPostOffice left a comment

Everything seems to work fine, AFAICT. I ran into some issues around storage, just because I usually use the team's S3 bucket under my own path, and I don't want to muck it up by accidentally polluting the root path.

# Based on https://docs.ray.io/en/latest/train/getting-started-pytorch-lightning.html

"""
Note: This example requires an S3 compatible storage bucket for distributed training. Please visit our documentation for more information -> https://github.com/project-codeflare/codeflare-sdk/blob/main/docs/s3-compatible-storage.md
Collaborator

@KPostOffice KPostOffice Jun 24, 2024

How do I configure what path to actually use within the bucket for the distributed training?

Contributor Author

I created my own bucket with its own path via AWS and gathered the URI using the UI.
s3://mark-bucket/data/

I was not aware we had a shared bucket but you could create a new folder within it and then copy the URI from there.
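To scope runs to a folder inside a shared bucket, the prefix simply becomes part of the `storage_path` URI that Ray Train's `RunConfig` accepts. A small hypothetical helper to illustrate; the function and bucket names here are illustrative, not part of the PR:

```python
def s3_storage_path(bucket: str, *prefix: str) -> str:
    """Build an s3:// URI for a sub-folder of a (possibly shared) bucket."""
    parts = [p.strip("/") for p in (bucket, *prefix) if p.strip("/")]
    return "s3://" + "/".join(parts)

# e.g. keep experiments under your own prefix in a shared team bucket:
path = s3_storage_path("team-bucket", "users/mark", "lightning-demo")
```

The result would then be passed as `RunConfig(storage_path=path, name=...)`, so each user's checkpoints land under their own prefix rather than the bucket root.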

Contributor

@varshaprasad96 varshaprasad96 left a comment

I made a couple of changes locally to be able to run the content in the notebooks, but I'm sure it's me messing something up in the setup 😅

Going to /lgtm and /approve this from my end since it works overall and my workarounds are unrelated! :))

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 8, 2024

openshift-ci bot commented Jul 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varshaprasad96

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 8, 2024
@Bobbins228
Contributor Author

@varshaprasad96 The only changes needed would have been the addition of S3 or MinIO storage. Is that what you had to change?

@varshaprasad96
Contributor

The only changes needed would have been the addition of S3 or MinIO storage. Is that what you had to change?

That, and I'm not exactly sure of the right steps to be able to run these notebooks. I had to create a separate venv, install all the dependencies, and change import references to run this. Is there something I was missing while configuring the environment to reproduce the demos?

@Bobbins228
Contributor Author

On RHOAI, in your workbench, you should be able to clone the repo and this PR's branch via a terminal.
You can then install the latest version of the SDK, as no SDK changes were made, just notebooks.
Then you would have been able to run the demo within the workbench after restarting the notebook kernel.
What steps did you follow?

@varshaprasad96
Contributor

varshaprasad96 commented Jul 9, 2024

I see! I had been using a ROSA cluster, manually installing the components (not through the OpenShift AI operator), and trying to run the examples. This seems similar to what you mentioned. Will check it out again!
