fix: improve SkyPilot tutorial (#907)
Change-Id: Ie6fbbc7b456bd2fdfa7f3f2df5ca3ea6d77911c2
genlu2011 authored Dec 11, 2024
1 parent 51bf3dc commit 1505921
Showing 1 changed file with 18 additions and 3 deletions.
21 changes: 18 additions & 3 deletions tutorials-and-examples/skypilot/README.md
@@ -11,7 +11,7 @@ SkyPilot is a framework for running AI and batch workloads on any infra, offerin
## The tutorial overview
In this tutorial, our persona is an ML scientist planning to run a batch workload for hyperparameter tuning. This workload involves two experiments, with each experiment requiring 4 GPUs to execute.

We have two GKE clusters in different regions: one in us-central1 with 4\*A100 and another in us-west1 with 4\*L4.

By the end of this tutorial, our goal is to have one experiment running in the us-central1 cluster and the other in the us-west1 cluster, demonstrating efficient resource distribution across regions.
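
The intended outcome can be pictured with a small first-fit sketch. This is only an illustration of the placement goal, not how SkyPilot schedules work internally. The cluster regions, GPU types, and the A100-then-L4 preference come from the tutorial; the data layout, the function name `place_experiments`, and the experiment names are assumptions made for this example.

```python
# Illustrative only: a toy first-fit placement, not SkyPilot's scheduler.
# The data layout, function name, and experiment names are hypothetical.

def place_experiments(experiments, clusters, preference, gpus_needed=4):
    """Assign each experiment to the first preferred GPU type with free capacity."""
    # Copy the capacity info so the caller's dictionary is not mutated.
    free = {name: dict(info) for name, info in clusters.items()}
    placement = {}
    for exp in experiments:
        for gpu_type in preference:
            target = next(
                (name for name, info in free.items()
                 if info["gpu"] == gpu_type and info["free"] >= gpus_needed),
                None,
            )
            if target is not None:
                free[target]["free"] -= gpus_needed
                placement[exp] = target
                break
        else:
            placement[exp] = None  # no cluster can host this experiment
    return placement


clusters = {
    "us-central1": {"gpu": "A100", "free": 4},
    "us-west1": {"gpu": "L4", "free": 4},
}
placement = place_experiments(["exp-1", "exp-2"], clusters, preference=["A100", "L4"])
print(placement)
```

Under these assumptions, the first experiment takes the four A100 GPUs and the second falls back to the L4 cluster, which is the distribution the tutorial aims for.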

@@ -88,10 +88,24 @@ cd ai-on-gke/tutorials-and-examples/skypilot
```bash
pip install -U "skypilot[kubernetes,gcp]"
```

Verify the installation:
```bash
sky check
```

This will produce a summary like:
```
Checking credentials to enable clouds for SkyPilot.
GCP: enabled
Kubernetes: enabled
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
```
If you encounter an error, please consult the [official documentation](https://docs.skypilot.co/en/latest/getting-started/installation.html).
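
If you want to check the result from a script, a minimal sketch along these lines may help. It assumes the summary uses the `<name>: enabled` line format shown in the sample above; actual `sky check` output can contain additional lines or formatting depending on the SkyPilot version, and the function name `enabled_clouds` is hypothetical.

```python
# A minimal sketch: extract enabled clouds from text shaped like the sample summary.
# Assumption: each enabled cloud appears on its own line as "<name>: enabled".

def enabled_clouds(summary: str) -> list[str]:
    clouds = []
    for line in summary.splitlines():
        name, sep, status = line.strip().partition(":")
        if sep and status.strip() == "enabled":
            clouds.append(name.strip())
    return clouds


sample = """\
Checking credentials to enable clouds for SkyPilot.
GCP: enabled
Kubernetes: enabled
SkyPilot will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
"""
print(enabled_clouds(sample))
```

For this tutorial, both `GCP` and `Kubernetes` should appear in the returned list before you continue.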

You can list the GPUs available in a GKE cluster:
```bash
sky show-gpus
```

@@ -120,7 +134,7 @@ kubernetes:
## Launch the jobs
Under `~/skypilot-test/ai-on-gke/tutorials-and-examples/skypilot`, you’ll find a file named `train.yaml`, which uses SkyPilot's syntax to define a job.
In the `resources` section, the job requests 4\*A100 first. If no capacity is found, it fails over to L4. The SkyPilot [documentation](https://docs.skypilot.co/en/latest/examples/auto-failover.html#multiple-candidate-resources) describes other supported syntax for specifying resource choices.
```yaml
resources:
  cloud: kubernetes
@@ -130,6 +144,7 @@ resources:

The `launch.py` file is a Python program that starts a hyperparameter tuning run with two jobs that vary the learning rate (LR) parameter. In production environments, such experiments are typically tracked using open-source frameworks like MLflow.

SkyPilot offers support for launching hyperparameter tuning tasks through its CLI using the `sky launch` command. For more details, refer to the [official documentation](https://docs.skypilot.co/en/latest/running-jobs/many-jobs.html#with-cli-and-config-files).
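
As a rough illustration of the CLI route, the sketch below builds one `sky launch` command per learning rate and prints it instead of running it. It is not the content of `launch.py`. The learning-rate values, the cluster names, and the assumption that `train.yaml` reads an environment variable named `LR` are all hypothetical; only the `sky launch` flags `-c`, `--env`, `-d`, and `-y` are taken from the SkyPilot CLI.

```python
# A minimal sketch: build one `sky launch` command per learning rate.
# Assumptions (not from the tutorial): the learning-rate values, the cluster
# names, and that train.yaml reads an environment variable named LR.
import shlex


def build_launch_commands(learning_rates, task_yaml="train.yaml"):
    commands = []
    for i, lr in enumerate(learning_rates, start=1):
        commands.append([
            "sky", "launch",
            "-c", f"lr-exp-{i}",   # hypothetical cluster name
            "--env", f"LR={lr}",   # sets the LR environment variable for the task
            "-d",                  # detach once the job is submitted
            "-y",                  # skip the confirmation prompt
            task_yaml,
        ])
    return commands


commands = build_launch_commands([0.01, 0.03])
for cmd in commands:
    print(shlex.join(cmd))  # print the command instead of executing it
```

To actually submit the jobs, each command list could be passed to `subprocess.run`, assuming SkyPilot is installed and `sky check` reports Kubernetes as enabled.
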

Start the training:
```bash
python launch.py