Pathways integration with XPK (#74)
* Initial changes for Pathways on XPK.

* User job becomes a part of the Pathways JobSet

* User job for Pathways is provided as the docker image.

* Cleaning up user args.

* Adding successPolicy, making job names unique to the workload.

* Remove extra user workload image args.

* Fixing a bug around XPK internal command parsing.

* GCS bucket as a parameter, some cleanup.

* Small YAML change to admit regular workload run in Pathways enabled cluster.

* Workload delete remains the same, updated Pathways image names.

* Updating kueue limits, marking command as required.

* Bumping Jobset to 0.4.0, moving exclusive topology to worker replicatedJob, enabling autoscaling for CPU nodepools.

* Including doc strings.

* Check cluster is Pathways enabled while accepting Pathways workload. Enable subnetwork for Pathways cluster only.

* Adding instructions to README for Pathways - XPK.

* Improve readability, restrict pw to TPUs, fix help.

* Move helpers back to xpk.py, add helpers for resource flavors.
RoshaniN authored Mar 19, 2024
1 parent 5890e0c commit afc4196
Showing 2 changed files with 582 additions and 44 deletions.
45 changes: 43 additions & 2 deletions README.md
@@ -125,6 +125,16 @@ all zones.
--num-slices=4 --spot
```

* Cluster Create for Pathways:
A Pathways-compatible cluster can be created with `--enable-pathways`:
```shell
python3 xpk.py cluster create \
--cluster xpk-pw-test \
--num-slices=4 --on-demand \
--tpu-type=v5litepod-16 \
--enable-pathways
```

* Cluster Create can be called again with the same `--cluster name` to modify
the number of slices or retry failed steps.
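
Re-running Cluster Create against the same cluster name is how the slice count is changed. The following is a minimal sketch of that pattern, not part of this commit. The helper `pw_cluster_cmd` is hypothetical and only prints the command so it can be reviewed before running. It uses only flags that appear in this README, and it assumes `--enable-pathways` is repeated on the second call.

```shell
# Hypothetical helper: prints the Pathways cluster create command
# for a given slice count. It does not run xpk.py itself.
pw_cluster_cmd() {
  num_slices="$1"
  printf '%s\n' "python3 xpk.py cluster create --cluster xpk-pw-test --num-slices=${num_slices} --on-demand --tpu-type=v5litepod-16 --enable-pathways"
}

# The first command creates the cluster with 4 slices. The second reuses the
# same --cluster name with a different --num-slices to modify that cluster.
pw_cluster_cmd 4
pw_cluster_cmd 8
```

Piping a printed line to `sh` from the directory that contains `xpk.py` would run it.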

@@ -196,8 +206,39 @@ all zones.

```shell
python3 xpk.py workload create \
--workload xpk-test-workload --command "echo goodbye" --cluster \
xpk-test --tpu-type=v5litepod-16
--workload xpk-test-workload --command "echo goodbye" \
--cluster xpk-test \
--tpu-type=v5litepod-16
```

* Workload Create for Pathways:
A Pathways workload can be submitted with `--use-pathways` on a Pathways-enabled cluster (one created with `--enable-pathways`).

Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-pw-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--use-pathways \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```

A regular workload can also be submitted on a Pathways-enabled cluster (one created with `--enable-pathways`).

Regular workload example:
```shell
python3 xpk.py workload create \
--workload xpk-regular-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```
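
The two workload examples above differ only in the workload name, the `--use-pathways` flag, and the command passed to the container. The sketch below makes that difference explicit. The helper `workload_cmd` is hypothetical and not part of xpk, and the commands are shortened to `steps=300` for illustration. It prints the command instead of running it, and the `<maxtext docker image>` placeholder is left unfilled.

```shell
# Hypothetical helper: prints a workload create command for the xpk-pw-test cluster.
# $1 = workload name, $2 = "pathways" or "regular", $3 = command run in the container.
workload_cmd() {
  name="$1"
  mode="$2"
  user_cmd="$3"
  pw_flag=""
  # Only Pathways workloads carry --use-pathways; regular workloads omit it.
  if [ "$mode" = "pathways" ]; then
    pw_flag=" --use-pathways"
  fi
  printf '%s\n' "python3 xpk.py workload create --workload ${name} --num-slices=1 --tpu-type=v5litepod-16${pw_flag} --cluster xpk-pw-test --docker-name='user-workload' --docker-image=<maxtext docker image> --command='${user_cmd}'"
}

workload_cmd xpk-pw-test pathways 'bash /usr/pathways/ifrt/maxtext_entrypoint.sh steps=300'
workload_cmd xpk-regular-test regular 'python3 MaxText/train.py MaxText/configs/base.yml steps=300'
```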

### Set `max-restarts` for production jobs