This guide outlines the steps to configure the environment required to run benchmark recipes on a Google Kubernetes Engine (GKE) cluster with A3 Mega node pools.
Before you begin, ensure you have completed the following:
- Created a Google Cloud project with billing enabled.
  a. To create a project, see Creating and managing projects.
  b. To enable billing, see Verify the billing status of your projects.
- Enabled the following APIs:
- Requested enough GPU quota. Each a3-megagpu-8g machine has 8 H100 80GB GPUs attached.
  - To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g. location) and specify gpu_family:NVIDIA_H100_MEGA.
  - If you don't have enough quota, request a higher quota.
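To size a quota request, multiply the planned node count by the 8 GPUs attached to each a3-megagpu-8g machine. The following is a minimal sketch of that arithmetic; the `required_gpus` helper and the `NODE_COUNT` value are hypothetical and only illustrate the calculation.

```shell
#!/usr/bin/env bash
# Each a3-megagpu-8g machine has 8 H100 GPUs attached (stated above).
GPUS_PER_NODE=8

# required_gpus NODE_COUNT: prints the GPU quota needed for that many nodes.
required_gpus() {
  local nodes="$1"
  echo $(( nodes * GPUS_PER_NODE ))
}

# NODE_COUNT is a hypothetical value; replace it with your planned node pool size.
NODE_COUNT=16
echo "Nodes: ${NODE_COUNT}, GPU quota needed: $(required_gpus "${NODE_COUNT}")"
```

Compare the printed number with the quota shown for your project before you create the node pool.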
The environment comprises the following components:
- Client workstation: this is used to prepare, submit, and monitor ML workloads.
- Google Cloud Storage (GCS) Bucket: used for storing datasets and logs.
- Artifact Registry: serves as a private container registry for storing and managing Docker images used in the deployment.
- Google Kubernetes Engine (GKE) Cluster with A3 Mega Node Pools: provides a managed Kubernetes environment to run benchmark recipes.
You can use either a local machine or Google Cloud Shell as your client workstation.
We recommend using Google Cloud Shell because it comes with all necessary components pre-installed.
If you prefer to use your local machine, ensure that it has the following components installed:
- Google Cloud SDK. To install, see Install the gcloud CLI.
- kubectl. To install, see the Kubernetes documentation.
- Helm. To install, see the Helm documentation.
- Docker. To install, see the Docker documentation.
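If you use a local machine, a quick check like the following can confirm that the four tools are on your PATH. This is a minimal sketch: the `check_tools` helper is a hypothetical name, and the check only verifies that each command exists, not that its version is suitable.

```shell
#!/usr/bin/env bash
# check_tools TOOL...: prints each tool that is not on PATH and
# returns non-zero if any tool is missing.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing: ${tool}"
      missing=1
    fi
  done
  return "${missing}"
}

# Check the components listed above.
if check_tools gcloud kubectl helm docker; then
  echo "All required tools are installed."
else
  echo "Install the tools listed above before continuing."
fi
```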
gcloud storage buckets create gs://<BUCKET_NAME> --location=<BUCKET_LOCATION> --no-public-access-prevention
Replace the following:
- BUCKET_NAME: the name of your bucket. The name must comply with the Cloud Storage bucket naming conventions.
- BUCKET_LOCATION: the location of your bucket. The bucket must be located in the same region as the GKE cluster.
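Because the bucket must be in the same region as the cluster, it can help to compare the two locations after you create the bucket. The sketch below assumes that `gcloud storage buckets describe` exposes the bucket location in a `location` field and that it may be reported in upper case, so the comparison ignores case. The `bucket_location` and `same_location` helpers are hypothetical names.

```shell
#!/usr/bin/env bash
# bucket_location BUCKET_NAME: prints the location reported for the bucket.
# Requires an authenticated gcloud CLI; the "location" field name is an assumption.
bucket_location() {
  gcloud storage buckets describe "gs://$1" --format='value(location)'
}

# same_location A B: succeeds when the two location strings match, ignoring case.
same_location() {
  local a b
  a="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  b="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  [ "${a}" = "${b}" ]
}

# Demo with literal, hypothetical values (no API call is made here).
if same_location "US-CENTRAL1" "us-central1"; then
  echo "Locations match."
fi
```

With real values, you might run `same_location "$(bucket_location <BUCKET_NAME>)" <CLUSTER_REGION>` and recreate the bucket in the correct region if the check fails.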
- If you use Cloud KMS for repository encryption, create your Artifact Registry repository by using the instructions here.
- If you don't use Cloud KMS, you can create your repository by using the following command:
gcloud artifacts repositories create <REPOSITORY> --repository-format=docker --location=<LOCATION> --description="<DESCRIPTION>"
Replace the following:
- REPOSITORY: the name of the repository. For each repository location in a project, repository names must be unique.
- LOCATION: the regional or multi-regional location for the repository. You can omit this flag if you set a default region.
- DESCRIPTION: a description of the repository. Don't include sensitive data because repository descriptions are not encrypted.
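Once the repository exists, Docker images are addressed by a path built from the repository location, project ID, and repository name. The following sketch shows how that path is composed. The helper names and every value in the demo are hypothetical placeholders, not values from this guide.

```shell
#!/usr/bin/env bash
# registry_host LOCATION: prints the Artifact Registry Docker host for a location.
registry_host() {
  echo "$1-docker.pkg.dev"
}

# image_uri LOCATION PROJECT_ID REPOSITORY IMAGE TAG: prints the full image path.
image_uri() {
  echo "$(registry_host "$1")/$2/$3/$4:$5"
}

# Hypothetical placeholder values; replace them with your own.
LOCATION="us-central1"
PROJECT_ID="my-project"
REPOSITORY="my-repo"

echo "Registry host: $(registry_host "${LOCATION}")"
echo "Example image: $(image_uri "${LOCATION}" "${PROJECT_ID}" "${REPOSITORY}" "benchmark" "latest")"
```

Before pushing from a local machine, you would typically let Docker authenticate to that host with `gcloud auth configure-docker <LOCATION>-docker.pkg.dev`.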
To create a GKE cluster with A3 Mega node pools, GPUDirect-TCPXO, gVNIC, and multi-networking, see Maximize GPU network bandwidth in Standard mode clusters.
This documentation provides detailed instructions on the following tasks:
- Creation of the necessary VPC networks and subnets.
- Creation of a GKE cluster with multi-networking enabled.
- Creation of an A3 Mega node pool with NVIDIA H100 GPUs.
- Installation of the required components for GPUDirect and the NCCL plugin.
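After you complete those tasks, you may want to confirm that the A3 Mega nodes have joined the cluster and expose their GPUs. The sketch below is one way to do that, under two assumptions to verify against your own cluster: that the nodes carry the label `cloud.google.com/gke-accelerator=nvidia-h100-mega-80gb`, and that GPUs appear as the `nvidia.com/gpu` allocatable resource. The `list_gpu_nodes` helper is a hypothetical name.

```shell
#!/usr/bin/env bash
# Assumed node label for A3 Mega GPUs; confirm it against your node pool.
GPU_LABEL="cloud.google.com/gke-accelerator=nvidia-h100-mega-80gb"

# list_gpu_nodes: prints each matching node with its allocatable GPU count.
list_gpu_nodes() {
  kubectl get nodes -l "${GPU_LABEL}" --request-timeout=10s \
    -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}

# Only query the cluster when kubectl is installed and has a current context.
if command -v kubectl >/dev/null 2>&1 && kubectl config current-context >/dev/null 2>&1; then
  list_gpu_nodes || echo "Could not list nodes; check your cluster credentials."
else
  echo "kubectl is not configured; run 'gcloud container clusters get-credentials' first."
fi
```

Each a3-megagpu-8g node should report 8 GPUs once the GPU drivers and device plugin are running.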
Once you have set up your GKE cluster with A3 Mega node pools, you can proceed to deploy and run your benchmark recipes.
If you encounter any issues or have questions about this setup, use one of the following resources:
- Consult the official GKE documentation.
- Check the issues section of this repository for known problems and solutions.
- Reach out to Google Cloud support.