Infrastructure as code for GPU-accelerated managed Kubernetes clusters. These scripts automate the deployment of GPU-enabled Kubernetes clusters on various cloud service providers (CSPs).
Terraform is an open-source infrastructure as code software tool that we will use to automate the deployment of Kubernetes clusters with the required add-ons to enable NVIDIA GPUs. This repository contains Terraform modules, which are sets of Terraform configuration files ready for deployment. The modules in this repository can be incorporated into existing Terraform-managed infrastructure, or used to set up new infrastructure from scratch. You can learn more about Terraform here.
You can download Terraform (CLI) here.
NVIDIA offers support for Kubernetes through NVIDIA AI Enterprise. Refer to the product support matrix for supported managed Kubernetes platforms.
The Kubernetes clusters provisioned by the modules in this repository provide tested and certified versions of Kubernetes, the NVIDIA GPU Operator, and the NVIDIA Data Center Driver.
If your application does not require a specific version of Kubernetes, we recommend using the latest available version. We also recommend you plan to upgrade your version of Kubernetes at least every 6 months.
Each CSP has its own end-of-life dates for the Kubernetes versions it supports; for details, see each provider's Kubernetes release calendar and the table below.
| Version | Release Date | Kubernetes Versions | NVIDIA GPU Operator | NVIDIA Data Center Driver | End of Life |
|---|---|---|---|---|---|
| 0.2.0 | August 2023 | EKS - 1.26, GKE - 1.26, AKS - 1.26 | 23.3.2 | 535.54.03 (EKS & GKE) | EKS - June 2024, GKE - June 2024, AKS - March 2024 |
| 0.1.0 | June 2023 | EKS - 1.26, GKE - 1.26, AKS - 1.26 | 23.3.2 | 525.105.17 | EKS - June 2024, GKE - June 2024, AKS - March 2024 |
- Create an EKS Cluster
- Create an AKS Cluster
- Create a GKE Cluster
Call the EKS module by adding this to an existing Terraform file:
module "nvidia-eks" {
source = "git::github.com/nvidia/nvidia-terraform-modules/eks"
cluster_name = "nvidia-eks"
}
See the EKS README for all available configuration options.
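If you are starting from an empty directory rather than an existing configuration, a minimal root module might look like the following sketch. The region, version constraint, and file name are illustrative assumptions rather than requirements of the module; check the EKS README for the variables the module actually expects.

```hcl
# main.tf -- hypothetical minimal root module that calls the NVIDIA EKS module.
terraform {
  required_version = ">= 1.0" # illustrative constraint
}

provider "aws" {
  region = "us-west-2" # assumption: deploy to US West (Oregon)
}

module "nvidia-eks" {
  source       = "git::github.com/NVIDIA/nvidia-terraform-modules/eks"
  cluster_name = "nvidia-eks"
}
```

From that directory, `terraform init` downloads the module and providers, `terraform plan` previews the changes, and `terraform apply` provisions the cluster.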
Call the AKS module by adding this to an existing Terraform file:
module "nvidia-aks" {
source = "git::github.com/NVIDIA/nvidia-terraform-modules/aks"
cluster_name = "nvidia-aks-cluster"
admin_group_object_ids = [] # See description of this value in the AKS Readme
location = "us-west1"
}
See the AKS README for all available configuration options.
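If the `azurerm` provider is not already configured elsewhere in your configuration, note that it requires a provider block with the mandatory `features {}` argument. A minimal sketch, assuming you authenticate through the Azure CLI, is:

```hcl
# Required by the azurerm provider even when all of its settings are left at defaults.
provider "azurerm" {
  features {}
}
```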
Call the GKE module by adding this to an existing Terraform file:
module "nvidia-gke" {
source = "git::github.com/NVIDIA/nvidia-terraform-modules/gke"
cluster_name = "nvidia-gke-cluster"
project_id = "your-gcp-project-id"
region = "us-west1"
node_zones = ["us-west1-a"]
}
See the GKE README for all available configuration options.
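If your existing configuration does not already configure the `google` provider, a minimal sketch, assuming the same project and region as the module call above, is:

```hcl
provider "google" {
  project = "your-gcp-project-id" # assumption: same project as the module call
  region  = "us-west1"
}
```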
In each subdirectory, there is a Terraform module to provision the Kubernetes cluster and any additional prerequisite cloud infrastructure to launch CNPack. See CNPack on EKS, CNPack on GKE, and CNPack on AKS for more information and the sample CNPack configuration file.
More information on CNPack can be found in the NVIDIA AI Enterprise documentation.
These modules do not set up state management for the generated Terraform state file. Deleting the state file (`terraform.tfstate`) generated by Terraform could leave cloud resources that must be deleted manually, so we strongly encourage you to configure remote state.
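For example, a minimal sketch of a remote state configuration using the S3 backend is shown below; the bucket, key, region, and DynamoDB table are placeholders for resources that must already exist in your account. Equivalent backends are available for Azure (`azurerm`) and Google Cloud (`gcs`).

```hcl
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket" # placeholder: an existing S3 bucket
    key            = "nvidia-eks/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-locks" # placeholder: optional table used for state locking
    encrypt        = true
  }
}
```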
Please see the Terraform Documentation for more information.
Pull requests are welcome! Please see our contribution guidelines.
Please open an issue on the GitHub project for any questions. Your feedback is appreciated.