This module adds AWS-specific configuration for use with CNPack:
- AWS Managed Prometheus and corresponding IAM roles
- AWS Private Certificate Authority and corresponding IAM roles
- Node Role Policy for FluentBit connection
- Keycloak deployment (uses the AWS CSI driver to create a 1 GB dynamic PV for the Keycloak PVC)
- The AWS user must have permissions to create infrastructure, IAM roles, and IAM permissions.
- Requires the CNPack binary, the AWS CLI, and kubectl. awscurl is optional and is only needed to query Prometheus metrics from the command line.
- The AWS region must be set to a region where Amazon Prometheus, GPU nodes, and the other required resources are available. For example, Amazon Prometheus is available in `us-west-2` but not in `us-west-1`. Check `~/.aws/config` to make sure that the configured region is one of the available regions. If it is not, run `aws configure` to set the region accordingly. We have tested that all of our resources can be created in `us-west-2`.
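The region requirement above can be checked before running Terraform. The following is a minimal sketch, not part of the module: `TESTED_REGIONS` and `region_supported` are hypothetical names, and the list contains only `us-west-2` because that is the only region this document states was tested.

```shell
#!/usr/bin/env bash
# Regions this README states were tested. Only us-west-2 is confirmed by the
# text; extend the list (assumption) after verifying another region yourself.
TESTED_REGIONS="us-west-2"

# Succeeds (exit 0) when the given region appears in TESTED_REGIONS.
region_supported() {
  local region="$1" r
  for r in $TESTED_REGIONS; do
    [ "$r" = "$region" ] && return 0
  done
  return 1
}

# Report on the region the AWS CLI is configured to use, if the CLI is present.
if command -v aws >/dev/null 2>&1; then
  current_region="$(aws configure get region 2>/dev/null || true)"
  if region_supported "$current_region"; then
    echo "Region '$current_region' is in the tested list."
  else
    echo "Region '$current_region' is not in the tested list; run 'aws configure' to change it."
  fi
fi
```

The check only prints a message and never aborts, so it is safe to source from another script.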
- From this module, run `terraform init`.
- Uncomment or add values in the `terraform.tfvars` file in this directory; otherwise, you will be prompted at cluster creation time for values such as `cluster_name`.
- If everything looks correct, run `terraform apply`.
- To delete the cluster, run `terraform destroy`.
- Once the cluster is created, update your kubeconfig:
  `aws eks update-kubeconfig --name cnpackcluster --region us-west-2`
  If you changed the name of the cluster, the command is:
  `aws eks update-kubeconfig --name <cluster-name> --region us-west-2`
- Run `terraform output` to get the values needed to populate the CNPack config file.
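For the `terraform.tfvars` step above, the file can also be generated rather than edited by hand. This is a sketch under assumptions: `render_tfvars` is a hypothetical helper, and it covers only four of the inputs listed in the Inputs table at the end of this document (`cluster_name`, `amp_enabled`, `pca_enabled`, `fluentbit_enabled`).

```shell
# Print a terraform.tfvars body for this module to stdout.
# $1 = cluster name (required); $2..$4 = amp/pca/fluentbit toggles,
# each defaulting to true as in the Inputs table.
render_tfvars() {
  local name="$1" amp="${2:-true}" pca="${3:-true}" fluentbit="${4:-true}"
  if [ -z "$name" ]; then
    echo "cluster_name is required" >&2
    return 1
  fi
  cat <<EOF
cluster_name      = "$name"
amp_enabled       = $amp
pca_enabled       = $pca
fluentbit_enabled = $fluentbit
EOF
}

# Usage (writes the file in the current directory, so left commented out):
# render_tfvars cnpackcluster > terraform.tfvars
```

Setting `cluster_name` this way avoids the interactive prompt at cluster creation time.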
Use the following config file (adding in the outputs from `terraform output`) with CNPack to enable all AWS services:

    apiVersion: v1alpha2
    kind: NvidiaPlatform
    spec:
      platform:
        wildcardDomain: "*.holoscandev.nvidia.com"
        externalPort: 443
        eks:
          region: us-west-2
      certManager:
        enabled: true
        awsPCA:
          enabled: true
          commonName: "cluster.local"
          domainName: "cluster.local"
          arn: "<aws_pca_arn from 'terraform output'>"
      prometheus:
        enabled: true
        awsRemoteWrite:
          url: "<amp_remotewrite_endpoint from 'terraform output'>"
          arn: "<amp_ingest_role_arn from 'terraform output'>"
      prometheusAdapter:
        enabled: true
      fluentbit:
        enabled: true
      trustManager:
        enabled: false
      keycloak:
        enabled: true
        databaseStorage:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 1G
          storageClassName: gp2
      grafana:
        customHostname: grafana.cluster.local
        enabled: false
      elastic:
        enabled: false
      ingress:
        enabled: false
      postgres:
        enabled: true
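The three placeholders in the config file can be filled in from the Terraform outputs with a small filter. This is a sketch rather than a supported tool: `fill_cnpack_config` and the template file name in the usage comment are hypothetical, and the function assumes the output values contain no `|` or `&` characters, which holds for ARNs and URLs of the usual shape.

```shell
# Replace the three "<... from 'terraform output'>" placeholders in a config
# template read from stdin and write the result to stdout.
# Arguments: PCA ARN, remote-write URL, ingest role ARN.
# '|' is the sed delimiter, so the values must not contain '|' or '&'.
fill_cnpack_config() {
  local pca_arn="$1" rw_url="$2" ingest_arn="$3"
  sed \
    -e "s|<aws_pca_arn from 'terraform output'>|$pca_arn|" \
    -e "s|<amp_remotewrite_endpoint from 'terraform output'>|$rw_url|" \
    -e "s|<amp_ingest_role_arn from 'terraform output'>|$ingest_arn|"
}

# Usage (needs applied Terraform state, so left commented out; the template
# file name is an assumption):
# fill_cnpack_config \
#   "$(terraform output -raw aws_pca_arn)" \
#   "$(terraform output -raw amp_remotewrite_endpoint)" \
#   "$(terraform output -raw amp_ingest_role_arn)" \
#   < nvidiaplatform.template.yaml > nvidiaplatform.yaml
```

`terraform output -raw` prints a single string output without quotes, which is what makes it convenient for substitution here.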
- Run `terraform output` to get the outputs from the CNPack cluster example.
- Copy the value of `aws_pca_arn` from the console output into `certManager.awsPCA.arn`.
- Ensure `awsPCA.enabled` is set to `true`.
- Run `cnpack install -f nvidiaplatform.yaml`.
- Run `kubectl get po -n nvidia-platform` and check that a pod named `nvidia-platform-aws-privateca-issuer-<random-number>` exists.
- To validate that the AWS PCA cluster issuer is installed correctly and ready to issue certificates, run `kubectl get awspcaclusterissuers.awspca.cert-manager.io`.
- This directory contains a test certificate, `testcert.yaml`. Run `kubectl apply -f testcert.yaml`, followed by `kubectl get cert -A`. The `READY` column should show `True` for the certificate `rsa-cert-4096`.
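The final check above (the `READY` column showing `True` for `rsa-cert-4096`) can be scripted. The sketch below uses a hypothetical helper, `cert_is_ready`, which reads the table printed by `kubectl get cert -A` and locates the `NAME` and `READY` columns by their header text, so it makes no assumption about column order.

```shell
# Read `kubectl get cert -A` output on stdin and succeed only if the named
# certificate has "True" in its READY column. Columns are found by header name.
cert_is_ready() {
  awk -v want="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME")  name_col = i
        if ($i == "READY") ready_col = i
      }
      next
    }
    name_col && ready_col && $name_col == want && $ready_col == "True" { found = 1 }
    END { exit(found ? 0 : 1) }
  '
}

# Usage (needs cluster access, so left commented out):
# kubectl get cert -A | cert_is_ready rsa-cert-4096 && echo "certificate is ready"
```

Issuance can take a short while after `kubectl apply`, so the check may need to be repeated before it succeeds.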
- Run `terraform output` to get the outputs from the CNPack cluster example.
- Copy the value of `amp_remotewrite_endpoint` from the console output into `prometheus.awsRemoteWrite.url`.
- Copy the value of `amp_ingest_role_arn` from the console output into `prometheus.awsRemoteWrite.arn`.
- Ensure `awsPCA.enabled` is set to `true`.
- Run `cnpack install -f nvidiaplatform.yaml`.
- Check the Prometheus logs by running `kubectl logs -n nvidia-monitoring prometheus-nvidia-prometheus-kube-pro-prometheus-0`. You should see no errors in the Prometheus pod.
- Install awscurl, for example: `pip install awscurl`
- Take the Terraform output for `amp_query_endpoint` and export it as an environment variable:
  `export AMP_QUERY_ENDPOINT=<amp_query_endpoint>`
- Verify that Managed Prometheus is up and running:
  `awscurl -X POST --region us-west-2 --service aps ${AMP_QUERY_ENDPOINT}\?query=up`
- You can view the AWS Managed Prometheus workspace that was created in the AWS Console.
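The `awscurl` query above returns JSON in the Prometheus HTTP API format, where a healthy response carries `"status": "success"`. A minimal sketch for checking that in a script follows; `amp_query_succeeded` is a hypothetical name, and the check is a plain text match rather than a full JSON parse.

```shell
# Succeed if the response read from stdin contains a "status": "success"
# field, as Prometheus-compatible query APIs return on success.
# Whitespace around the colon is tolerated.
amp_query_succeeded() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"success"'
}

# Usage (needs AWS credentials and AMP_QUERY_ENDPOINT, so left commented out):
# awscurl -X POST --region us-west-2 --service aps "${AMP_QUERY_ENDPOINT}?query=up" \
#   | amp_query_succeeded && echo "AMP query OK"
```

Quoting the URL, as in the usage comment, removes the need to escape the `?` for the shell.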
- Ensure `awsPCA.enabled` is set to `true`.
- Run `cnpack install -f nvidiaplatform.yaml`.
- Check that the Fluentbit pods are in a running state:
  `kubectl get po -n nvidia-monitoring`
  You should see two `Running` pods named `nvidia-fluentbit-aws-for-fluentbit-<random_number>`.
- Go to the AWS Console for CloudWatch Log Groups.
- Search for a log group named `/aws/eks/fluentbit-cloudwatch/workload/<namespace>`. Once you click on this log group, you should see application logs for the entire cluster.
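The Fluentbit pod check above can be automated by counting matching pods. This sketch assumes the default table layout of `kubectl get po` with `NAME` and `STATUS` headers; `count_running_pods` is a hypothetical helper name.

```shell
# Count pods in `kubectl get po` output (stdin) whose name starts with the
# given prefix and whose STATUS column is "Running". Prints the count.
count_running_pods() {
  awk -v prefix="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME")   name_col = i
        if ($i == "STATUS") status_col = i
      }
      next
    }
    name_col && status_col && index($name_col, prefix) == 1 && $status_col == "Running" { n++ }
    END { print n + 0 }
  '
}

# Usage (needs cluster access, so left commented out):
# kubectl get po -n nvidia-monitoring | count_running_pods nvidia-fluentbit-aws-for-fluentbit-
```

A result of `2` matches the expectation stated in the steps above.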
- Error creating Prometheus Workspace - no such host
│ Error: creating Prometheus Workspace: RequestError: send request failed
│ caused by: Post "https://aps.us-west-1.amazonaws.com/workspaces": dial tcp: lookup aps.us-west-1.amazonaws.com on 127.0.0.53:53: no such host
FIX: See the third requirement above (AWS region) and verify that your AWS region is set to one where Amazon Prometheus is available.
No requirements.
Name | Version |
---|---|
aws | 4.45.0 |
random | 3.5.1 |
Name | Source | Version |
---|---|---|
holoscan-eks-cluster | ../.. | n/a |
Name | Description | Type | Default | Required |
---|---|---|---|---|
amp_enabled | Set to true to enable, false to disable | bool | true | no |
cluster_name | Name of the cluster | string | n/a | yes |
common_name | Common Name for PCA Creation | string | "cluster.local" | no |
fluentbit_enabled | Set to true to enable, false to disable | bool | true | no |
metrics_server_enabled | Set to true to enable the network support for Metrics Server, false to disable | bool | false | no |
pca_enabled | Set to true to enable, false to disable | bool | true | no |
prom_adapter_enabled | Set to true to enable the network support for Prometheus Adapter, false to disable | bool | true | no |
Name | Description |
---|---|
amp_ingest_role_arn | n/a |
amp_query_endpoint | Output Prometheus Query Write Endpoint |
amp_remotewrite_endpoint | Output Prometheus Remote Write Endpoint |
aws_pca_arn | Output the PCA Arn for use in CNPack |