
CNPack EKS Module

This module adds AWS-specific configuration for use with CNPack.

Resources Created

AWS Managed Prometheus and corresponding IAM roles
AWS Private Certificate Authority and corresponding IAM roles
Node Role Policy for FluentBit connection
Keycloak deployment (uses the AWS CSI driver to dynamically provision a 1G PV for the Keycloak PVC)

Requirements

The AWS user must have permissions to create infrastructure as well as IAM roles and permissions
Requires the CNPack binary, the AWS CLI, and kubectl; awscurl is optional and only needed to query Prometheus metrics from the command line
The AWS region must be one where Amazon Prometheus, GPU nodes, and the other required resources are available. For example, Amazon Prometheus is available in us-west-2 but not in us-west-1. Check ~/.aws/config to confirm that the configured region is one of the available regions; if it is not, run aws configure to set the region. All resources in this module have been tested in us-west-2.
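
The region check in the last requirement can be scripted. The following is a minimal sketch, not part of this module: check_region is a hypothetical helper, and it compares against us-west-2 only because that is the region this README reports as tested.

```shell
# Hypothetical helper, not part of CNPack or this module.
# us-west-2 is the only region this README reports as fully tested.
TESTED_REGION="us-west-2"

check_region() {
  # $1: the region name to check
  if [ "$1" = "$TESTED_REGION" ]; then
    echo "ok: $1 is the tested region"
  else
    echo "warning: $1 is not $TESTED_REGION; verify Amazon Prometheus and GPU node availability"
    return 1
  fi
}
```

With the AWS CLI installed, check_region "$(aws configure get region)" checks the region of the default profile.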

Usage

From this module directory, run terraform init
Uncomment or add values in the terraform.tfvars file in this directory; otherwise you will be prompted at cluster creation time for values such as cluster_name
If everything looks correct, run terraform apply
To delete the cluster, run terraform destroy
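
The steps above can be wrapped in a small function. This is a sketch under assumptions: deploy_module is a hypothetical name, and the cluster name is passed with Terraform's -var flag instead of terraform.tfvars. Note that Terraform's teardown subcommand is destroy.

```shell
# Sketch of the workflow above as one function. deploy_module is a
# hypothetical name; the terraform subcommands and -var flag are standard.
deploy_module() {
  # $1: cluster name, passed as the cluster_name input variable
  terraform init &&
  terraform apply -var="cluster_name=$1"
}

# Teardown uses `terraform destroy`, for example:
#   terraform destroy -var="cluster_name=cnpackcluster"
```

Calling deploy_module cnpackcluster from this directory still shows the usual plan and asks for confirmation before applying.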

Running CNPack with the CNPack Holoscan Cluster

Once the cluster is created update your kubeconfig:

aws eks update-kubeconfig --name cnpackcluster --region us-west-2

If you changed the name of the cluster the command is:

aws eks update-kubeconfig --name <cluster-name> --region us-west-2

Run terraform output to get the needed values to populate the CNPack config file
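
Instead of copying values from the console by hand, each output can be read individually with terraform output -raw. The sketch below assumes a Terraform version that supports -raw (0.14 or later); load_cnpack_outputs and the variable names are hypothetical, while the output names come from the Outputs table of this module.

```shell
# Hypothetical helper: read this module's Terraform outputs into
# environment variables. Run it from the module directory after apply.
load_cnpack_outputs() {
  AWS_PCA_ARN="$(terraform output -raw aws_pca_arn)" || return 1
  AMP_REMOTEWRITE_ENDPOINT="$(terraform output -raw amp_remotewrite_endpoint)" || return 1
  AMP_INGEST_ROLE_ARN="$(terraform output -raw amp_ingest_role_arn)" || return 1
  AMP_QUERY_ENDPOINT="$(terraform output -raw amp_query_endpoint)" || return 1
  export AWS_PCA_ARN AMP_REMOTEWRITE_ENDPOINT AMP_INGEST_ROLE_ARN AMP_QUERY_ENDPOINT
}
```

After load_cnpack_outputs succeeds, echo "$AWS_PCA_ARN" prints the value to enter in the CNPack config file.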

Sample Config File

Use the following config file (adding in the outputs from "terraform output") with CNPack to enable all AWS services:

apiVersion: v1alpha2
kind: NvidiaPlatform
spec:
  platform:
    wildcardDomain: "*.holoscandev.nvidia.com"
    externalPort: 443
    eks:
      region: us-west-2
  certManager:
    enabled: true
    awsPCA:
      enabled: true
      commonName: "cluster.local"
      domainName: "cluster.local"
      arn: "<aws_pca_arn from 'terraform output'>"
  prometheus:
    enabled: true
    awsRemoteWrite:
      url: "<amp_remotewrite_endpoint from 'terraform output'>"
      arn: "<amp_ingest_role_arn from 'terraform output'>"
  prometheusAdapter:
    enabled: true
  fluentbit:
    enabled: true
  trustManager:
    enabled: false
  keycloak:
    enabled: true
    databaseStorage:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1G
      storageClassName: gp2
  grafana:
    customHostname: grafana.cluster.local
    enabled: false
  elastic:
    enabled: false
  ingress:
    enabled: false
  postgres:
    enabled: true
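
The three placeholders in the config above can be filled in with sed. This is a minimal sketch: fill_cnpack_config is a hypothetical helper, it expects the placeholder text exactly as written in the sample, and it assumes the substituted values contain no |, & or backslash characters.

```shell
# Hypothetical helper: print a copy of the sample config with the three
# <... from 'terraform output'> placeholders replaced.
# Assumes the values contain no '|', '&' or '\' (special to sed here).
fill_cnpack_config() {
  # $1: template file
  # $2: aws_pca_arn  $3: amp_remotewrite_endpoint  $4: amp_ingest_role_arn
  sed \
    -e "s|<aws_pca_arn from 'terraform output'>|$2|" \
    -e "s|<amp_remotewrite_endpoint from 'terraform output'>|$3|" \
    -e "s|<amp_ingest_role_arn from 'terraform output'>|$4|" \
    "$1"
}
```

For example, with the sample saved as template.yaml (a hypothetical file name), fill_cnpack_config template.yaml "$(terraform output -raw aws_pca_arn)" "$(terraform output -raw amp_remotewrite_endpoint)" "$(terraform output -raw amp_ingest_role_arn)" > nvidiaplatform.yaml writes the file that cnpack install consumes.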

Certmanager with AWS PCA Plugin

Usage

Run terraform output to get the outputs from the CNPack cluster example
Copy the aws_pca_arn value from the console output into certManager.awsPCA.arn
Ensure awsPCA.enabled is set to true
Run cnpack install -f nvidiaplatform.yaml
Run kubectl get po -n nvidia-platform and check that a pod named nvidia-platform-aws-privateca-issuer-<random-number> exists

Validation

To validate that the AWS PCA cluster issuer is installed correctly and ready to issue certificates, run kubectl get awspcaclusterissuers.awspca.cert-manager.io
This directory contains a test certificate, testcert.yaml. Run kubectl apply -f testcert.yaml, followed by kubectl get cert -A. The READY column should show True for the certificate rsa-cert-4096.
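
The READY check can also be done in a script. The sketch below is an assumption-laden example: cert_ready is a hypothetical helper, and it assumes the default column layout of kubectl get cert -A (NAMESPACE, NAME, READY, SECRET, AGE).

```shell
# Hypothetical helper: read `kubectl get cert -A` output on stdin and print
# the READY value for one certificate. Assumes the columns are
# NAMESPACE NAME READY SECRET AGE. Exits non-zero if the name is absent.
cert_ready() {
  # $1: certificate name
  awk -v name="$1" '$2 == name { print $3; found = 1 } END { exit !found }'
}
```

Usage: kubectl get cert -A | cert_ready rsa-cert-4096 should print True once the certificate has been issued.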

AWS Managed Prometheus (AMP)

Usage

Run terraform output to get the outputs from the CNPack cluster example
Copy the amp_remotewrite_endpoint value from the console output into prometheus.awsRemoteWrite.url
Copy the amp_ingest_role_arn value from the console output into prometheus.awsRemoteWrite.arn
Ensure prometheus.enabled is set to true
Run cnpack install -f nvidiaplatform.yaml

Validation

Check Prometheus logs by running kubectl logs -n nvidia-monitoring prometheus-nvidia-prometheus-kube-pro-prometheus-0. You should see no errors within the prometheus pod.
Install awscurl, for example with pip install awscurl
Take the Terraform output for amp_query_endpoint and export it as an environment variable with the following:

export AMP_QUERY_ENDPOINT=<amp_query_endpoint>

Query that the Managed Prometheus is up and running:

awscurl -X POST --region us-west-2 --service aps ${AMP_QUERY_ENDPOINT}\?query=up
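
As an alternative to escaping the question mark, the query URL can be built and quoted in a small function. This is only a sketch; amp_query_url is a hypothetical helper, and it passes the query string through unchanged, so it suits simple expressions such as up.

```shell
# Hypothetical helper: build the AMP query URL from AMP_QUERY_ENDPOINT.
# The query is appended as-is (no URL encoding), so keep it simple.
amp_query_url() {
  # $1: PromQL query, e.g. up
  : "${AMP_QUERY_ENDPOINT:?export AMP_QUERY_ENDPOINT first}"
  printf '%s?query=%s\n' "$AMP_QUERY_ENDPOINT" "$1"
}
```

Usage: awscurl -X POST --region us-west-2 --service aps "$(amp_query_url up)"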

You can view the AWS Managed Prometheus workspace that was created in the AWS Console.

FluentBit to CloudWatch Logging

Usage

Ensure fluentbit.enabled is set to true
Run cnpack install -f nvidiaplatform.yaml

Validation

Check that the Fluentbit pod is in a running state:

kubectl get po -n nvidia-monitoring

You should see 2x Running pods named nvidia-fluentbit-aws-for-fluentbit-<random_number>
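
The pod count can be checked without reading the table by eye. The sketch below uses a hypothetical helper, count_running_fluentbit, and assumes the default kubectl get po columns (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Hypothetical helper: read `kubectl get po -n nvidia-monitoring` output on
# stdin and count Fluentbit pods whose STATUS column is Running.
count_running_fluentbit() {
  awk '$1 ~ /^nvidia-fluentbit-aws-for-fluentbit-/ && $3 == "Running" { n++ }
       END { print n + 0 }'
}
```

Usage: kubectl get po -n nvidia-monitoring | count_running_fluentbit prints the number of running Fluentbit pods.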

Open CloudWatch Log Groups in the AWS Console
Search for a log group named /aws/eks/fluentbit-cloudwatch/workload/<namespace>. Once you click on this log group, you should see application logs for the entire cluster.

Troubleshooting

Error creating Prometheus Workspace - no such host

│ Error: creating Prometheus Workspace: RequestError: send request failed

│ caused by: Post "https://aps.us-west-1.amazonaws.com/workspaces": dial tcp: lookup aps.us-west-1.amazonaws.com on 127.0.0.53:53: no such host

FIX: See the third item under Requirements at the top of this document and verify that your AWS region is configured correctly.

Requirements

No requirements.

Providers

Name	Version
aws	4.45.0
random	3.5.1

Modules

Name	Source	Version
holoscan-eks-cluster	../..	n/a

Resources

Name	Type
aws_acmpca_certificate.cnpack-pca	resource
aws_acmpca_certificate_authority.cnpack-pca	resource
aws_acmpca_certificate_authority_certificate.cnpack-pca	resource
aws_acmpca_permission.cnpack-pca	resource
aws_cloudwatch_log_group.cnpack-log-group	resource
aws_iam_policy.amp-ingest-policy	resource
aws_iam_policy.pca-policy	resource
aws_iam_role.amp-ingest-role	resource
aws_iam_role_policy_attachment.attach-amp-policy-to-gpu-ng	resource
aws_iam_role_policy_attachment.attach-amp-role-to-cpu-ng	resource
aws_iam_role_policy_attachment.attach-amp-role-to-policy	resource
aws_iam_role_policy_attachment.attach-cloudwatch-to-cpu-ng	resource
aws_iam_role_policy_attachment.attach-cloudwatch-to-gpu-ng	resource
aws_iam_role_policy_attachment.attach-cpu-node-policy	resource
aws_iam_role_policy_attachment.attach-gpu-node-policy	resource
aws_prometheus_workspace.cnpack-prom-workspace	resource
random_string.amp	resource
random_string.pca	resource
aws_caller_identity.current	data source
aws_iam_policy.cloudwatch-agent-server-policy	data source
aws_partition.current	data source

Inputs

Name	Description	Type	Default	Required
amp_enabled	Set to true to enable, false to disable	`bool`	`true`	no
cluster_name	Name of the cluster	`string`	n/a	yes
common_name	Common Name for PCA Creation	`string`	`"cluster.local"`	no
fluentbit_enabled	Set to true to enable, false to disable	`bool`	`true`	no
metrics_server_enabled	Set to true to enable the network support for Metrics Server, false to disable	`bool`	`false`	no
pca_enabled	Set to true to enable, false to disable	`bool`	`true`	no
prom_adapter_enabled	Set to true to enable the network support for Prometheus Adapter, false to disable	`bool`	`true`	no

Outputs

Name	Description
amp_ingest_role_arn	n/a
amp_query_endpoint	Output Prometheus Query Write Endpoint
amp_remotewrite_endpoint	Output Prometheus Remote Write Endpoint
aws_pca_arn	Output the PCA Arn for use in CNPack
