Model Analyzer provides support deployment on a Kubernetes enabled cluster using helm charts. You can find information about helm charts here.
- A Kubernetes enabled cluster
- Helm
-
Install Kubernetes : Follow the steps in the NVIDIA Kubernetes Installation Docs to install Kubernetes, verify your installation, and troubleshoot any issues.
-
Set Default Container Runtime : Kubernetes does not yet support the
--gpus
options for running Docker containers, so all GPU nodes will need to register thenvidia
runtime as the default for Docker on all GPU nodes. Follow the directions in the NVIDIA Container Toolkit Installation Docs. -
Install NVIDIA Device Plugin : The NVIDIA Device Plugin is also required to use GPUs with Kubernetes. The device plugin provides a daemonset that automatically enumerates the number of GPUs on your worker nodes, and allows pods to run on them. Follow the directions in the NVIDIA Device Plugin Docs to deploy the device plugin on your cluster.
To begin, check that your cluster has all the necessary pods deployed.
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-5dc87d545c-5c9sp 1/1 Running 0 21m
kube-system calico-node-8dcn5 1/1 Running 0 21m
kube-system coredns-f9fd979d6-9l29n 1/1 Running 0 36m
kube-system coredns-f9fd979d6-mf775 1/1 Running 0 36m
kube-system etcd-user.nvidia.com 1/1 Running 0 36m
kube-system kube-apiserver-user.nvidia.com 1/1 Running 0 36m
kube-system kube-controller-manager-user.nvidia.com 1/1 Running 0 36m
kube-system kube-proxy-zhpv7 1/1 Running 0 36m
kube-system kube-scheduler-user.nvidia.com 1/1 Running 0 36m
kube-system nvidia-device-plugin-1607379880-dblhc 1/1 Running 0 11m
Before deploying the model analyzer, the directories that the container will mount must be specified in helm-chart/values.yaml
.
# Job timeout value specified in seconds
jobTimeout: 900
## Configurations for mounting volumes
# Local path to model directory
modelPath: /home/models
# Local path export model config variants
outputModelPath: /home/output_models
# Local path to export data
resultsPath: /home/results
# Local path to store checkpoints
checkpointPath: /home/checkpoints
## Images
images:
analyzer:
image: model-analyzer
triton:
image: nvcr.io/nvidia/tritonserver
tag: 22.04-py3
The model analyzer executable uses the config file defined in helm-chart/templates/config-map.yaml
. This config can be modified to supply arguments to model analyzer. Only the content under the config.yaml
section of the file should be modified.
apiVersion: v1
kind: ConfigMap
metadata:
name: analyzer-config
namespace: default
data:
config.yaml: |
######################
# Config for profile #
######################
override_output_model_repository: True
run_config_search_disable: True
triton_http_endpoint: localhost:8000
triton_grpc_endpoint: localhost:8001
triton_metrics_url: http://localhost:8002/metrics
concurrency: 1,2
batch_sizes: 1
profile_models:
resnet50_libtorch:
model_config_parameters:
instance_group:
-
kind: KIND_GPU
count: [1]
dynamic_batching:
######################
# Config for analyze #
######################
num_configs_per_model: 3
analysis_models:
resnet50_libtorch:
objectives:
perf_throughput: 10
constraints:
perf_latency_p99:
max: 15
######################
# Config for report #
######################
report_model_configs:
- resnet50_libtorch_i0
Now from the Model Analyzer root directory, we can deploy the helm chart.
~/model_analyzer$ helm install model-analyzer helm-chart
NAME: model-analyzer
LAST DEPLOYED: Mon Dec 7 15:09:14 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
Check that the model analyzer pod is running.
~/model_analyzer$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default model-analyzer-model-analyzer-t9rsl 1/1 Running 0 23s
kube-system calico-kube-controllers-5dc87d545c-5c9sp 1/1 Running 0 54m
kube-system calico-node-8dcn5 1/1 Running 0 54m
kube-system coredns-f9fd979d6-9l29n 1/1 Running 0 69m
kube-system coredns-f9fd979d6-mf775 1/1 Running 0 69m
kube-system etcd-user.nvidia.com 1/1 Running 0 69m
kube-system kube-apiserver-user.nvidia.com 1/1 Running 0 69m
kube-system kube-controller-manager-user.nvidia.com 1/1 Running 0 69m
kube-system kube-proxy-zhpv7 1/1 Running 0 69m
kube-system kube-scheduler-user.nvidia.com 1/1 Running 0 69m
kube-system nvidia-device-plugin-1607379880-dblhc 1/1 Running 0 44m
You can find the results upon completion of the job in the directory passed as the resultsPath
in helm-chart/values.yaml
.
~/model_analyzer$ ls -l /home/results
total 12
drwxr-xr-x 4 root root 4096 Jun 2 17:00 plots
drwxr-xr-x 4 root root 4096 Jun 2 17:00 reports
drwxr-xr-x 2 root root 4096 Jun 2 17:00 results
~/model_analyzer$ cat /home/results/results/*
Model,GPU ID,Batch,Concurrency,Model Config Path,Instance Group,Preferred Batch Sizes,Satisfies Constraints,GPU Memory Usage (MB),GPU Utilization (%),GPU Power Usage (W)
resnet50_libtorch,0,1,2,resnet50_libtorch_i0,1/GPU,[32],Yes,1099.0,16.2,85.3
resnet50_libtorch,0,1,1,resnet50_libtorch_i0,1/GPU,[32],Yes,1099.0,14.4,82.2
Model,Batch,Concurrency,Model Config Path,Instance Group,Preferred Batch Sizes,Satisfies Constraints,Throughput (infer/sec),p99 Latency (ms),RAM Usage (MB)
resnet50_libtorch,1,2,resnet50_libtorch_i0,1/GPU,[32],Yes,195.0,11.0,2897.0
resnet50_libtorch,1,1,resnet50_libtorch_i0,1/GPU,[32],Yes,164.0,8.1,2937.0
Model,GPU ID,GPU Memory Usage (MB),GPU Utilization (%),GPU Power Usage (W)
triton-server,0,277.0,0.0,56.5