GPU-Enabled Kubernetes Cluster with kops for Metaflow

Setting Up the GPU Cluster Nodes

  • Run sh kops_gpu_setup.sh to add GPU instances to the cluster.
  • Wait for the new nodes to join the cluster.
  • Run sh gpu_setup/nvidia_one_time_setup.sh to add the NVIDIA DaemonSet to the kube-system namespace.
  • Validate that the GPUs are working: kubectl create -f gpu_setup/tf_pod.yml
  • Allow some time before checking the results. The device plugin takes a while to load and start scheduling.
# Check that nodes are detected to have GPUs
kubectl describe nodes|grep -E 'gpu:\s.*[1-9]'

# Check the logs of the TensorFlow container to ensure that it ran
kubectl logs tf-gpu

# Show GPU info from within the pod
#   Only works in DevicePlugin mode
kubectl exec -it tf-gpu -- nvidia-smi

# Show that TensorFlow detects GPUs from within the pod.
#   Only works in DevicePlugin mode
kubectl exec -it tf-gpu -- \
  python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'
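
Because the device plugin needs some time before nodes advertise their GPUs, the first check can be wrapped in a polling loop. The following is a minimal sketch, not part of the repository's scripts: the function names `count_gpu_lines` and `wait_for_gpus`, the default of 30 attempts, and the 10-second delay are all assumptions chosen for illustration.

```shell
# Hypothetical helpers; names and defaults are illustrative assumptions.

# Count lines in `kubectl describe nodes` output (read from stdin) that
# report a non-zero GPU quantity, e.g. "nvidia.com/gpu:  1".
# Each node usually reports this under both Capacity and Allocatable,
# so the result is a line count, not a node count.
count_gpu_lines() {
  grep -cE 'gpu:[[:space:]]+[1-9]' || true
}

# Poll until at least one GPU line shows up or the attempts run out.
#   $1 = number of attempts (default 30), $2 = seconds between attempts (default 10)
wait_for_gpus() {
  attempts="${1:-30}"
  delay="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    found="$(kubectl describe nodes | count_gpu_lines)"
    if [ "$found" -gt 0 ]; then
      echo "GPU resources detected"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Timed out waiting for GPU resources" >&2
  return 1
}
```

With these functions sourced into the shell, `wait_for_gpus 30 10 && kubectl logs tf-gpu` would only fetch the pod logs once the nodes report GPUs.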

Specs

  hooks:
    - execContainer:
        image: dcwangmit01/nvidia-device-plugin:0.1.0
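
To confirm that the hook above is present on the instance group, the image name can be pulled out of the instance group manifest. This is a rough sketch: `hook_images` is a hypothetical helper, and it assumes the manifest lists `image:` somewhere after each `execContainer:` key, as in the fragment above.

```shell
# Hypothetical helper: print the image of every execContainer hook
# found in a kops instance group manifest read from stdin.
hook_images() {
  awk '/execContainer:/ { in_hook = 1; next }
       in_hook && /image:/ { print $2; in_hook = 0 }'
}
```

A possible use, assuming the instance group is named gpu-nodes as in the cleanup step, is `kops get ig gpu-nodes -o yaml | hook_images`.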

Cleanup Tasks

Delete GPU Instance

kops delete ig gpu-nodes
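
The cleanup can also remove the validation pod and skip the interactive confirmation. The sketch below makes several assumptions: `cleanup_gpu` and the `DRY_RUN` variable are hypothetical names, the cluster is already selected (for example through `KOPS_CLUSTER_NAME` and `KOPS_STATE_STORE`), and the pod was created from gpu_setup/tf_pod.yml.

```shell
# Hypothetical cleanup helper. Set DRY_RUN to any non-empty value to
# print the commands instead of running them.
#   $1 = instance group name (default gpu-nodes)
cleanup_gpu() {
  ig_name="${1:-gpu-nodes}"
  run="${DRY_RUN:+echo}"
  # Remove the validation pod; a missing pod is not treated as an error.
  $run kubectl delete -f gpu_setup/tf_pod.yml --ignore-not-found
  # Delete the GPU instance group without the confirmation prompt.
  $run kops delete ig "$ig_name" --yes
}
```

Running `DRY_RUN=1 cleanup_gpu` first shows the two commands that would be executed, which is a cheap way to check the instance group name before anything is deleted.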

TODO

  • Test CUDA support for v10.2 and v9.1