added in vGPU testing scenarios

Added in negative SR-IOV GPU test
harvester · Apr 18, 2024 · 8be8248 · 8be8248
1 parent 790c494
commit 8be8248
Showing 1 changed file with 152 additions and 0 deletions.
diff --git a/docs/content/manual/advanced/addons/2764-vgpu.md b/docs/content/manual/advanced/addons/2764-vgpu.md
@@ -0,0 +1,152 @@
+---
+title: vGPU/SR-IOV GPU
+---
+
+
+* Related issues: [#1661](https://github.com/harvester/harvester/issues/2764) vGPU Support
+
+# Pre-requisite Enable PCI devices
+
+1. Create a harvester cluster in bare metal mode. Ensure one of the nodes has NIC separate from the management NIC
+1. Go to the management interface of the new cluster
+1. Go to Advanced -> PCI Devices
+1. Validate that the PCI devices aren't enabled
+1. Click the link to enable PCI devices
+1. Enable PCI devices in the linked addon page
+1. Wait for the status to change to `Deploy Successful`
+1. Navigate to the PCI devices page
+1. Validate that the PCI devices page is populated/populating with PCI devices
+
+# Pre-requisite Enable vGPU
+
+This can only be ran on a bare metal Harvester cluster that has an Nvidida card that support vGPU
+
+1. After the PCI devices are enabled navigate to the `nvidia-driver-toolkit` addon and enable it
+1. Wait for the status to change to `Deploy Successful`
+1. Edit the config for the `nvidia-driver-toolkit` from the addons page and set the driver location
+1. Navigate to SR-IOV GPU Devices
+1. Wait for the GPU to show up in the list
+1. Enable the GPU
+1. Wait for it to show as enabled and populate with the vGPU devices
+1. Navigate to vGPU devices
+1. Enable one of the vPGU devices and select a profile
+1. Validate that it now shows as enabled
+
+# Test Cases
+These tests were ran on Ubuntu focal KVM live images
+
+The setup for the VMs were as follows
+
+```sh
+wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
+sudo dpkg -i cuda-keyring_1.1-1_all.deb
+sudo apt-get update
+sudo apt-get -y install cuda nvidia-cuda-toolkit build-essential
+
+git clone https://github.com/nvidia/cuda-samples
+cd cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm
+make
+./cudaTensorCoreGemm
+```
+
+## Add one vGPU on VM creation
+
+1. Create a VM and select the vGPU in vGPU devices
+1. After the VM is created run the pre-reqs as outlined above
+1. Run the grid installer for the vGPU driver on the VM
+1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`
+
+### Expected Results
+
+1. The VM should create successfully
+1. The code should execute successfully
+
+## Add one vGPU after VM creation
+
+1. Create a VM and don't select the vGPU in vGPU devices
+1. After the VM is created run the pre-reqs as outlined above
+1. Edit the config of the VM and add the vGPU and select to restart
+1. Run the grid installer for the vGPU driver on the VM
+1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`
+
+### Expected Results
+
+1. The VM should create successfully
+1. The code should execute successfully
+
+## Remove vGPU from VM
+
+1. Create a VM and select the vGPU in vGPU devices
+1. After the VM is created run the pre-reqs as outlined above
+1. Run the grid installer for the vGPU driver on the VM
+1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`
+1. Edit the VM config and remove the vGPU and select to restart
+1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`
+
+### Expected Results
+
+1. The VM should create successfully
+1. The code should execute successfully
+1. After removal of the vGPU the code should report that there aren't any cuda compatible cards available.
+
+## Disable vGPU device that is in use
+This is to be run on a vGPU device that is currently assigned to a VM
+
+1. Disable vGPU device that is in use
+
+### Expected Results
+
+1. You should get an error that says that the vGPU is in use
+
+## Add two vGPUs to VM on creation
+There are some limitations to this and they are outlined [here](https://docs.harvesterhci.io/v1.3/advanced/vgpusupport/#attaching-multiple-vgpus)
+
+1. Create a VM and select two vGPUs in vGPU devices
+1. Edit the yaml of the VM to add in the following yaml to one of the vGPUs
+```yaml
+virtualGPUOptions:
+  display:
+     ramFB:
+       enabled: false
+```
+1. After the VM is created run the pre-reqs as outlined above
+1. Run the grid installer for the vGPU driver on the VM
+1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`
+
+### Expected Results
+
+1. The VM should create successfully
+1. The code should execute successfully
+
+## Negative try to provision a VM with an already in use vGPU
+
+The way that vGPUs work is in a pool of resources so the easiest way to test this is if you only have one vGPU device enabled
+
+1. Create a VM and try to select the vGPU in vGPU devices
+
+### Expected Results
+
+1. No vGPU device should show up in the dropdown
+
+## Negative try to enable more vGPUs than the card supports
+
+This is going to vary based on which card you have since the driver itself is loading what's available to the `/sys` tree
+
+1. Attempt to enable a vGPU device after the card is fully provisioned
+
+### Expected Results
+
+1. There should be no profiles available in the dropdown
+
+## Negative try to enable vGPU on card that doesn't support it
+
+This should be ran on a server that has an Nvidia GPU that doesn't support vGPU
+
+1. After the PCI devices are enabled navigate to the `nvidia-driver-toolkit` addon and enable it
+1. Wait for the status to change to `Deploy Successful`
+1. Edit the config for the `nvidia-driver-toolkit` from the addons page and set the driver location
+1. Navigate to SR-IOV GPU Devices
+
+### Expected Results
+
+1. The GPU should not be listed in SR-IOV GPU Devices