diff --git a/docs/content/manual/advanced/addons/2764-vgpu.md b/docs/content/manual/advanced/addons/2764-vgpu.md
new file mode 100644
index 000000000..65df82926
--- /dev/null
+++ b/docs/content/manual/advanced/addons/2764-vgpu.md
@@ -0,0 +1,152 @@
---
title: vGPU/SR-IOV GPU
---


* Related issues: [#2764](https://github.com/harvester/harvester/issues/2764) vGPU Support

# Pre-requisite: Enable PCI devices

1. Create a Harvester cluster in bare metal mode. Ensure one of the nodes has a NIC separate from the management NIC
1. Go to the management interface of the new cluster
1. Go to Advanced -> PCI Devices
1. Validate that PCI devices aren't enabled
1. Click the link to enable PCI devices
1. Enable PCI devices in the linked addon page
1. Wait for the status to change to `Deploy Successful`
1. Navigate to the PCI devices page
1. Validate that the PCI devices page is populated/populating with PCI devices

# Pre-requisite: Enable vGPU

This can only be run on a bare metal Harvester cluster that has an NVIDIA card that supports vGPU.

1. After PCI devices are enabled, navigate to the `nvidia-driver-toolkit` addon and enable it
1. Wait for the status to change to `Deploy Successful`
1. Edit the config for the `nvidia-driver-toolkit` from the addons page and set the driver location
1. Navigate to SR-IOV GPU Devices
1. Wait for the GPU to show up in the list
1. Enable the GPU
1. Wait for it to show as enabled and populate with the vGPU devices
1. Navigate to vGPU Devices
1. Enable one of the vGPU devices and select a profile
1. Validate that it now shows as enabled

# Test Cases
These tests were run on Ubuntu Focal KVM live images.

The setup for the VMs was as follows:

```sh
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda nvidia-cuda-toolkit build-essential

git clone https://github.com/nvidia/cuda-samples
cd cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm
make
./cudaTensorCoreGemm
```

## Add one vGPU on VM creation

1. Create a VM and select the vGPU in vGPU devices
1. After the VM is created, run the setup steps outlined above
1. Run the GRID installer for the vGPU driver on the VM
1. Run `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`

### Expected Results

1. The VM should create successfully
1. The code should execute successfully

## Add one vGPU after VM creation

1. Create a VM and don't select the vGPU in vGPU devices
1. After the VM is created, run the setup steps outlined above
1. Edit the config of the VM, add the vGPU, and select to restart
1. Run the GRID installer for the vGPU driver on the VM
1. Run `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`

### Expected Results

1. The VM should create successfully
1. The code should execute successfully

## Remove vGPU from VM

1. Create a VM and select the vGPU in vGPU devices
1. After the VM is created, run the setup steps outlined above
1. Run the GRID installer for the vGPU driver on the VM
1. Run `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`
1. Edit the VM config, remove the vGPU, and select to restart
1. Run `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`

### Expected Results

1. The VM should create successfully
1. The code should execute successfully
1. After the vGPU is removed, the code should report that there are no CUDA-compatible devices available
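As an optional check, the device state can be verified from inside the guest before re-running the sample. This is a minimal sketch, assuming the GRID driver and the compiled sample from the setup script above are already present in the VM:

```sh
# List the CUDA-capable devices visible to the guest; after the vGPU is
# removed this is expected to fail or report no devices.
nvidia-smi -L

# Re-run the sample and inspect its exit status; with no device present it
# is expected to exit non-zero alongside the error message.
./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm
echo "exit status: $?"
```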
## Disable vGPU device that is in use
This is to be run on a vGPU device that is currently assigned to a VM.

1. Disable the vGPU device that is in use

### Expected Results

1. You should get an error that says the vGPU is in use

## Add two vGPUs to VM on creation
There are some limitations to attaching multiple vGPUs; they are outlined [here](https://docs.harvesterhci.io/v1.3/advanced/vgpusupport/#attaching-multiple-vgpus).

1. Create a VM and select two vGPUs in vGPU devices
1. Edit the YAML of the VM and add the following to one of the vGPUs:
   ```yaml
   virtualGPUOptions:
     display:
       ramFB:
         enabled: false
   ```
1. After the VM is created, run the setup steps outlined above
1. Run the GRID installer for the vGPU driver on the VM
1. Run `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm`

### Expected Results

1. The VM should create successfully
1. The code should execute successfully

## Negative: try to provision a VM with an already in-use vGPU

vGPUs are allocated from a shared pool of resources, so the easiest way to test this is with only one vGPU device enabled.

1. Create a VM and try to select the vGPU in vGPU devices

### Expected Results

1. No vGPU device should show up in the dropdown

## Negative: try to enable more vGPUs than the card supports

This will vary based on which card you have, since the driver itself loads the available profiles into the `/sys` tree.

1. Attempt to enable a vGPU device after the card is fully provisioned

### Expected Results

1. There should be no profiles available in the dropdown

## Negative: try to enable vGPU on a card that doesn't support it

This should be run on a server that has an NVIDIA GPU that doesn't support vGPU.

1. After PCI devices are enabled, navigate to the `nvidia-driver-toolkit` addon and enable it
1. Wait for the status to change to `Deploy Successful`
1. Edit the config for the `nvidia-driver-toolkit` from the addons page and set the driver location
1. Navigate to SR-IOV GPU Devices

### Expected Results

1. The GPU should not be listed in SR-IOV GPU Devices
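As an optional cross-check of the UI, the discovered devices can also be inspected from the cluster. This is a minimal sketch, assuming `kubectl` access to the Harvester cluster and the `devices.harvesterhci.io` CRDs registered by the pcidevices controller; the exact resource names may differ between Harvester versions:

```sh
# List the SR-IOV GPU devices discovered by the pcidevices controller;
# a card without vGPU support is expected to be absent from this list.
kubectl get sriovgpudevices.devices.harvesterhci.io

# List the vGPU devices; nothing should reference the unsupported card.
kubectl get vgpudevices.devices.harvesterhci.io
```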