-
Notifications
You must be signed in to change notification settings - Fork 32
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing
1 changed file
with
152 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,152 @@ | ||
--- | ||
title: vGPU/SR-IOV GPU | ||
--- | ||
|
||
|
||
* Related issues: [#1661](https://github.com/harvester/harvester/issues/2764) vGPU Support | ||
|
||
# Pre-requisite Enable PCI devices | ||
|
||
1. Create a harvester cluster in bare metal mode. Ensure one of the nodes has NIC separate from the management NIC | ||
1. Go to the management interface of the new cluster | ||
1. Go to Advanced -> PCI Devices | ||
1. Validate that the PCI devices aren't enabled | ||
1. Click the link to enable PCI devices | ||
1. Enable PCI devices in the linked addon page | ||
1. Wait for the status to change to `Deploy Successful` | ||
1. Navigate to the PCI devices page | ||
1. Validate that the PCI devices page is populated/populating with PCI devices | ||
|
||
# Pre-requisite Enable vGPU | ||
|
||
This can only be ran on a bare metal Harvester cluster that has an Nvidida card that support vGPU | ||
|
||
1. After the PCI devices are enabled navigate to the `nvidia-driver-toolkit` addon and enable it | ||
1. Wait for the status to change to `Deploy Successful` | ||
1. Edit the config for the `nvidia-driver-toolkit` from the addons page and set the driver location | ||
1. Navigate to SR-IOV GPU Devices | ||
1. Wait for the GPU to show up in the list | ||
1. Enable the GPU | ||
1. Wait for it to show as enabled and populate with the vGPU devices | ||
1. Navigate to vGPU devices | ||
1. Enable one of the vPGU devices and select a profile | ||
1. Validate that it now shows as enabled | ||
|
||
# Test Cases | ||
These tests were ran on Ubuntu focal KVM live images | ||
|
||
The setup for the VMs were as follows | ||
|
||
```sh | ||
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb | ||
sudo dpkg -i cuda-keyring_1.1-1_all.deb | ||
sudo apt-get update | ||
sudo apt-get -y install cuda nvidia-cuda-toolkit build-essential | ||
|
||
git clone https://github.com/nvidia/cuda-samples | ||
cd cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm | ||
make | ||
./cudaTensorCoreGemm | ||
``` | ||
|
||
## Add one vGPU on VM creation | ||
|
||
1. Create a VM and select the vGPU in vGPU devices | ||
1. After the VM is created run the pre-reqs as outlined above | ||
1. Run the grid installer for the vGPU driver on the VM | ||
1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm` | ||
|
||
### Expected Results | ||
|
||
1. The VM should create successfully | ||
1. The code should execute successfully | ||
|
||
## Add one vGPU after VM creation | ||
|
||
1. Create a VM and don't select the vGPU in vGPU devices | ||
1. After the VM is created run the pre-reqs as outlined above | ||
1. Edit the config of the VM and add the vGPU and select to restart | ||
1. Run the grid installer for the vGPU driver on the VM | ||
1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm` | ||
|
||
### Expected Results | ||
|
||
1. The VM should create successfully | ||
1. The code should execute successfully | ||
|
||
## Remove vGPU from VM | ||
|
||
1. Create a VM and select the vGPU in vGPU devices | ||
1. After the VM is created run the pre-reqs as outlined above | ||
1. Run the grid installer for the vGPU driver on the VM | ||
1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm` | ||
1. Edit the VM config and remove the vGPU and select to restart | ||
1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm` | ||
|
||
### Expected Results | ||
|
||
1. The VM should create successfully | ||
1. The code should execute successfully | ||
1. After removal of the vGPU the code should report that there aren't any cuda compatible cards available. | ||
|
||
## Disable vGPU device that is in use | ||
This is to be run on a vGPU device that is currently assigned to a VM | ||
|
||
1. Disable vGPU device that is in use | ||
|
||
### Expected Results | ||
|
||
1. You should get an error that says that the vGPU is in use | ||
|
||
## Add two vGPUs to VM on creation | ||
There are some limitations to this and they are outlined [here](https://docs.harvesterhci.io/v1.3/advanced/vgpusupport/#attaching-multiple-vgpus) | ||
|
||
1. Create a VM and select two vGPUs in vGPU devices | ||
1. Edit the yaml of the VM to add in the following yaml to one of the vGPUs | ||
```yaml | ||
virtualGPUOptions: | ||
display: | ||
ramFB: | ||
enabled: false | ||
``` | ||
1. After the VM is created run the pre-reqs as outlined above | ||
1. Run the grid installer for the vGPU driver on the VM | ||
1. Run the `./cuda-samples/Samples/3_CUDA_Features/cudaTensorCoreGemm` | ||
|
||
### Expected Results | ||
|
||
1. The VM should create successfully | ||
1. The code should execute successfully | ||
|
||
## Negative try to provision a VM with an already in use vGPU | ||
|
||
The way that vGPUs work is in a pool of resources so the easiest way to test this is if you only have one vGPU device enabled | ||
|
||
1. Create a VM and try to select the vGPU in vGPU devices | ||
|
||
### Expected Results | ||
|
||
1. No vGPU device should show up in the dropdown | ||
|
||
## Negative try to enable more vGPUs than the card supports | ||
|
||
This is going to vary based on which card you have since the driver itself is loading what's available to the `/sys` tree | ||
|
||
1. Attempt to enable a vGPU device after the card is fully provisioned | ||
|
||
### Expected Results | ||
|
||
1. There should be no profiles available in the dropdown | ||
|
||
## Negative try to enable vGPU on card that doesn't support it | ||
|
||
This should be ran on a server that has an Nvidia GPU that doesn't support vGPU | ||
|
||
1. After the PCI devices are enabled navigate to the `nvidia-driver-toolkit` addon and enable it | ||
1. Wait for the status to change to `Deploy Successful` | ||
1. Edit the config for the `nvidia-driver-toolkit` from the addons page and set the driver location | ||
1. Navigate to SR-IOV GPU Devices | ||
|
||
### Expected Results | ||
|
||
1. The GPU should not be listed in SR-IOV GPU Devices |