Configure NVIDIA GPU on ESXi to be used by VMs
How Is NVIDIA vGPU Software Used?
NVIDIA vGPU software can be used in several ways:
1. NVIDIA vGPU: NVIDIA Virtual GPU (vGPU) enables multiple virtual machines (VMs) to have simultaneous, direct access to a single physical GPU, using the same NVIDIA graphics drivers that are deployed on non-virtualized operating systems. By doing this, NVIDIA vGPU provides VMs with unparalleled graphics performance, compute performance, and application compatibility, together with the cost-effectiveness and scalability brought about by sharing a GPU among multiple workloads.
2. GPU Pass-Through: In GPU pass-through mode, an entire physical GPU is directly assigned to one VM, bypassing the NVIDIA Virtual GPU Manager. In this mode of operation, the GPU is accessed exclusively by the NVIDIA driver running in the VM to which it is assigned. The GPU is not shared among VMs.
3. Bare-Metal Deployment: In a bare-metal deployment, you can use NVIDIA vGPU software graphics drivers with Quadro vDWS and GRID Virtual Applications licenses to deliver remote virtual desktops and applications. If you intend to use Tesla boards without a hypervisor for this purpose, use NVIDIA vGPU software graphics drivers, not other NVIDIA drivers.
In this post, I am going to show you the first option, as some of our customers need VMs that can run highly compute-intensive calculations, which are better suited to a GPU than a CPU. For this, we have some ESXi hosts with a GPU card, and we are going to serve that GPU power as vGPUs for their virtual machines. You can find the whole procedure below, from ESXi host configuration to VM configuration.
Installing the NVIDIA vGPU Manager on the Hypervisor (ESXi)
The NVIDIA vGPU Manager is software that runs on the hypervisor.
- Set your ESXi host to maintenance mode.
- Log in to the ESXi host(s).
- Install the VIB package:
esxcli software vib install -d /vmfs/volumes/5f749922-b768d2e5-1a9c-1125b5dea108/NVIDIA-VMware-418.165.01-1OEM.670.0.0.81699
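Once the VIB is installed, you can confirm it is registered before rebooting. A minimal check, assuming the standard esxcli syntax, is:
# List installed VIBs and filter for the NVIDIA vGPU Manager
esxcli software vib list | grep -i NVIDIA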
Configure the Graphics Card in vCenter
In vCenter, set the graphics type of the card from Shared to Shared Direct (Figure-1).
Figure-1 Shared to Shared Direct
Once you have finished the configuration, you have to reboot the ESXi host.
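If you prefer to do this from the ESXi shell, a minimal sketch (assuming the host is still in maintenance mode) is:
# Reboot the host so the vGPU Manager module loads
esxcli system shutdown reboot -r "NVIDIA vGPU Manager installation"
# After the host comes back up, take it out of maintenance mode
esxcli system maintenanceMode set --enable false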
Verification
[root@GPUHOST:~] nvidia-smi
Fri Oct 16 20:18:46 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.165.01 Driver Version: 418.165.01 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:18:00.0 Off | Off |
| N/A 36C P8 16W / 70W | 86MiB / 16383MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@GPUHOST:~] vmkload_mod -l | grep nvidia
nvidia 13 17680
This shows that the vGPU software was installed successfully on the ESXi host.
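Depending on the vGPU Manager version, you should also be able to query vGPU-specific information directly from the host; treat this as an optional extra check:
# List vGPU-capable GPUs and any currently running vGPUs on the host
nvidia-smi vgpu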
After configuring the vGPU Manager on the hypervisor, we can test it by creating a virtual machine and using the CUDA API to run code on the vGPU. To add a vGPU, we need to add a Shared PCI Device to the VM and choose the physical GPU and vGPU profile (see Figure-2).
Figure-2 Adding Shared PCI Device and Choosing vGPU Profile.
To test the vGPU, we need to install the NVIDIA driver binary in the virtual machine. You can download the bundle from the NVIDIA website.
Installing the Driver
To install the driver, the kernel headers and development packages are required, so install the necessary packages for your GNU/Linux distribution. In my case it is the latest CentOS 7.
Installing Development Tools and Kernel headers
yum group install "Development Tools"
yum install kernel-devel-$(uname -r)
Installing NVIDIA vGPU Driver
./NVIDIA-Linux-x86_64-418.165.01-grid.run --kernel-source-path /usr/src/kernels/$(uname -r)
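After the installer finishes, a quick sanity check inside the guest is to confirm that the NVIDIA kernel module is loaded:
# The nvidia module should appear once the vGPU driver is installed
lsmod | grep nvidia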
NVIDIA License Server Configuration
NVIDIA vGPU software is a licensed product. Licensed vGPU functionalities are activated during guest OS boot by the acquisition of a software license served over the network from an NVIDIA vGPU software license server. The license is returned to the license server when the guest OS shuts down.
This post does not cover how to install the NVIDIA License Server, so I will skip that and only show how to configure the client to connect to the license server.
[root@gputest nvidia]# cd /etc/nvidia
[root@gputest nvidia]# cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
Then edit /etc/nvidia/gridd.conf and set the license server details:
ServerAddress= 10.10.10.10
ServerPort= 7070
BackupServerAddress= 10.10.10.11
BackupServerPort= 7070
Restart the nvidia-gridd.service so the new configuration is picked up.
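On CentOS 7 this can be done with systemd, for example:
systemctl restart nvidia-gridd.service
systemctl status nvidia-gridd.service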
You should see messages similar to the following in /var/log/messages:
Oct 12 11:34:12 gputest nvidia-gridd: Acquiring license. (Info: http://10.10.10.10:7070/request; Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0)
Oct 12 11:34:12 gputest nvidia-gridd: Calling load_byte_array(tra)
Oct 12 11:34:13 gputest nvidia-gridd: License acquired successfully. (Info: http://10.10.10.10:7070/request; Quadro-Virtual-DWS,5.0)
If everything is fine, you should be able to see the driver version by issuing the nvidia-smi command:
[root@gputest nvidia]# nvidia-smi
Mon Oct 12 11:35:18 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.165.01 Driver Version: 418.165.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID T4-8Q On | 00000000:02:01.0 Off | N/A |
| N/A N/A P8 N/A / N/A | 528MiB / 8192MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Installing CUDA
To install CUDA successfully, you must download a CUDA version that is compatible with the vGPU driver. As you can see above, the driver reports CUDA version 10.1, so we must install that version of CUDA (see Figure-3).
Figure-3 How to download correct CUDA version for OS
Run the CUDA installer binary:
sh cuda_10.1.243_418.87.00_linux.run
Do not install the driver, as we already installed it. Un-tick it in the installer, otherwise you may get an error (Figure-4).
Figure-4 Do not choose driver as we already installed it.
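If you prefer a non-interactive installation, the runfile installer can also be driven from the command line so that only the toolkit and samples are installed and the bundled driver is skipped. The flags below are a sketch; check sh cuda_10.1.243_418.87.00_linux.run --help on your version first:
sh cuda_10.1.243_418.87.00_linux.run --silent --toolkit --samples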
If everything goes successfully, you should get a message similar to the one below. Then configure the PATH and LD_LIBRARY_PATH environment variables accordingly; an example follows the installer output.
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-10.1/
Samples: Installed in /root/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
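A minimal way to set these variables, assuming the default installation path above, is to append the following to ~/.bashrc (or /etc/profile.d/cuda.sh for all users):
# Make the CUDA 10.1 toolkit binaries and libraries visible to the shell
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}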
Verify
You can run the command below on the VM to verify your CUDA installation.
[root@gputest ~]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
To test your GPU, go to /usr/local/cuda-10.1/samples and build one of the samples, for example vectorAdd_nvrtc:
[root@gputest vectorAdd_nvrtc]# make
g++ -I../../common/inc -I/usr/local/cuda/include -o vectorAdd.o -c vectorAdd.cpp
g++ -o vectorAdd_nvrtc vectorAdd.o -L/usr/local/cuda/lib64/stubs -lcuda -lnvrtc
mkdir -p ../../bin/x86_64/linux/release
cp vectorAdd_nvrtc ../../bin/x86_64/linux/release
[root@gputest vectorAdd_nvrtc]#
After compiling any of the samples as shown above, you can run the resulting binary to check that the GPU works.
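As the make output shows, the binary is copied to the samples' bin directory, so you can run it from there, for example:
cd ../../bin/x86_64/linux/release
./vectorAdd_nvrtc
If the vGPU and CUDA toolkit are working, the sample should complete without errors.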
Important
If you are getting an error like the one below while creating or powering on a virtual machine with a vGPU:
NVIDIA GRID – “Could not initialize plugin ‘/usr/lib64/vmware/plugin/libnvidia-vgx.so’ for vGPU “profile_name””
Solution
You need to disable ECC memory; otherwise your VMs will not power on and you will be left with this error. To disable ECC memory:
- Set your ESXi host to maintenance mode.
- Start an SSH session to your ESXi host and apply the following commands:
/etc/init.d/xorg stop
nv-hostengine -t
nv-hostengine -d
/etc/init.d/xorg start
nvidia-smi -e 0
reboot
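After the host has rebooted, you can confirm that ECC is now disabled before powering on the VM; a quick check is:
# "Ecc Mode" should report Current: Disabled
nvidia-smi -q | grep -i -A 2 "ecc mode"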
Disclaimer
The information contained in this website is for general information purposes only. The information is provided by me and, while I endeavour to keep the information up to date and correct, I make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose. Any reliance you place on such information is therefore strictly at your own risk.