CUDA based ‐ High Performance Computing ‐ LLM Training ‐ Ground to GCP Cloud Hybrid

CUDA based HPC - LLM Training - Ground to GCP Cloud migration

Tracking - https://github.com/ObrienlabsDev/blog/issues/1 and https://github.com/ObrienlabsDev/blog/issues/6

Presentation

I am planning a talk on CUDA on Google Cloud (GCP) for late 2023. If you would like to attend, let me know at fmichaelobrien at google.com. I will be posting the official 1-2h meeting details very late in 2023, after Google Next 23.

Abstract

HPC and GPU computing on the streaming processors in NVIDIA GPUs can be done on custom equipment such as a dual MSI RTX 4090 workstation at 58 x 2 TFlops, running at over 1000W with access to 2 sets of 24GB VRAM and 32k (32768) cores. If additional capacity or scaling is required, cloud-based GPUs on GCP such as the Ada Lovelace L4 or L40 are advisable until the Hopper-based H100 is available in late 2023 (public preview as of 1 Sept 2023). This article describes GPU onboarding to GCP from a development and throughput perspective, working from the ground up through C++ based CUDA and the layers of ML/deep-learning libraries towards use cases such as LLM training or real-time video entity extraction.

We will review the use of CUDA executables both in NVIDIA workstation VMs on GCP and via Kubernetes containers in Kubeflow.

Nvidia and Google Cloud

NVIDIA represents the new supercomputer manufacturer. When CUDA was introduced by NVIDIA, essentially opening up an exponential multiplier of streaming GPU processors, we could not have imagined we would be at 16k (16384) cores per chip in 2023. A dual 4090 represents 32k (32768) parallel processors (close to the 64k of the original Connection Machine of 1988). NVIDIA has flipped the usual arrangement, where the GPU sits beside the CPU as a co-processor, effectively rendering the CPU partially irrelevant (even a 13900KS with 32 threads). The HPC is now effectively a single GPU (whether by NVLink for data center GPUs or PCIe 5.0 for consumer CPUs), where the CPU is the coprocessor that feeds the GPU.

CUDA

Disclaimer: I am new to CUDA. I last worked with GPUs under DirectX 5.0 in 1999, so I am jumping directly to Compute Capability 8.9 under Ada Lovelace.

CUDA on GCP

Turn off the organization policies compute.vmExternalIpAccess and compute.requireShieldedVm before deploying - see https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/issues/426 and https://github.com/GoogleCloudPlatform/pbmm-on-gcp-onboarding/issues/252 for details
Expect lower performance than a liquid-cooled, overclocked NVIDIA RTX-4090 (16384 cores): about 7700 ms on the GCP L4 vs 2200 ms on the RTX-4090 for the same PI calculation.

Use L4 (without NVlink) at US$2/h instead of A100's for now https://console.cloud.google.com/marketplace/product/nvidia/nvidia-rtx-virtual-workstation-windows-server-2022

Connect via RDP (pw change) on the default open port - us-central1 (see also sa) on a g2-standard-8 with an NVIDIA L4 GPU

CUDA on AWS

CUDA on Prem

Get yourself a good machine for local development before offloading to GCP for compute tasks. An example from 2015 is an Intel i7-5820K 6 core Haswell-E with an NVIDIA GTX 970 or the newer RTX 2070. My current machine build is based on the Intel i9-13900K 8/16 core and an NVIDIA RTX 4090 MSI Suprim Liquid-X - see Appendix D: Build your local HPC

Use Cases

see https://github.com/ObrienlabsDev/blog/wiki/CUDA-based-%E2%80%90-High-Performance-Computing-%E2%80%90-LLM-Training-%E2%80%90-Ground-to-GCP-Cloud-Hybrid#gcp-hpc-vs-on-prem-hpc---llm-from-scratch-as-part-of-preparing-for-generative-ai

LLM - Large Language Model Training

Drone Streaming Entity Extraction Machine Learning

Collatz

Design Issues

Appendix A: CUDA on Google Cloud GPU based VMs

Types of GPUs available on GCP

Get additional GPU quota

if you get "Failed to start nvidia-rtx-virtual-workstation-window-4-vm: Operation type [start] failed with message "Quota 'NVIDIA_L4_GPUS' exceeded. Limit: 1.0 in region us-central1.""

Example: ask for 2 GPU cards, up from the default 1, to test aggregation

Turnaround was about 2 min.

You only need to ask for NVIDIA_T4_VWS_GPUS and the bot will auto add GPUS_ALL_REGIONS

Your quota request for eventstream-dev has been approved and your project quota has been adjusted according to the following requested limits:

+--------------------+--------------------+-------------+-----------------+----------------+
| NAME               | DIMENSIONS         | REGION      | REQUESTED LIMIT | APPROVED LIMIT |
+--------------------+--------------------+-------------+-----------------+----------------+
| GPUS_ALL_REGIONS   |                    | GLOBAL      | 2               | 2              |
|                    |                    |             |                 |                |
| NVIDIA_T4_VWS_GPUS | region=us-central1 | us-central1 | 2               | 2              |
+--------------------+--------------------+-------------+-----------------+----------------+

and NVIDIA_L4_GPUS

Create your own Nvidia VM

Use an official Nvidia Optimized VM from the marketplace

New NVIDIA HPC SDK GPU-Optimized Image deployment

using https://console.cloud.google.com/marketplace/vm/config/nvidia-ngc-public/nvidia-gpu-cloud-hpc-sdk-image

Create a project

Enable API

Compute Engine API
Cloud Deployment Manager V2 API
Cloud Runtime Configuration API

Increase Quota

Change NVIDIA A100 GPUs - us-central1 from 0 to 1

Change GPUs (all regions) from 0 to 1

2 of 3 Quotas

5th - switched billing account with payment history

Change NVIDIA A100 GPUs - us-central1 from 0 to 1

"Unfortunately, we are unable to grant you additional quota at this time. If this is a new project please wait 48h until you resubmit the request or until your Billing account has additional history."

Switching to another org with 1+ year of payment history gave the same result - an issue with A100 availability

Switching GPU to V100 or T4

port 22 only

nvidia-gpu-cloud-hpc-sdk-image has resource level errors
nvidia-gpu-cloud-hpc-sdk-image-1-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"ZONE_RESOURCE_POOL_EXHAUSTED","ResourceErrorMessage":"The zone 'projects/cuda-obs/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}

same in singapore
nvidia-gpu-cloud-hpc-sdk-image-1-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"ZONE_RESOURCE_POOL_EXHAUSTED","ResourceErrorMessage":"The zone 'projects/cuda-obs/zones/asia-southeast1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}

southamerica-east1-c has capacity

https://console.cloud.google.com/marketplace/product/nvidia/nvidia-rtx-virtual-workstation-windows-server-2022?project=cuda-old

Deployed OK

Copy the CUDA executable example to the GCP VM

GCP

PS C:\_cuda> .\gpusum.exe 1000000000 10000
gpu sum = 1.9999998123, steps 1000000000 terms 10000 threads 512 time 7769.056 ms
PS C:\_cuda>

RTX 4090

micha@13900a MINGW64 /c/wse_github_vs/RichardAns/CUDA-Programs/Chapter01/gpusum/x64/Release (main)
$ ./gpusum.exe 1000000000 10000
gpu sum = 1.9999996113, steps 1000000000 terms 10000 threads 512 time 2233.573 ms
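
To put the two runs above side by side, the reported times can be pulled out of the gpusum output lines and compared. This is only a minimal sketch: the two strings are copied from the runs above, and the helper name `parse_gpusum_ms` is an illustrative assumption, not part of the gpusum sample.

```python
import re

# Output lines copied verbatim from the GCP L4 and RTX 4090 runs above.
GCP_L4 = "gpu sum = 1.9999998123, steps 1000000000 terms 10000 threads 512 time 7769.056 ms"
RTX_4090 = "gpu sum = 1.9999996113, steps 1000000000 terms 10000 threads 512 time 2233.573 ms"


def parse_gpusum_ms(line: str) -> float:
    """Return the elapsed time in milliseconds from one gpusum output line."""
    match = re.search(r"time\s+([0-9.]+)\s+ms", line)
    if match is None:
        raise ValueError(f"no timing found in: {line!r}")
    return float(match.group(1))


l4_ms = parse_gpusum_ms(GCP_L4)
rtx_ms = parse_gpusum_ms(RTX_4090)
ratio = l4_ms / rtx_ms  # how many times longer the L4 run took

print(f"L4: {l4_ms:.3f} ms, RTX 4090: {rtx_ms:.3f} ms, ratio: {ratio:.2f}x")
```

On these two measurements the single L4 takes roughly 3.5 times as long as the local RTX 4090 for the same arguments.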

Image the VM

gcloud beta compute machine-images create nvidia-rtx-virtual-workstation-window-4-vm --project=cuda-old --description=nvidia-rtx-virtual-workstation-window-4-vm --source-instance=nvidia-rtx-virtual-workstation-window-4-vm --source-instance-zone=us-central1-a --storage-location=us

Note: you will not always be able to get quota above 1 GPU per region without involving your field sales rep. In this case spread your GPUs over separate regions (assuming you are not moving traffic between the VMs hosting the GPUs).

Try alternative requests for quota: batch an entire set of regions and you may get approved, as opposed to making specific requests. As you can see below, only us-east1 has capacity limits; the rest of us-south, us-west and us-east are OK.

Your quota request for cuda-old has been partially approved and your project quota has been adjusted according to the following requested limits:

+----------------+------------------+-----------+-----------------+----------------+
| NAME           | DIMENSIONS       | REGION    | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+------------------+-----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4  | us-east4  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-east5  | us-east5  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-south1 | us-south1 | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west1  | us-west1  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west2  | us-west2  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west3  | us-west3  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west4  | us-west4  | 2               | 2              |
+----------------+------------------+-----------+-----------------+----------------+

Unfortunately, we were unable to grant your below quota request(s):

+----------------+-----------------+----------+
| NAME           | DIMENSIONS      | REGION   |
+----------------+-----------------+----------+
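| NVIDIA_L4_GPUS | region=us-east1 | us-east1 |
+----------------+-----------------+----------+

Ask for 4 (note there is a lag on the quota screen: it will say 1 instead of the approved 2 for up to 10 min).

+----------------+-----------------+----------+-----------------+----------------+
| NAME           | DIMENSIONS      | REGION   | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+-----------------+----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4 | us-east4 |               4 |              4 |
+----------------+-----------------+----------+-----------------+----------------+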

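In this case you will also need to manually increase the global quota from the previous 2.

+------------------+------------+--------+-----------------+----------------+
| NAME             | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+------------+--------+-----------------+----------------+
| GPUS_ALL_REGIONS |            | GLOBAL |               6 |              4 |
+------------------+------------+--------+-----------------+----------------+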

you may run into VM capacity issues in recently working regions

A g2-standard-8 VM instance is currently unavailable in the us-east1-b zone

we can end up with the following after several 60 quota request turnarounds.

An alternative is to sign up for support (Basic is fine) - and raise a P3 support ticket

Another option is to switch billing accounts to one of your organizations that has more spend history - approvals will be easier - then switch the Billing ID back.

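+----------------+-----------------+----------+-----------------+----------------+
| NAME           | DIMENSIONS      | REGION   | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+-----------------+----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4 | us-east4 |               4 |              4 |
+----------------+-----------------+----------+-----------------+----------------+

+------------------+------------+--------+-----------------+----------------+
| NAME             | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+------------+--------+-----------------+----------------+
| GPUS_ALL_REGIONS |            | GLOBAL |               4 |              4 |
+------------------+------------+--------+-----------------+----------------+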

Design Issues

DI 20220807-1: Moving millions of small images between file systems

The default is to keep fewer than 1M files per directory and to use NVMe drives on one or both sides of the copy, or 10Gbit Ethernet
https://colab.research.google.com/github/christianmerkwirth/colabs/blob/master/Understanding_Randomization_in_TF_Datasets.ipynb

FinOps

Daily Usage Scenarios

0827-1157 3.5 h - getting cost tomorrow single GPU - 1739-1842 dual GPU total $4.75/h

i7-5820K + GTX-970 (2015)
i7-5820K + RTX-2070 (2019)
i9-13900K + RTX-4090 (2023)

Monthly Usage Scenarios

Sustained Use

Dual L4 GPU

https://console.cloud.google.com/marketplace/product/nvidia/nvidia-rtx-virtual-workstation-windows-server-2022

Check GPU status

PS C:\Users\michael> nvidia-smi
Sun Aug 20 12:40:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.25                 Driver Version: 536.25       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                    WDDM  | 00000000:00:03.0 Off |                    0 |
| N/A   43C    P8              13W /  72W |    126MiB / 23034MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L4                    WDDM  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8              13W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

4 GPUs
PS C:\Users\michael> nvidia-smi
Sun Aug 20 13:17:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.25                 Driver Version: 536.25       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                    WDDM  | 00000000:00:03.0 Off |                    0 |
| N/A   52C    P8              14W /  72W |    143MiB / 23034MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L4                    WDDM  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8              12W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA L4                    WDDM  | 00000000:00:05.0 Off |                    0 |
| N/A   46C    P8              12W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA L4                    WDDM  | 00000000:00:06.0 Off |                    0 |
| N/A   50C    P8              13W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

There is no NVLink on the L4s, but we can still run CUDA jobs across all GPUs using the PCIe bus

PS C:\Users\michael> nvidia-smi nvlink --status -i 0
PS C:\Users\michael>  nvidia-smi nvlink -g 0 -i 0

Reservations

Appendix D: Build your local HPC

CSPs like GCP provide scalable on-demand GPUs like the A100. A properly configured local system can also be used for local development as an adjacent dev system to the cloud. As of mid 2023 the top consumer GPU from NVIDIA is the 4090. For optimal performance purchase the MSI Suprim Liquid-X, which is factory overclocked and contains an integrated 240mm AIO liquid cooler.

NVIDIA Quadro RTX GPUs for professional workstations

The NVIDIA Quadro (now RTX) family of GPUs benefits from the following additional capabilities above consumer grade GeForce GPUs, although at a higher cost.

Stability and Durability under sustained load due to additional binning of select chips that have undergone additional testing by ISVs for TDP profiles.
Improved performance under double precision floating point operations (for cases where deep learning benefits from not reducing precision)
ECC memory to reduce errors and crashes
Commercial Scientific/Design software mandates ECC memory
Physical profiles to add single/dual slot closer placement
Multi-GPU support up to the Ampere generation via 112 GBps NVLink - for shared memory without using the PCIe x16 bus

NVIDIA Quadro RTX A4000

The RTX-A4000 does not include the 112GBps NVLink found on the A4500 and above. Note: the Ada generation of the A4500 and above no longer has NVLink and relies on the PCIe bus, which runs at about 2GBps per lane, up to 32GBps in each direction for PCIe 4.0 x16. PCIe 5.0 doubles this to 63GBps, which is about half the speed of NVLink, but only if we run the RTX cards on Xeon boards. On Z790 boards the cards are split into x8 each, yielding 16GBps and in reality 8GBps.
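
The per-lane and x16 figures above follow from the PCIe signalling rates. The sketch below works them out under stated assumptions: the per-lane transfer rates (16 GT/s for PCIe 4.0, 32 GT/s for 5.0) and the 128b/130b line encoding are the published theoretical values, the function name `pcie_gb_per_s` is illustrative, and protocol overhead (which lowers real throughput further) is ignored.

```python
# Per-lane signalling rate in GT/s for each PCIe generation (theoretical values).
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}
ENCODING = 128 / 130  # 128b/130b line encoding used by PCIe 3.0, 4.0 and 5.0


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical bandwidth in GB/s for one direction, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


for gen, lanes in [(4, 1), (4, 16), (5, 16), (4, 8)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_gb_per_s(gen, lanes):.1f} GB/s per direction")
```

This gives about 2 GB/s for a single PCIe 4.0 lane, 31.5 GB/s for x16, 63 GB/s for PCIe 5.0 x16, and just under 16 GB/s when a PCIe 4.0 slot is split to x8.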

RTX A4000 workstation single slot placement

RTX A4000 Ampere generation (GA104) under GPU benchmark load = 11 TFlops peak (about 20% of an RTX-4090) - https://github.com/ObrienlabsDev/blog/wiki/High-Performance-Computing#appendix-g-benchmark-quatro-rtx-a4000-workstation-gpu---2021---ampere---ga104

NVIDIA Quadro RTX A4500 with NVLink

The RTX-A4500 includes 112GBps NVLink

RTX-A4500 Ampere generation professional workstation cards

2023: NVIDIA RTX-4090 MSI Suprim Liquid-X 24GB 16384 streaming cores on AIO Liquid Cooled Intel i9-13900KS 8/16 p/e core 128GB

Dual NVIDIA RTX-4090 MSI Suprim Liquid-X with 32768 streaming cores

Running a dual 4090 Ada Lovelace architecture GPU setup will require the highest end components

RTX-4090 Ada generation consumer cards

Power supply must be 1600W for two reasons. Each 4090 card requires 4 separate PCIe power connectors; combine this with 2 CPU power connectors to the motherboard and you require 10 8-pin power outputs. Most 1500W supplies only have 9. The second reason is that a power supply is most efficient running at 50% load. As you will be running up to 11A or close to 1400W, you need to have some room. Running a 3rd 4090 is infeasible as it would overload the 1800W fuse on a single 15A line.
Cost $10100 ($1000 CPU + $6000 2 x GPU + $800 motherboard + $800 192G RAM + $500 AIO + $600 2 x NVMe SSD + $200 7000 case for 2 4090 coolers + $200 OS) - amortize this over 4 years to get $210/month
Electricity cost: $0.15 to $0.03 per kWh. Double this for debt retirement... and an average of $0.15 per kWh or $110/month for a max load dual 4090
Total cost of dual 4090 on-prem workstation is $320/month. Compare this to a dual GCP L4 which has the performance of an RTX-4060 Ti but the memory (24G) of an RTX-4090 running
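
The monthly figure above can be reproduced from the parts list. The sketch below is a rough check, not a full FinOps model: the part prices and the $110/month electricity figure are taken from the list above, while using the $4.75/h dual GPU rate from the FinOps notes as the cloud comparison point is an assumption, and the dual L4 is not performance-equivalent to a dual 4090.

```python
# Part prices in USD, as listed above.
PARTS_USD = {
    "CPU": 1000,
    "2 x GPU": 6000,
    "motherboard": 800,
    "192G RAM": 800,
    "AIO": 500,
    "2 x NVMe SSD": 600,
    "case": 200,
    "OS": 200,
}
AMORTIZATION_MONTHS = 4 * 12          # amortize over 4 years
ELECTRICITY_USD_PER_MONTH = 110       # max load dual 4090, from the list above
CLOUD_DUAL_GPU_USD_PER_HOUR = 4.75    # assumption: dual GPU rate from the FinOps notes

hardware_total = sum(PARTS_USD.values())
hardware_monthly = hardware_total / AMORTIZATION_MONTHS
on_prem_monthly = hardware_monthly + ELECTRICITY_USD_PER_MONTH
# Cloud hours per month that cost the same as the on-prem workstation.
break_even_hours = on_prem_monthly / CLOUD_DUAL_GPU_USD_PER_HOUR

print(f"hardware total:   ${hardware_total}")
print(f"hardware monthly: ${hardware_monthly:.0f}")
print(f"on-prem monthly:  ${on_prem_monthly:.0f}")
print(f"break-even:       {break_even_hours:.0f} cloud hours/month")
```

Under these assumptions the workstation costs about the same as 67 hours per month of the dual GPU cloud VM.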

Runtime Metrics

CPU Power: Separate 340 watts peak
GPU Power: Separate 450 watts peak
CPU+GPU Power: Combined (CPU under 80% utilization) 720 watts (6A x 120V) peak

Power equation, measured with a clamp meter reading the magnetic field around the split AC line: W = V x A, or watts = (120V)(12A) = 1440 watts. This is getting close to the 1800W limit on the 15A fuse line, with everything else like monitors on a separate 15A line.
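
The headroom on the circuit can be checked with the same W = V x A relation. This is a small sketch using only figures from the text (120V line, 15A circuit, 12A measured draw, 450W GPU peak); the variable and function names are illustrative.

```python
LINE_VOLTS = 120.0      # line voltage used in the text
BREAKER_AMPS = 15.0     # single 15A circuit
GPU_PEAK_WATTS = 450.0  # per-GPU peak from the runtime metrics above


def watts(amps: float, volts: float = LINE_VOLTS) -> float:
    """Power in watts for a given current reading: W = V x A."""
    return volts * amps


circuit_limit = watts(BREAKER_AMPS)          # limit of the 15A line
dual_gpu_load = watts(12.0)                  # clamp meter reading under load
headroom = circuit_limit - dual_gpu_load     # what is left on the circuit
third_gpu_fits = headroom >= GPU_PEAK_WATTS  # could another 450W card be added?

print(f"limit {circuit_limit:.0f} W, load {dual_gpu_load:.0f} W, headroom {headroom:.0f} W")
print(f"third 4090 fits on this circuit: {third_gpu_fits}")
```

The remaining headroom is smaller than one more GPU at peak draw, which is why a third 4090 on the same 15A line is ruled out.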

Thermals

Single NVIDIA RTX-4090

The i9-13900KS with its stock overclock to 6.0 GHz has been on back order in Canada for 6 weeks so far. Using the non-binned i9-13900K with AI overclocking on the Z790 board resulted in 6.1 GHz, which is OK for now.

Under Intel Extreme Tuning and Furmark Stress Testing

FLIR Heatmap (CPU incoming radiator on top, GPU outgoing radiator on the side)

GPU running at 22% at 450 frames/sec, CPU at 100%

Cost $7000 ($1000 CPU + $3000 GPU)
CPU Power: Separate 340 watts peak
GPU Power: Separate 450 watts peak
CPU+GPU Power: Combined (CPU under 80% utilization) 720 watts (6A x 120V) peak

Compute Capability: 8.9 (the highest consumer level as of 2023; the data center H100 is 9.0)

2020: Mobile NVIDIA Quadro RTX-5000 on Intel Xeon W 10855M 128GB https://thinkstation-specs.com/thinkpad/p17/

2015: NVIDIA GTX-970 4GB or RTX-2070 8GB on AIO Liquid Cooled Intel i7-5820K 6 core Haswell-E 64GB

Links

Appendix E: Setup a workstation with NVidia CUDA

CUDA on Windows

Install Visual Studio Community 2022

download VisualStudioSetup.exe - https://visualstudio.microsoft.com/downloads/

Get an NVidia developer account

Install NVidia CUDA and NSight Software

install CUDA after installing Microsoft Visual Studio - to get integration working
cuda_12.2.0_536.25_windows.exe from https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=Server2022&target_type=exe_local
GeForce_Experience_v3.27.0.112.exe
nsight_visual_studio_edition-windows-x86_64-2023.1.1.23100_32709422.msi

Verify Configuration

Fix CUDA Configuration: CUDA.props not found

Fix: in VS verify Project | Build Customizations | is set to a later CUDA version if switching machines (12.1 to 12.2)

Error | MSB4019 | The imported project "C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.1.props" was not found. Confirm that the expression in the Import declaration "C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\\BuildCustomizations\CUDA 12.1.props" is correct, and that the file exists on disk. | gpusum | C:\wse_github_vs\RichardAns\CUDA-Programs\Chapter01\gpusum\gpusum_vs2022.vcxproj | 35 |

VS integration missing for Windows SDK

Warning | MSB8003 | The WindowsSDKDir property is not defined. Some build tools may not be found. | gpusum | C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets | 513 |

Fix: VS | tools | get tools... | individual components | windows 11 SDK (latest)

Check number of GPUs


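	// Needs <cstdio> and <cuda_runtime.h>; cudaGetDeviceCount returns cudaSuccess on success.
	int ngpu = 0;
	cudaError_t err = cudaGetDeviceCount(&ngpu);
	if (err != cudaSuccess) printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
	printf("Number of GPUs on this PC is %d\n",ngpu);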


michael@13900b MINGW64 /c/wse_github_vs/RichardAns/CUDA-Programs/Chapter01/gpusum/x64/Release (main)
$ ./gpusum 1000000000 1000
Number of GPUs on this PC is 2

Appendix F: Benchmark Dual 4090 GPUs

Running the two MSI 4090 Suprim Liquid-X GPUs at max RAM (22/24G x 2 and cores 16384 x 2) consumes 8A or 960 watts above the idle power of 140 watts. Running a fully saturated 13900KS system at 80% CPU and 200% GPU consumes 12.2A or 1460 watts, hence the need for the 10 CPU+PCIe connectors on a 1600 watt PSU, which is then running well above its efficiency peak of 800W (50% load).

The actual performance of the system is 58 x 2 TFlops = 116 TFlops, even with the lack of NVLink and the default 16x/4x PCIe lanes, for a non-RAM bound test.

Use the following multi GPU capable library from Ville Timonen instead of furmark.

https://github.com/wilicc/gpu-burn


PS C:\wse_github> git clone https://github.com/wilicc/gpu-burn

PS C:\wse_github\gpu-burn> docker build -t gpu_burn .

PS C:\wse_github\gpu-burn> docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-8c7dc11e-6825-08c1-f05d-5cff6d4ad6db)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-511a0768-717f-2b3b-0133-b49b7d315929)
Using compare file: compare.ptx
Burning for 60 seconds.
11.7%  proc'd: 231 (58146 Gflop/s) - 308 (58428 Gflop/s)   errors: 0 - 0   temps: 58 C - 59 C
        Summary at:   Sat Jul 29 03:02:10 UTC 2023

23.3%  proc'd: 616 (58592 Gflop/s) - 693 (58540 Gflop/s)   errors: 0 - 0   temps: 58 C - 60 C
        Summary at:   Sat Jul 29 03:02:17 UTC 2023

36.7%  proc'd: 1001 (58460 Gflop/s) - 1078 (58479 Gflop/s)   errors: 0 - 0   temps: 59 C - 53 C
        Summary at:   Sat Jul 29 03:02:25 UTC 2023

48.3%  proc'd: 1386 (58381 Gflop/s) - 1463 (58600 Gflop/s)   errors: 0 - 0   temps: 59 C - 60 C
        Summary at:   Sat Jul 29 03:02:32 UTC 2023

60.0%  proc'd: 1771 (58153 Gflop/s) - 1848 (58534 Gflop/s)   errors: 0 - 0   temps: 61 C - 60 C
        Summary at:   Sat Jul 29 03:02:39 UTC 2023

71.7%  proc'd: 2156 (58299 Gflop/s) - 2233 (58364 Gflop/s)   errors: 0 - 0   temps: 60 C - 61 C
        Summary at:   Sat Jul 29 03:02:46 UTC 2023

83.3%  proc'd: 2541 (58267 Gflop/s) - 2541 (58366 Gflop/s)   errors: 0 - 0   temps: 61 C - 61 C
        Summary at:   Sat Jul 29 03:02:53 UTC 2023

93.3%  proc'd: 2849 (58368 Gflop/s) - 2926 (58350 Gflop/s)   errors: 0 - 0   temps: 62 C - 61 C
        Summary at:   Sat Jul 29 03:02:59 UTC 2023

100.0%  proc'd: 3080 (58489 Gflop/s) - 3157 (58364 Gflop/s)   errors: 0 - 0   temps: 62 C - 61 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 0
Uninitted cublas
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 1 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 1
Uninitted cublas
done

Tested 2 GPUs:
        GPU 0: OK
        GPU 1: OK


overclock 0

PS C:\wse_github\gpu-burn> docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-8c7dc11e-6825-08c1-f05d-5cff6d4ad6db)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-511a0768-717f-2b3b-0133-b49b7d315929)
Using compare file: compare.ptx
Burning for 60 seconds.
11.7%  proc'd: 308 (60443 Gflop/s) - 231 (57900 Gflop/s)   errors: 0 - 0   temps: 57 C - 61 C
        Summary at:   Sun Jul 30 23:27:20 UTC 2023

23.3%  proc'd: 693 (60566 Gflop/s) - 616 (57938 Gflop/s)   errors: 0 - 0   temps: 58 C - 61 C
        Summary at:   Sun Jul 30 23:27:27 UTC 2023

35.0%  proc'd: 1001 (60602 Gflop/s) - 1001 (57692 Gflop/s)   errors: 0 - 0   temps: 58 C - 61 C
        Summary at:   Sun Jul 30 23:27:34 UTC 2023

46.7%  proc'd: 1386 (60485 Gflop/s) - 1386 (57818 Gflop/s)   errors: 0 - 0   temps: 59 C - 61 C
        Summary at:   Sun Jul 30 23:27:41 UTC 2023

58.3%  proc'd: 1848 (60455 Gflop/s) - 1694 (57734 Gflop/s)   errors: 0 - 0   temps: 60 C - 61 C
        Summary at:   Sun Jul 30 23:27:48 UTC 2023

68.3%  proc'd: 2156 (60404 Gflop/s) - 2002 (57703 Gflop/s)   errors: 0 - 0   temps: 61 C - 61 C
        Summary at:   Sun Jul 30 23:27:54 UTC 2023

80.0%  proc'd: 2541 (60479 Gflop/s) - 2387 (57735 Gflop/s)   errors: 0 - 0   temps: 61 C - 61 C
        Summary at:   Sun Jul 30 23:28:01 UTC 2023

91.7%  proc'd: 2926 (60374 Gflop/s) - 2772 (57690 Gflop/s)   errors: 0 - 0   temps: 60 C - 61 C
        Summary at:   Sun Jul 30 23:28:08 UTC 2023

100.0%  proc'd: 3234 (60315 Gflop/s) - 3157 (57895 Gflop/s)   errors: 0 - 0   temps: 62 C - 55 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 0
Uninitted cublas
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 1 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 1
Uninitted cublas
done

Tested 2 GPUs:
        GPU 0: OK
        GPU 1: OK
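
The combined throughput quoted at the start of this appendix can be read straight off the gpu-burn progress lines. The sketch below sums the per-GPU Gflop/s from the final progress line of each of the two runs above; the lines are copied from that output, and the helper name `gflops_per_gpu` and the run labels are illustrative assumptions.

```python
import re

# Final progress line of each gpu-burn run above, copied verbatim.
FINAL_LINES = {
    "first run": "100.0%  proc'd: 3080 (58489 Gflop/s) - 3157 (58364 Gflop/s)   errors: 0 - 0   temps: 62 C - 61 C",
    "second run": "100.0%  proc'd: 3234 (60315 Gflop/s) - 3157 (57895 Gflop/s)   errors: 0 - 0   temps: 62 C - 55 C",
}


def gflops_per_gpu(line: str) -> list[int]:
    """Extract the Gflop/s value reported for each GPU on a progress line."""
    return [int(value) for value in re.findall(r"\((\d+) Gflop/s\)", line)]


totals_tflops = {}
for label, line in FINAL_LINES.items():
    per_gpu = gflops_per_gpu(line)
    totals_tflops[label] = sum(per_gpu) / 1000  # Gflop/s to TFlops
    print(f"{label}: {per_gpu} Gflop/s, total {totals_tflops[label]:.1f} TFlops")
```

Both runs land between 116 and 119 TFlops combined, in line with the 58 x 2 estimate.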

Appendix G: Benchmark Quadro RTX A4000 workstation GPU - 2021 - Ampere - GA104

micha@13900a MINGW64 /c/wse_github/gpu-burn (master)
$ docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA RTX A4000 (UUID: GPU-c585959e-3209-a20e-2522-e9420f268bc8)
Using compare file: compare.ptx
Burning for 60 seconds.
16.7%  proc'd: 50 (8805 Gflop/s)   errors: 0   temps: 63 C
        Summary at:   Sat Aug 19 19:49:53 UTC 2023
33.3%  proc'd: 150 (10864 Gflop/s)   errors: 0   temps: 70 C
        Summary at:   Sat Aug 19 19:50:03 UTC 2023
45.0%  proc'd: 250 (10477 Gflop/s)   errors: 0   temps: 75 C
        Summary at:   Sat Aug 19 19:50:10 UTC 2023
58.3%  proc'd: 300 (10344 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Sat Aug 19 19:50:18 UTC 2023
71.7%  proc'd: 400 (10158 Gflop/s)   errors: 0   temps: 79 C
        Summary at:   Sat Aug 19 19:50:26 UTC 2023
83.3%  proc'd: 450 (10092 Gflop/s)   errors: 0   temps: 80 C
        Summary at:   Sat Aug 19 19:50:33 UTC 2023
98.3%  proc'd: 550 (10000 Gflop/s)   errors: 0   temps: 81 C
        Summary at:   Sat Aug 19 19:50:42 UTC 2023
100.0%  proc'd: 600 (9948 Gflop/s)   errors: 0   temps: 81 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 16375 MB of memory (14897 MB available, using 13407 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 50 iterations
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
        GPU 0: OK

Appendix H: Benchmark Quad GCP L4 - no NVlink GPUs

Appendix I: Benchmark Quadro RTX-5000 in Lenovo P17 Gen1 workstation GPU - 2018 - Turing - TU104

Mobile Xeon W-10855M 2.8-4.7GHz version of https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/quadro-rtx-5000-data-sheet-us-nvidia-704120-r4-web.pdf

Benchmark


PS C:\Users\micha> nvidia-smi
Sun Aug 20 20:46:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 5000              WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   65C    P2             107W / 110W |  14998MiB / 16384MiB |    100%      Default |
|                                         |                      |                  N/A |

micha@LAPTOP-M4VQDR8K MINGW64 /c/_dev/gpu-burn (master)
$ docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: Quadro RTX 5000 (UUID: GPU-e180b2a0-1a44-e625-8ac1-551fe7b1ee35)
Using compare file: compare.ptx
Burning for 60 seconds.
18.3%  proc'd: 51 (5247 Gflop/s)   errors: 0   temps: 61 C
        Summary at:   Mon Aug 21 00:39:19 UTC 2023
31.7%  proc'd: 102 (6796 Gflop/s)   errors: 0   temps: 62 C
        Summary at:   Mon Aug 21 00:39:27 UTC 2023
45.0%  proc'd: 153 (6737 Gflop/s)   errors: 0   temps: 65 C
        Summary at:   Mon Aug 21 00:39:35 UTC 2023
58.3%  proc'd: 153 (6737 Gflop/s)   errors: 0   temps: 66 C
        Summary at:   Mon Aug 21 00:39:43 UTC 2023
73.3%  proc'd: 255 (6646 Gflop/s)   errors: 0   temps: 67 C
        Summary at:   Mon Aug 21 00:39:52 UTC 2023
88.3%  proc'd: 306 (6616 Gflop/s)   errors: 0   temps: 67 C
        Summary at:   Mon Aug 21 00:40:01 UTC 2023
100.0%  proc'd: 306 (6616 Gflop/s)   errors: 0   temps: 68 C
        Summary at:   Mon Aug 21 00:40:08 UTC 2023

100.0%  proc'd: 357 (6607 Gflop/s)   errors: 0   temps: 68 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 16383 MB of memory (15085 MB available, using 13576 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 51 iterations
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
        GPU 0: OK
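
To compare runs across cards it helps to reduce each gpu_burn log to a few numbers. The following is a minimal sketch, assuming the single-GPU progress-line format shown in the logs above (multi-GPU runs print several values per field and would need a different pattern). The names `PROGRESS` and `summarize` are illustrative and not part of gpu_burn.

```python
import re
from statistics import mean

# Matches single-GPU progress lines such as:
#   58.3%  proc'd: 300 (10344 Gflop/s)   errors: 0   temps: 77 C
PROGRESS = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)%\s+proc'd:\s+(?P<procd>\d+)\s+"
    r"\((?P<gflops>\d+)\s+Gflop/s\)\s+errors:\s+(?P<errors>\d+)\s+"
    r"temps:\s+(?P<temp>\d+)\s+C"
)


def summarize(log_text):
    """Summarize throughput, errors and temperature from a gpu_burn log."""
    rows = [m.groupdict() for m in PROGRESS.finditer(log_text)]
    if not rows:
        raise ValueError("no gpu_burn progress lines found")
    gflops = [int(r["gflops"]) for r in rows]
    return {
        "samples": len(rows),
        "mean_gflops": round(mean(gflops)),
        "final_gflops": gflops[-1],
        "max_temp_c": max(int(r["temp"]) for r in rows),
        "max_errors": max(int(r["errors"]) for r in rows),
    }


if __name__ == "__main__":
    # First and last progress lines of the Quadro RTX 5000 run above.
    sample = (
        "18.3%  proc'd: 51 (5247 Gflop/s)   errors: 0   temps: 61 C\n"
        "100.0%  proc'd: 357 (6607 Gflop/s)   errors: 0   temps: 68 C\n"
    )
    print(summarize(sample))
```

The first progress line of a run is usually lower than the rest while the card ramps up, so the final value is often a better steady-state figure than the mean.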

Appendix X: Older GPUs

Fermi

Kepler

Maxwell

GTX-970

C:\_dev>nvidia-smi
Mon Aug 21 20:33:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 970       WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   41C    P8              18W / 163W |    546MiB /  4096MiB |     13%      Default |
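
The same figures can be collected without reading the table by eye, using the `--query-gpu` and `--format=csv` options of nvidia-smi. The sketch below is a hedged example: the field list is one reasonable selection, the helper names are illustrative, and the sample line is written by hand to mirror the GTX 970 table above rather than captured from the tool.

```python
import csv
import io
import subprocess

# nvidia-smi query properties, in the order they are requested.
FIELDS = [
    "name",
    "temperature.gpu",
    "power.draw",
    "memory.used",
    "memory.total",
    "utilization.gpu",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in reader if row]


def query_gpus():
    """Run nvidia-smi (must be on PATH) and return the parsed rows."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    # Hand-written sample line; call query_gpus() instead on a GPU host.
    sample = "NVIDIA GeForce GTX 970, 41, 18.00, 546, 4096, 13\n"
    for gpu in parse_gpu_csv(sample):
        print(gpu)
```

Values are returned as strings because nvidia-smi can report `[N/A]` for fields a card does not support.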

Pascal

Volta

Ampere

Appendix Y: NVidia GPU Comparisons

GPU Matrix

Data Center GPUs

https://cloud.google.com/nvidia
https://aws.amazon.com/nvidia/
Full 400 GBps NVLink
https://www.coreweave.com/products/hgx-h100
https://developer.nvidia.com/blog/breaking-mlperf-training-records-with-nvidia-h100-gpus/
https://nvidianews.nvidia.com/news/nvidia-announces-dgx-gh200-ai-supercomputer
H3 VMs are available in public preview in select zones (us-central1-a, europe-west4-b/c). Note that the H3 series is CPU-only; on GCP the H100 GPUs are attached to A3 VMs (for example a3-highgpu-8g).

gcloud compute instances create gpu-h100-usc1a \
  --project=cuda-obs \
  --zone=us-central1-a \
  --machine-type=h3-standard-88 \
  --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default \
  --maintenance-policy=TERMINATE \
  --provisioning-model=STANDARD \
  --service-account=816469424864-compute@developer.gserviceaccount.com \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --tags=http-server,https-server \
  --create-disk=auto-delete=yes,boot=yes,device-name=gpu-h100-usc1a,image=projects/debian-cloud/global/images/debian-11-bullseye-v20230814,mode=rw,size=10,type=projects/cuda-obs/zones/us-central1-a/diskTypes/pd-balanced \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --labels=goog-ec-src=vm_add-gcloud \
  --reservation-affinity=any

Quadro : RTX Workstation GPUs

GeForce : GTX Consumer GPUs

As of the 40xx series there is no more NVLink on consumer cards.

Appendix Z1: Google Cloud Developer Day Presentation

Plans

GCP HPC vs On-Prem HPC - LLM from scratch as part of preparing for Generative AI
alternate: GCP HPC vs On-Prem HPC - for live drone AI Entity Extraction

GCP HPC vs On-Prem HPC - LLM from scratch as part of preparing for Generative AI

Walkthrough

Start with the following site - read it in its entirety https://jaykmody.com/blog/gpt-from-scratch/ or https://github.com/lm-sys/FastChat

Code

Run Tensorflow in docker

Run CUDA in Kubernetes

Run CUDA in docker

Dockerfile

FROM nvidia/cuda:12.2.0-devel-ubi8
CMD nvidia-smi

docker build -t nvidia-smi .
docker run --rm --gpus all nvidia-smi 

Mon Oct  9 20:23:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro P1000                   On  | 00000000:01:00.0  On |                  N/A |
| N/A   45C    P8              N/A /  20W |    543MiB /  4096MiB |      4%      Default |

The key to GPU passthrough in docker is the --gpus flag. If you don't set it you will get the following:

$ docker run --rm nvidia-smi

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

/bin/sh: nvidia-smi: command not found
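
When these containers are launched from scripts, the flag is easy to drop by accident. The sketch below builds the `docker run` command line with `--gpus` on by default and recognises the two failure messages shown above. Both helper names are hypothetical, and the image name `nvidia-smi` is the tag built earlier in this section.

```python
import shlex


def docker_run_args(image, gpus="all"):
    """Build a `docker run --rm` command line; gpus=None omits --gpus."""
    args = ["docker", "run", "--rm"]
    if gpus is not None:
        args += ["--gpus", gpus]
    return args + [image]


def missing_gpu_passthrough(output):
    """True if container output shows the symptoms of running without --gpus."""
    return (
        "NVIDIA Driver was not detected" in output
        or "nvidia-smi: command not found" in output
    )


if __name__ == "__main__":
    # Print the command instead of running it so this works without docker.
    print(shlex.join(docker_run_args("nvidia-smi")))
    print(shlex.join(docker_run_args("nvidia-smi", gpus=None)))
    print(missing_gpu_passthrough("/bin/sh: nvidia-smi: command not found"))
```

The argument list can be passed directly to `subprocess.run` on a host where docker and the NVIDIA Container Toolkit are installed.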

CUDA runtime in docker

CUDA development in docker

See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html

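Runtime image (CUDA libraries only):
docker run -dit --gpus all --name cuda-runtime nvidia/cuda:11.8.0-runtime-ubi8
or the development image (adds the CUDA toolkit, including nvcc):
docker run -dit --gpus all --name cuda-devel nvidia/cuda:11.8.0-devel-ubi8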

Ubuntu image

FROM ubuntu:22.04

# Install sudo, curl and gpg in the same layer as the index update so the package index is never stale
RUN apt-get update && apt-get install -y sudo curl gpg

# Create a non-root "docker" user with sudo rights (demo password only)
RUN useradd -m docker && echo "docker:docker" | chpasswd && adduser docker sudo

# Add the NVIDIA Container Toolkit apt repository and install the toolkit
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && apt-get update && apt-get install -y nvidia-container-toolkit
USER docker
CMD ["/bin/bash"]


sudo nvidia-ctk runtime configure --runtime=docker

Demo

Training

https://course.fast.ai/
Start with Introduction to Generative AI Learning Path at GCP - https://www.cloudskillsboost.google/journeys/118
Next - Generative AI for Developers Learning Path at GCP - https://www.cloudskillsboost.google/journeys/183

Resources

Richard Ansorge of Cambridge

Deep Learning on NVidia and M2 GPUs

News

https://venturebeat.com/ai/mistral-ai-europe-startup-releases-mistral-7b-model/

LLM Requirements

https://www.techtarget.com/searchstorage/news/366537138/Storages-role-in-generative-AI

Appendix: GPU Chips

Data Center:

Hopper (same generation as Ada Lovelace): H100

Ada Lovelace: L40, L40S

Professional: RTX

Pascal, Volta, Turing, Ampere, Ada Lovelace