
CUDA based HPC - LLM Training - Ground to GCP Cloud migration

Tracking - https://github.com/ObrienlabsDev/blog/issues/1 and https://github.com/ObrienlabsDev/blog/issues/6

Presentation

I am planning a talk on CUDA on Google Cloud (GCP) for late 2023. If you would like to attend, let me know at fmichaelobrien at google.com. I will be posting the official 1-2h meet details very late 2023 - after Google Next 23.

Meta LLaMA 2 70B model on Apple M2 Ultra 64G via https://github.com/ggerganov/llama.cpp https://github.com/ObrienlabsDev/machine-learning/issues/7

Abstract

HPC and GPU computing using the streaming processors in NVidia GPUs can be done on custom equipment like a dual MSI RTX 4090 system at 58 x 2 TFlops, running over 1000W, with access to 2 sets of 24G VRAM and 32k (32768) cores. If additional capacity or scaling is required, the use of cloud based GPUs like the Ada Lovelace L4 or L40 - until the Grace Hopper based H100 is generally available (public preview on GCP as of 1 Sept 2023) - is advisable. This article describes GPU onboarding to GCP from a development and throughput perspective, working from the ground up through C++ based CUDA and on through the layers of ML/deep-learning libraries, towards use cases such as LLM training or real time video entity extraction.

We will review the use of CUDA executables both in NVidia workstation VMs on GCP and via Kubernetes containers in Kubeflow.

Nvidia and Google Cloud

Nvidia represents the new supercomputer manufacturer. When CUDA was introduced by Nvidia - essentially opening up an exponential multiplier of streaming GPU processors - we could not have imagined we would be at 16k (16384) cores per chip in 2023. A dual 4090 represents 32k (32768) parallel processors (close to the 64k of the original Connection Machine of 1988). Nvidia has flipped the normal arrangement of the GPU beside the CPU as a co-processor, effectively rendering the CPU partially irrelevant (even a 13900KS with 32 threads). The HPC system is now effectively a single large GPU (whether via NVLink for data center GPUs or PCIe 5.0 for consumer GPUs) - where the CPU is the coprocessor that feeds the GPU.

CUDA

Disclaimer: I am new to CUDA - I last worked with GPUs under DirectX 5.0 in 1999 - so I am going directly to Compute Capability 8.9 under Ada Lovelace.

CUDA on GCP

GCP G2 VM running Linux with 2 L4 GPUs and the Deep Learning image


Dual L4 on g2-standard-24 (24 vCPU / 96G RAM) - running the DL image

michael@cloudshell:~ (clouddeploy-ol)$ gcloud config set project cuda-old
Updated property [core/project].
michael@cloudshell:~ (cuda-old)$ gcloud compute instances create l4-4-2 --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-4-2,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310,mode=rw,size=50,type=projects/cuda-old/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Created [https://www.googleapis.com/compute/v1/projects/cuda-old/zones/us-east4-c/instances/l4-4-2].
NAME: l4-4-2
ZONE: us-east4-c
MACHINE_TYPE: g2-standard-24
PREEMPTIBLE: 
INTERNAL_IP: 10.150.0.10
EXTERNAL_IP: 34.
STATUS: RUNNING

ssh

======================================
Welcome to the Google Deep Learning VM
======================================

Version: common-gpu.m113
Resources:
 * Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
 * Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
 * Google Group: https://groups.google.com/forum/#!forum/google-dl-platform

To reinstall Nvidia driver (if needed) run:
sudo /opt/deeplearning/install-driver.sh
Linux l4-4-2 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.

This VM requires Nvidia drivers to function correctly.   Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n] 

Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17......
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with
         this installation of the NVIDIA driver.

ok

Check the GPUs on the running VM (conda base environment):
(base) michael@l4-4-2:~$ nvidia-smi
Thu Nov 30 19:51:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA L4           Off  | 00000000:00:03.0 Off |                    0 |
| N/A   60C    P0    32W /  72W |      0MiB / 23034MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA L4           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   57C    P0    31W /  72W |      0MiB / 23034MiB |      7%      Default |
|                               |                      |                  N/A |

TensorFlow / Keras test ML training run

Run a standard concurrent-saturation TensorFlow/Keras ML job (CIFAR-100 from the University of Toronto) to check batch size optimums; 30 epochs gets close to 1.0 fitness - 25 avoids overfitting.

https://github.com/ObrienlabsDev/machine-learning



(base) michael@l4-4-2:~$ git clone https://github.com/ObrienlabsDev/machine-learning.git
(base) michael@l4-4-2:~/machine-learning$ vi environments/windows/src/tflow.py 
import tensorflow as tf

# Mirror the model across both L4 GPUs for synchronous data-parallel training
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
  # https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
  # https://keras.io/api/models/model/
  parallel_model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)
  loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
  # https://keras.io/api/models/model_training_apis/
  parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

# batch_size=2048 saturates both GPUs; 5120 and 7168 were also tried
parallel_model.fit(x_train, y_train, epochs=30, batch_size=2048)

(base) michael@l4-4-2:~/machine-learning$ cat environments/windows/Dockerfile 
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]

(base) michael@l4-4-2:~/machine-learning$ ./build.sh 
Sending build context to Docker daemon  6.656kB
Step 1/4 : FROM tensorflow/tensorflow:latest-gpu
latest-gpu: Pulling from tensorflow/tensorflow

successfully tagged ml-tensorflow-win:latest
2023-11-30 20:29:26.443809: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-30 20:29:26.497571: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-30 20:29:26.497614: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-30 20:29:26.499104: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-30 20:29:26.506731: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-30 20:29:31.435829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20795 MB memory:  -> device: 0, name: NVIDIA L4, pci bus id: 0000:00:03.0, compute capability: 8.9
2023-11-30 20:29:31.437782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20795 MB memory:  -> device: 1, name: NVIDIA L4, pci bus id: 0000:00:04.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 3s 0us/step
Epoch 1/30

2023-11-30 20:30:19.985861: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:20.001134: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:29.957119: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f9c6bf3a4f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-30 20:30:29.957184: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.957192: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (1): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.965061: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701376230.063893      80 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

25/25 [==============================] - 71s 317ms/step - loss: 4.9465 - accuracy: 0.0418
Epoch 2/30
25/25 [==============================] - 4s 142ms/step - loss: 3.8430 - accuracy: 0.1214
Epoch 3/30
25/25 [==============================] - 4s 142ms/step - loss: 3.3694 - accuracy: 0.1967
Epoch 4/30
25/25 [==============================] - 4s 143ms/step - loss: 3.0832 - accuracy: 0.2544
Epoch 5/30
25/25 [==============================] - 4s 143ms/step - loss: 2.7049 - accuracy: 0.3326
Epoch 6/30
25/25 [==============================] - 4s 143ms/step - loss: 2.3329 - accuracy: 0.4119
Epoch 7/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9781 - accuracy: 0.4824
Epoch 8/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9177 - accuracy: 0.4948
Epoch 9/30
25/25 [==============================] - 4s 142ms/step - loss: 1.4980 - accuracy: 0.5937
Epoch 10/30
25/25 [==============================] - 4s 144ms/step - loss: 1.3247 - accuracy: 0.6322
Epoch 11/30
25/25 [==============================] - 4s 142ms/step - loss: 1.0408 - accuracy: 0.7063
Epoch 12/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9150 - accuracy: 0.7439
Epoch 13/30
25/25 [==============================] - 4s 143ms/step - loss: 0.8210 - accuracy: 0.7648
Epoch 14/30
25/25 [==============================] - 4s 142ms/step - loss: 0.5581 - accuracy: 0.8424
Epoch 15/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4635 - accuracy: 0.8709
Epoch 16/30
25/25 [==============================] - 4s 142ms/step - loss: 0.4771 - accuracy: 0.8610
Epoch 17/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9404 - accuracy: 0.7228
Epoch 18/30
25/25 [==============================] - 4s 143ms/step - loss: 0.5478 - accuracy: 0.8385
Epoch 19/30
25/25 [==============================] - 4s 143ms/step - loss: 0.4107 - accuracy: 0.8867
Epoch 20/30
25/25 [==============================] - 4s 143ms/step - loss: 0.2424 - accuracy: 0.9345
Epoch 21/30
25/25 [==============================] - 4s 146ms/step - loss: 0.1677 - accuracy: 0.9587
Epoch 22/30
25/25 [==============================] - 4s 142ms/step - loss: 0.1419 - accuracy: 0.9659
Epoch 23/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1861 - accuracy: 0.9510
Epoch 24/30
25/25 [==============================] - 4s 141ms/step - loss: 0.2771 - accuracy: 0.9264
Epoch 25/30
25/25 [==============================] - 4s 142ms/step - loss: 0.2663 - accuracy: 0.9326
Epoch 26/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1710 - accuracy: 0.9600
Epoch 27/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4977 - accuracy: 0.8626
Epoch 28/30
25/25 [==============================] - 4s 141ms/step - loss: 0.6559 - accuracy: 0.8100
Epoch 29/30
25/25 [==============================] - 4s 143ms/step - loss: 0.3074 - accuracy: 0.9105
Epoch 30/30
25/25 [==============================] - 4s 143ms/step - loss: 0.1834 - accuracy: 0.9515
(base) michael@l4-4-2:~/machine-learning$ 
A second run with batch = 2048 and epochs = 25:
Epoch 24/25
25/25 [==============================] - 4s 144ms/step - loss: 0.2537 - accuracy: 0.9221
Epoch 25/25
25/25 [==============================] - 4s 145ms/step - loss: 0.2258 - accuracy: 0.9300
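As a rough throughput sanity check (a back-of-envelope sketch in Python; the 50,000-image CIFAR-100 training set and the ~4 s steady-state epochs are taken from the log above):

train_images = 50_000                              # CIFAR-100 training set
batch_size = 2048
steps_per_epoch = -(-train_images // batch_size)   # ceiling division
epoch_seconds = 4.0                                # steady-state time from the log
print(steps_per_epoch)                             # 25, matching the "25/25" bars
print(f"{train_images / epoch_seconds:,.0f} images/s across both L4s")  # ~12,500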

NVidia Virtual Workstation - Windows Server

Use the L4 (without NVLink) at US$2/h instead of A100s for now: https://console.cloud.google.com/marketplace/product/nvidia/nvidia-rtx-virtual-workstation-windows-server-2022


Via RDP (with a password change) on the default open port - us-central1 (see also the service account) on g2-standard-8 with an Nvidia L4 GPU.

CUDA on AWS

CUDA on Prem

Get yourself a good machine for local development before offloading to GCP for compute tasks. An example from 2015 is an Intel i7-5820K 6 core Haswell-E with an NVIDIA GTX 970 or newer RTX 2070. My current machine build is based on the Intel i9-13900K 8/16 core and an NVIDIA RTX 4090 MSI Liquid-X - see Appendix D: Build your local HPC.

Use Cases

LLM - Large Language Model Training

Drone Streaming Entity Extraction Machine Learning

Collatz

Design Issues

Appendix A: CUDA on Google Cloud GPU based VMs

Types of GPUs available on GCP

Get additional GPU quota

If you get "Failed to start nvidia-rtx-virtual-workstation-window-4-vm: Operation type [start] failed with message "Quota 'NVIDIA_L4_GPUS' exceeded. Limit: 1.0 in region us-central1."":

Example: ask for 2 GPU cards, up from the default 1 - to test aggregation.

Approval took about 2 min.

You only need to ask for NVIDIA_T4_VWS_GPUS and the bot will auto-add GPUS_ALL_REGIONS.

Your quota request for eventstream-dev has been approved and your project quota has been adjusted according to the following requested limits:

+--------------------+--------------------+-------------+-----------------+----------------+
| NAME               | DIMENSIONS         | REGION      | REQUESTED LIMIT | APPROVED LIMIT |
+--------------------+--------------------+-------------+-----------------+----------------+
| GPUS_ALL_REGIONS   |                    | GLOBAL      | 2               | 2              |
|                    |                    |             |                 |                |
| NVIDIA_T4_VWS_GPUS | region=us-central1 | us-central1 | 2               | 2              |
+--------------------+--------------------+-------------+-----------------+----------------+

The same applies to NVIDIA_L4_GPUS.

Create your own Nvidia VM

Use an official Nvidia Optimized VM from the marketplace


New NVIDIA HPC SDK GPU-Optimized Image deployment

using https://console.cloud.google.com/marketplace/vm/config/nvidia-ngc-public/nvidia-gpu-cloud-hpc-sdk-image

Create a project

Enable API

  • Compute Engine API
  • Cloud Deployment Manager V2 API
  • Cloud Runtime Configuration API

Increase Quota


Change NVIDIA A100 GPUs - us-central1 from 0 to 1


Change GPUs (all regions) from 0 to 1

2 of 3 Quotas


Fifth attempt - switched to a billing account with payment history.

Change NVIDIA A100 GPUs - us-central1 from 0 to 1


"Unfortunately, we are unable to grant you additional quota at this time. If this is a new project please wait 48h until you resubmit the request or until your Billing account has additional history."

Switching to another org with 1+ years of payment history - same result - an issue with A100 availability.

Switching GPU to V100 or T4


port 22 only

nvidia-gpu-cloud-hpc-sdk-image has resource level errors
nvidia-gpu-cloud-hpc-sdk-image-1-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"ZONE_RESOURCE_POOL_EXHAUSTED","ResourceErrorMessage":"The zone 'projects/cuda-obs/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}

Same in Singapore:
nvidia-gpu-cloud-hpc-sdk-image-1-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"ZONE_RESOURCE_POOL_EXHAUSTED","ResourceErrorMessage":"The zone 'projects/cuda-obs/zones/asia-southeast1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}

southamerica-east1-c has capacity


https://console.cloud.google.com/marketplace/product/nvidia/nvidia-rtx-virtual-workstation-windows-server-2022?project=cuda-old


Deployed OK


Copy the CUDA executable example to the GCP VM.


GCP

PS C:\_cuda> .\gpusum.exe 1000000000 10000
gpu sum = 1.9999998123, steps 1000000000 terms 10000 threads 512 time 7769.056 ms
PS C:\_cuda>

RTX 4090 (same binary - roughly 3.5x faster than the single L4 above)

micha@13900a MINGW64 /c/wse_github_vs/RichardAns/CUDA-Programs/Chapter01/gpusum/x64/Release (main)
$ ./gpusum.exe 1000000000 10000
gpu sum = 1.9999996113, steps 1000000000 terms 10000 threads 512 time 2233.573 ms

Image the VM

gcloud beta compute machine-images create nvidia-rtx-virtual-workstation-window-4-vm --project=cuda-old --description=nvidia-rtx-virtual-workstation-window-4-vm --source-instance=nvidia-rtx-virtual-workstation-window-4-vm --source-instance-zone=us-central1-a --storage-location=us

Note: you will not always be able to get quota above 1 GPU per region without involving your field sales rep. In this case spread your GPUs over separate regions (assuming you are not moving traffic between the VMs hosting the GPUs).


Try alternative quota requests - batch an entire set of regions and you may get approved, as opposed to making specific requests. As you can see, only us-east1 has capacity limits - the rest (us-south, us-west and us-east4/5) are ok.

Your quota request for cuda-old has been partially approved and your project quota has been adjusted according to the following requested limits:

+----------------+------------------+-----------+-----------------+----------------+
| NAME           | DIMENSIONS       | REGION    | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+------------------+-----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4  | us-east4  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-east5  | us-east5  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-south1 | us-south1 | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west1  | us-west1  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west2  | us-west2  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west3  | us-west3  | 2               | 2              |
|                |                  |           |                 |                |
| NVIDIA_L4_GPUS | region=us-west4  | us-west4  | 2               | 2              |
+----------------+------------------+-----------+-----------------+----------------+

Unfortunately, we were unable to grant your below quota request(s):

+----------------+-----------------+----------+
| NAME           | DIMENSIONS      | REGION   |
+----------------+-----------------+----------+
| NVIDIA_L4_GPUS | region=us-east1 | us-east1 |
+----------------+-----------------+----------+

Ask for 4 (note there is a lag on the quota screen - it will show 1 instead of the approved 2 for up to 10 min.):
| NAME           | DIMENSIONS      | REGION   | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+-----------------+----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4 | us-east4 |               4 |              4 |

In this case you will need to manually increase the global quota as well, from the previous 2:

| NAME             | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+------------+--------+-----------------+----------------+
| GPUS_ALL_REGIONS |            | GLOBAL |               6 |              4 |

You may run into VM capacity issues even in recently working regions:

A g2-standard-8 VM instance is currently unavailable in the us-east1-b zone

We can end up with the following after several 60-minute quota request turnarounds.


An alternative is to sign up for support (Basic is fine) and raise a P3 support ticket.


Another option is to switch billing accounts to one of your organizations that has more spend history - approvals will be easier - then switch the Billing ID back.

| NAME           | DIMENSIONS      | REGION   | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+-----------------+----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4 | us-east4 |               4 |              4 |

+------------------+------------+--------+-----------------+----------------+
| NAME             | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+------------+--------+-----------------+----------------+
| GPUS_ALL_REGIONS |            | GLOBAL |               4 |              4 |
+------------------+------------+--------+-----------------+----------------+

Design Issues

DI 20220807-1: Moving millions of small images between file systems

FinOps

Daily Usage Scenarios

0827-1157: 3.5 h on a single GPU (cost available tomorrow); 1739-1842: dual GPU - total $4.75/h.

  • i7-5820K + GTX-970 (2015)
  • i7-5820K + RTX-2070 (2019)
  • i9-13900K + RTX-4090 (2023)

Monthly Usage Scenarios

Sustained Use

Dual L4 GPU


Check GPU status

PS C:\Users\michael> nvidia-smi
Sun Aug 20 12:40:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.25                 Driver Version: 536.25       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                    WDDM  | 00000000:00:03.0 Off |                    0 |
| N/A   43C    P8              13W /  72W |    126MiB / 23034MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L4                    WDDM  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8              13W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

4 GPUs
PS C:\Users\michael> nvidia-smi
Sun Aug 20 13:17:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.25                 Driver Version: 536.25       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L4                    WDDM  | 00000000:00:03.0 Off |                    0 |
| N/A   52C    P8              14W /  72W |    143MiB / 23034MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L4                    WDDM  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8              12W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA L4                    WDDM  | 00000000:00:05.0 Off |                    0 |
| N/A   46C    P8              12W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA L4                    WDDM  | 00000000:00:06.0 Off |                    0 |
| N/A   50C    P8              13W /  72W |      0MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

There is no NVLink on L4s, but we can still run CUDA jobs across all GPUs using the PCIe bus (the NVLink queries below return nothing). A multi-GPU sketch follows the commands.

PS C:\Users\michael> nvidia-smi nvlink --status -i 0
PS C:\Users\michael>  nvidia-smi nvlink -g 0 -i 0
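Without NVLink, multi-GPU work on this VM is data-parallel over PCIe. A minimal sketch of spreading the earlier ResNet50 job across all four L4s (assumes the same TensorFlow container as before; MirroredStrategy with no device list claims every visible GPU and all-reduces gradients over PCIe/NCCL):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")        # expect 4 on the quad-L4 VM

# No device list = mirror across all visible GPUs; gradient all-reduce
# travels over the PCIe bus since the L4 has no NVLink.
strategy = tf.distribute.MirroredStrategy()

(x_train, y_train), _ = tf.keras.datasets.cifar100.load_data()
with strategy.scope():
  model = tf.keras.applications.ResNet50(
    include_top=True, weights=None, input_shape=(32, 32, 3), classes=100)
  model.compile(optimizer="adam",
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=["accuracy"])

# Scale the global batch with the replica count to keep the per-GPU
# batch at 1024, the same as the dual-GPU run above.
model.fit(x_train, y_train, epochs=30,
          batch_size=1024 * strategy.num_replicas_in_sync)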

Reservations

Appendix D: Build your local HPC

CSPs like GCP provide scalable on-demand GPUs like the A100. A properly configured local system can also be used for development as an adjacent dev system to the cloud. The current (as of mid 2023) top consumer GPU from NVIDIA is the 4090. For optimal performance purchase the MSI Suprim Liquid-X, which is factory overclocked and includes an integrated 240mm AIO liquid cooler.

NVIDIA Quadro RTX GPUs for professional workstations

The NVidia Quadro (now RTX) family of GPUs benefits from the following additional capabilities over consumer grade GeForce GPUs - however this comes at a higher cost.

  • Stability and durability under sustained load, due to additional binning of select chips that have undergone extra ISV testing for TDP profiles
  • Improved double precision floating point performance (for cases where deep learning benefits from not reducing precision)
  • ECC memory to reduce errors and crashes
  • Commercial scientific/design software often mandates ECC memory
  • Physical profiles that allow closer single/dual slot placement
  • Multi-GPU support up to the Ampere generation via 112 GBps NVLink - for shared memory without using the PCIe x16 bus

NVIDIA Quadro RTX A4000

The RTX-A4000 does not include 112GBps NVLink like the A4500 and above. Note: the Ada generation A4500 and above no longer have NVLink and rely on the PCIe bus, which runs at ~2GBps per lane - up to ~32GBps per direction for PCIe 4.0 x16, doubled to ~63GBps by PCIe 5.0. That is still roughly half the speed of NVLink, and only if we run the RTX cards at full x16 on Xeon boards. On Z790 boards a dual-card setup splits into x8 lanes - yielding 16GBps theoretical and closer to 8GBps in practice. The sketch below compares the link speeds.
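A quick comparison (a sketch; uses the per-direction figures above, and assumes the 112 GBps NVLink figure covers both directions - real transfers land lower):

# Seconds to move a full 24 GB frame buffer, one direction, at theoretical rates
payload_gb = 24
links_gbs = {
    "NVLink (A4500-class)":     112 / 2,  # assuming 112 GB/s is both directions
    "PCIe 4.0 x16":             2 * 16,   # ~2 GB/s per lane
    "PCIe 5.0 x16":             4 * 16,
    "PCIe 4.0 x8 (Z790 split)": 2 * 8,
}
for name, gbs in links_gbs.items():
    print(f"{name:26s} {payload_gb / gbs:5.2f} s")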

RTX A4000 Workstation Single Slot Placement

RTX A4000 Ampere generation (GA104) under GPU benchmark load = ~11 TFlops (20% of an RTX-4090) - https://github.com/ObrienlabsDev/blog/wiki/High-Performance-Computing#appendix-g-benchmark-quatro-rtx-a4000-workstation-gpu---2021---ampere---ga104


NVIDIA Quadro RTX A4500 with NVLink

The RTX-A4500 includes 112GBps NVLink.

RTX-A4500 Ampere generation professional workstation cards


2023: NVIDIA RTX-4090 MSI Suprim Liquid-X 24GB (16384 streaming cores) on an AIO liquid cooled Intel i9-13900KS 8/16 p/e core, 128GB

Dual NVIDIA RTX-4090 MSI Suprim Liquid-X with 32768 streaming cores

Running a dual 4090 Ada Lovelace architecture GPU setup requires the highest end components:

RTX-4090 Ada generation consumer cards

  • The power supply must be 1600W, for two reasons. First, each 4090 card requires 4 separate PCIe power connectors - combine this with 2 CPU power connectors to the motherboard and you require 10 8-pin power outputs; most 1500W supplies only have 9. Second, a power supply is most efficient running at 50% load, and as you will be running up to 11A (close to 1400W) you need headroom. A 3rd 4090 is infeasible as it would overload the 1800W limit of a single 15A line.
  • Cost: $10100 ($1000 CPU + $6000 2 x GPU + $800 motherboard + $800 192G RAM + $500 AIO + $600 2 x NVMe SSD + $200 7000 case with room for the two 4090 coolers + $200 OS) - amortize this over 4 years to get about $210/month.
  • Electricity cost: $0.03 to $0.15/kWh; double this for debt retirement charges, giving an average of $0.15/kWh, or about $110/month for a max load dual 4090.
  • Total cost of the dual 4090 on-prem workstation is about $320/month (worked through in the sketch below). Compare this to a dual GCP L4, which has the performance of an RTX-4060 Ti but the memory (24G) of an RTX-4090.
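The monthly figures above, worked through (a sketch using the article's own numbers; the ~1 kW average draw is an assumption chosen to land near the $110/month electricity estimate):

hardware = 10_100 / (4 * 12)                     # $10,100 over 4 years ~ $210/month
kwh_rate = 0.15                                  # $/kWh average, per the text
avg_draw_kw = 1.0                                # assumed average, below the 1.4 kW peak
electricity = avg_draw_kw * 24 * 30 * kwh_rate   # ~ $108/month
print(f"hardware ${hardware:.0f}/month + electricity ${electricity:.0f}/month "
      f"= ${hardware + electricity:.0f}/month")  # ~ $320/month total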

Runtime Metrics

  • CPU Power: Separate 340 watts peak
  • GPU Power: Separate 450 watts peak
  • CPU+GPU Power: Combined (CPU under 80% utilization) 720 watts (6A x 120V) peak

Power equation, using a clamp meter on the magnetic field around the split AC line: W = V x A = (120 V)(12 A) = 1440+ watts. This is getting close to the 1800 W limit of the 15A fused line, with everything else (like monitors) on a separate 15A line.


Thermals


Single NVIDIA RTX-4090


The i9-13900KS, with its stock overclock to 6.0 GHz, has been on back order in Canada for 6 weeks so far - the non-binned i9-13900K with AI overclocking on the Z790 board (reaching 6.1 GHz) is ok for now.


Under Intel Extreme Tuning and Furmark Stress Testing

FLIR Heatmap (CPU incoming radiator on top, GPU outgoing radiator on the side)


GPU running at 22% at 450 frames/sec, CPU at 100%

  • Cost $7000 ($1000 CPU + $3000 GPU)
  • CPU Power: Separate 340 watts peak
  • GPU Power: Separate 450 watts peak
  • CPU+GPU Power: Combined (CPU under 80% utilization) 720 watts (6A x 120V) peak
  • Compute Capability: 8.9 (the current highest level in 2023)

2020: Mobile NVIDIA Quadro RTX-5000 on Intel Xeon W-10855M, 128GB - https://thinkstation-specs.com/thinkpad/p17/

2015: NVIDIA GTX-970 4GB or RTX-2070 8GB on an AIO liquid cooled Intel i7-5820K 6 core Haswell-E, 64GB

Links

Appendix E: Setup a workstation with NVidia CUDA

CUDA on Windows

Install Visual Studio Community 2022

Get an NVidia developer account

Install NVidia CUDA and NSight Software

Verify Configuration

Fix CUDA Configuration: CUDA.props not found

Fix: in VS verify that Project | Build Customizations is set to the installed CUDA version when switching machines (12.1 to 12.2).

Error | MSB4019 | The imported project "C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.1.props" was not found. Confirm that the expression in the Import declaration "C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\\BuildCustomizations\CUDA 12.1.props" is correct, and that the file exists on disk. | gpusum | C:\wse_github_vs\RichardAns\CUDA-Programs\Chapter01\gpusum\gpusum_vs2022.vcxproj | 35 |  
VS integration missing for Windows SDK

See https://github.com/obrienlabs/CUDA-Programs/tree/main/Chapter01/gpusum, part of the book by Richard Ansorge of the University of Cambridge: https://www.cambridge.org/core/books/programming-in-parallel-with-cuda/C43652A69033C25AD6933368CDBE084C

Warning | MSB8003 | The WindowsSDKDir property is not defined. Some build tools may not be found. | gpusum | C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets | 513 |  


Fix: VS | Tools | Get Tools and Features... | Individual components | Windows 11 SDK (latest)

Check number of GPUs

	int ngpu = 0;
	cudaGetDeviceCount(&ngpu);
	printf("Number of GPUs on this PC is %d\n",ngpu);


michael@13900b MINGW64 /c/wse_github_vs/RichardAns/CUDA-Programs/Chapter01/gpusum/x64/Release (main)
$ ./gpusum 1000000000 1000
Number of GPUs on this PC is 2
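The same check from Python rather than CUDA C++ (a sketch for the TensorFlow environment used elsewhere in this article):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"Number of GPUs on this PC is {len(gpus)}")
for gpu in gpus:
  # Reports the device name and compute capability, similar to nvidia-smi
  print(tf.config.experimental.get_device_details(gpu))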

Appendix F: Benchmark Dual 4090 GPUs

Running the two MSI 4090 Suprim Liquid-X GPUs at max RAM (22/24G x 2) and cores (16384 x 2) consumes 8A, or 960 watts above the idle power of 140 watts. Running a fully saturated 13900KS system at 80% CPU and 200% GPU consumes 12.2A or 1460 watts - hence the need for the 10 CPU+PCIe connectors on a 1600 watt PSU, well above its efficiency peak of 800W.

The actual performance of the system is 58 x 2 = 116 TFlops, even without NVLink and with the default x16/x4 PCIe lanes - for a non-RAM-bound test.

Use the following multi-GPU capable stress test from Ville Timonen instead of Furmark:

https://github.com/wilicc/gpu-burn


PS C:\wse_github> git clone https://github.com/wilicc/gpu-burn

PS C:\wse_github\gpu-burn> docker build -t gpu_burn .

PS C:\wse_github\gpu-burn> docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-8c7dc11e-6825-08c1-f05d-5cff6d4ad6db)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-511a0768-717f-2b3b-0133-b49b7d315929)
Using compare file: compare.ptx
Burning for 60 seconds.
11.7%  proc'd: 231 (58146 Gflop/s) - 308 (58428 Gflop/s)   errors: 0 - 0   temps: 58 C - 59 C
        Summary at:   Sat Jul 29 03:02:10 UTC 2023

23.3%  proc'd: 616 (58592 Gflop/s) - 693 (58540 Gflop/s)   errors: 0 - 0   temps: 58 C - 60 C
        Summary at:   Sat Jul 29 03:02:17 UTC 2023

36.7%  proc'd: 1001 (58460 Gflop/s) - 1078 (58479 Gflop/s)   errors: 0 - 0   temps: 59 C - 53 C
        Summary at:   Sat Jul 29 03:02:25 UTC 2023

48.3%  proc'd: 1386 (58381 Gflop/s) - 1463 (58600 Gflop/s)   errors: 0 - 0   temps: 59 C - 60 C
        Summary at:   Sat Jul 29 03:02:32 UTC 2023

60.0%  proc'd: 1771 (58153 Gflop/s) - 1848 (58534 Gflop/s)   errors: 0 - 0   temps: 61 C - 60 C
        Summary at:   Sat Jul 29 03:02:39 UTC 2023

71.7%  proc'd: 2156 (58299 Gflop/s) - 2233 (58364 Gflop/s)   errors: 0 - 0   temps: 60 C - 61 C
        Summary at:   Sat Jul 29 03:02:46 UTC 2023

83.3%  proc'd: 2541 (58267 Gflop/s) - 2541 (58366 Gflop/s)   errors: 0 - 0   temps: 61 C - 61 C
        Summary at:   Sat Jul 29 03:02:53 UTC 2023

93.3%  proc'd: 2849 (58368 Gflop/s) - 2926 (58350 Gflop/s)   errors: 0 - 0   temps: 62 C - 61 C
        Summary at:   Sat Jul 29 03:02:59 UTC 2023

100.0%  proc'd: 3080 (58489 Gflop/s) - 3157 (58364 Gflop/s)   errors: 0 - 0   temps: 62 C - 61 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 0
Uninitted cublas
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 1 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 1
Uninitted cublas
done

Tested 2 GPUs:
        GPU 0: OK
        GPU 1: OK


Second run - overclock 0:

PS C:\wse_github\gpu-burn> docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-8c7dc11e-6825-08c1-f05d-5cff6d4ad6db)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-511a0768-717f-2b3b-0133-b49b7d315929)
Using compare file: compare.ptx
Burning for 60 seconds.
11.7%  proc'd: 308 (60443 Gflop/s) - 231 (57900 Gflop/s)   errors: 0 - 0   temps: 57 C - 61 C
        Summary at:   Sun Jul 30 23:27:20 UTC 2023

23.3%  proc'd: 693 (60566 Gflop/s) - 616 (57938 Gflop/s)   errors: 0 - 0   temps: 58 C - 61 C
        Summary at:   Sun Jul 30 23:27:27 UTC 2023

35.0%  proc'd: 1001 (60602 Gflop/s) - 1001 (57692 Gflop/s)   errors: 0 - 0   temps: 58 C - 61 C
        Summary at:   Sun Jul 30 23:27:34 UTC 2023

46.7%  proc'd: 1386 (60485 Gflop/s) - 1386 (57818 Gflop/s)   errors: 0 - 0   temps: 59 C - 61 C
        Summary at:   Sun Jul 30 23:27:41 UTC 2023

58.3%  proc'd: 1848 (60455 Gflop/s) - 1694 (57734 Gflop/s)   errors: 0 - 0   temps: 60 C - 61 C
        Summary at:   Sun Jul 30 23:27:48 UTC 2023

68.3%  proc'd: 2156 (60404 Gflop/s) - 2002 (57703 Gflop/s)   errors: 0 - 0   temps: 61 C - 61 C
        Summary at:   Sun Jul 30 23:27:54 UTC 2023

80.0%  proc'd: 2541 (60479 Gflop/s) - 2387 (57735 Gflop/s)   errors: 0 - 0   temps: 61 C - 61 C
        Summary at:   Sun Jul 30 23:28:01 UTC 2023

91.7%  proc'd: 2926 (60374 Gflop/s) - 2772 (57690 Gflop/s)   errors: 0 - 0   temps: 60 C - 61 C
        Summary at:   Sun Jul 30 23:28:08 UTC 2023

100.0%  proc'd: 3234 (60315 Gflop/s) - 3157 (57895 Gflop/s)   errors: 0 - 0   temps: 62 C - 55 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 0
Uninitted cublas
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 1 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 1
Uninitted cublas
done

Tested 2 GPUs:
        GPU 0: OK
        GPU 1: OK

Appendix G: Benchmark Quadro RTX A4000 workstation GPU - 2021 - Ampere - GA104

micha@13900a MINGW64 /c/wse_github/gpu-burn (master)
$ docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA RTX A4000 (UUID: GPU-c585959e-3209-a20e-2522-e9420f268bc8)
Using compare file: compare.ptx
Burning for 60 seconds.
16.7%  proc'd: 50 (8805 Gflop/s)   errors: 0   temps: 63 C
        Summary at:   Sat Aug 19 19:49:53 UTC 2023
33.3%  proc'd: 150 (10864 Gflop/s)   errors: 0   temps: 70 C
        Summary at:   Sat Aug 19 19:50:03 UTC 2023
45.0%  proc'd: 250 (10477 Gflop/s)   errors: 0   temps: 75 C
        Summary at:   Sat Aug 19 19:50:10 UTC 2023
58.3%  proc'd: 300 (10344 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Sat Aug 19 19:50:18 UTC 2023
71.7%  proc'd: 400 (10158 Gflop/s)   errors: 0   temps: 79 C
        Summary at:   Sat Aug 19 19:50:26 UTC 2023
83.3%  proc'd: 450 (10092 Gflop/s)   errors: 0   temps: 80 C
        Summary at:   Sat Aug 19 19:50:33 UTC 2023
98.3%  proc'd: 550 (10000 Gflop/s)   errors: 0   temps: 81 C
        Summary at:   Sat Aug 19 19:50:42 UTC 2023
100.0%  proc'd: 600 (9948 Gflop/s)   errors: 0   temps: 81 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 16375 MB of memory (14897 MB available, using 13407 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 50 iterations
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
        GPU 0: OK

Appendix H: Benchmark Quad GCP L4 - no NVlink GPUs

Appendix I: Benchmark Quadro RTX-5000 in Lenovo P17 Gen1 workstation GPU - 2018 - Turing - TU104

Benchmark



PS C:\Users\micha> nvidia-smi
Sun Aug 20 20:46:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 5000              WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   65C    P2             107W / 110W |  14998MiB / 16384MiB |    100%      Default |
|                                         |                      |                  N/A |

micha@LAPTOP-M4VQDR8K MINGW64 /c/_dev/gpu-burn (master)
$ docker run --rm --gpus all gpu_burn

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: Quadro RTX 5000 (UUID: GPU-e180b2a0-1a44-e625-8ac1-551fe7b1ee35)
Using compare file: compare.ptx
Burning for 60 seconds.
18.3%  proc'd: 51 (5247 Gflop/s)   errors: 0   temps: 61 C
        Summary at:   Mon Aug 21 00:39:19 UTC 2023
31.7%  proc'd: 102 (6796 Gflop/s)   errors: 0   temps: 62 C
        Summary at:   Mon Aug 21 00:39:27 UTC 2023
45.0%  proc'd: 153 (6737 Gflop/s)   errors: 0   temps: 65 C
        Summary at:   Mon Aug 21 00:39:35 UTC 2023
58.3%  proc'd: 153 (6737 Gflop/s)   errors: 0   temps: 66 C
        Summary at:   Mon Aug 21 00:39:43 UTC 2023
73.3%  proc'd: 255 (6646 Gflop/s)   errors: 0   temps: 67 C
        Summary at:   Mon Aug 21 00:39:52 UTC 2023
88.3%  proc'd: 306 (6616 Gflop/s)   errors: 0   temps: 67 C
        Summary at:   Mon Aug 21 00:40:01 UTC 2023
100.0%  proc'd: 306 (6616 Gflop/s)   errors: 0   temps: 68 C
        Summary at:   Mon Aug 21 00:40:08 UTC 2023

100.0%  proc'd: 357 (6607 Gflop/s)   errors: 0   temps: 68 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 16383 MB of memory (15085 MB available, using 13576 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 51 iterations
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
        GPU 0: OK

Appendix J: RTX-3500 Ada generation mobile GPU in Lenovo P1 Gen 6 - i7-13800H - 202310

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

GPU 0: NVIDIA RTX 3500 Ada Generation Laptop GPU (UUID: GPU-25326b0f-ad93-c319-7027-b0029d4aee8e)
Using compare file: compare.ptx
Burning for 60 seconds.
13.3%  proc'd: 74 (11539 Gflop/s)   errors: 0   temps: 77 C
        Summary at:   Mon Oct 30 23:44:52 UTC 2023

25.0%  proc'd: 111 (11484 Gflop/s)   errors: 0   temps: 80 C
        Summary at:   Mon Oct 30 23:44:59 UTC 2023

36.7%  proc'd: 222 (11651 Gflop/s)   errors: 0   temps: 81 C
        Summary at:   Mon Oct 30 23:45:06 UTC 2023

48.3%  proc'd: 296 (11519 Gflop/s)   errors: 0   temps: 81 C
        Summary at:   Mon Oct 30 23:45:13 UTC 2023

61.7%  proc'd: 370 (11379 Gflop/s)   errors: 0   temps: 83 C
        Summary at:   Mon Oct 30 23:45:21 UTC 2023

75.0%  proc'd: 444 (12958 Gflop/s)   errors: 0   temps: 85 C
        Summary at:   Mon Oct 30 23:45:29 UTC 2023

86.7%  proc'd: 555 (12996 Gflop/s)   errors: 0   temps: 85 C
        Summary at:   Mon Oct 30 23:45:36 UTC 2023

100.0%  proc'd: 629 (13163 Gflop/s)   errors: 0   temps: 85 C
        Summary at:   Mon Oct 30 23:45:44 UTC 2023

100.0%  proc'd: 666 (13124 Gflop/s)   errors: 0   temps: 85 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 12281 MB of memory (11119 MB available, using 10007 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 37 iterations
Freed memory for dev 0
Uninitted cublas
done

Tested 1 GPUs:
        GPU 0: OK

micha@p1gen6 MINGW64 /c/wse_github/gpu-burn (master)
$ nvidia-smi
Mon Oct 30 19:48:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.84                 Driver Version: 545.84       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 3500 Ada Gene...  WDDM  | 00000000:01:00.0 Off |                  Off |
| N/A   51C    P8               7W / 102W |    122MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+



Appendix X: Older GPUs

Fermi

Kepler

Maxwell

GTX-970

C:\_dev>nvidia-smi
Mon Aug 21 20:33:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67                 Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 970       WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   41C    P8              18W / 163W |    546MiB /  4096MiB |     13%      Default |

Pascal

Volta

Ampere

Appendix Y: NVidia GPU Comparisons

GPU Matrix

Data Center GPUs

Note: h3-standard-88 is a CPU-only machine family - for H100 GPUs use an A3 machine type such as a3-highgpu-8g. The command as run:

gcloud compute instances create gpu-h100-usc1a --project=cuda-obs --zone=us-central1-a --machine-type=h3-standard-88 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=816469424864-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=gpu-h100-usc1a,image=projects/debian-cloud/global/images/debian-11-bullseye-v20230814,mode=rw,size=10,type=projects/cuda-obs/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any

Quadro : RTX Workstation GPUs

GeForce : GTX Consumer GPUs

  • as of 40xx no more NVLink

Appendix Z1: Google Cloud Developer Day Presentation

Plans

  • GCP HPC vs On-Prem HPC - LLM from scratch as part of preparing for Generative AI
  • alternate: GCP HPC vs On-Prem HPC - for live drone AI Entity Extraction

GCP HPC vs On-Prem HPC - LLM from scratch as part of preparing for Generative AI

Walkthrough

Start with the following site - read it in its entirety https://jaykmody.com/blog/gpt-from-scratch/ or https://github.com/lm-sys/FastChat
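The core of that walkthrough is plain attention arithmetic. A minimal numpy sketch of the causal scaled dot-product attention inside one GPT head (the standard formula, not code from either repo):

import numpy as np

def softmax(x):
  e = np.exp(x - x.max(axis=-1, keepdims=True))
  return e / e.sum(axis=-1, keepdims=True)

def causal_attention(q, k, v):
  """softmax(q k^T / sqrt(d)) v, with future tokens masked out."""
  d = q.shape[-1]
  scores = q @ k.T / np.sqrt(d)                    # [seq, seq] similarities
  mask = np.triu(np.ones_like(scores), k=1)        # 1s above the diagonal = future
  scores = np.where(mask == 1, -1e10, scores)
  return softmax(scores) @ v                       # weighted sum of value vectors

q = k = v = np.random.randn(8, 64)                 # seq_len=8, d_head=64
print(causal_attention(q, k, v).shape)             # (8, 64)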

Code

Run Tensorflow in docker

Run CUDA in Kubernetes

Run CUDA in docker

Dockerfile

FROM nvidia/cuda:12.2.0-devel-ubi8
CMD nvidia-smi

docker build -t nvidia-smi .
docker run --rm --gpus all nvidia-smi

Mon Oct  9 20:23:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01              Driver Version: 536.67       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro P1000                   On  | 00000000:01:00.0  On |                  N/A |
| N/A   45C    P8              N/A /  20W |    543MiB /  4096MiB |      4%      Default |

The key to GPU passthrough to docker is the --gpus flag - if you don't set it you will get the following:

$ docker run --rm nvidia-smi

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

/bin/sh: nvidia-smi: command not found

CUDA runtime in docker

CUDA development in docker

See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html

runtime:
docker run -dit --name cuda-runtime nvidia/cuda:11.8.0-runtime-ubi8

or dev:
docker run -dit --name cuda-devel nvidia/cuda:11.8.0-devel-ubi8

Ubuntu image

FROM ubuntu:22.04

RUN apt-get update && apt-get -y install sudo

RUN useradd -m docker && echo "docker:docker" | chpasswd && adduser docker sudo

RUN apt-get install curl -y
RUN apt-get install gpg -y

# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && sudo apt-get update
RUN sudo apt-get install -y nvidia-container-toolkit
USER docker
CMD /bin/bash


sudo nvidia-ctk runtime configure --runtime=docker

Demo

Training

Appendix

Appendix A: LLM from scratch via Google C4 - Colossal Clean Crawled Corpus

P.256 of Generative Deep Learning, 2nd Edition - David Foster
https://towardsdatascience.com/how-to-build-an-llm-from-scratch-8c477768f1f9
https://github.com/allenai/allennlp/discussions/5056
https://support.terra.bio/hc/en-us/community/posts/4787320149915-Requester-Pays-Google-buckets-not-asking-for-project-to-bill

C4 = Colossal Clean Crawled Corpus. Started 20231203:0021 - estimated $100 US for GCS egress. An average of 300 Mbps (peaks of 900 Mbps) from the GCP bucket means 800GB x 8 = 6400 Gbits at 0.3 Gbps, or about 6 hours. Actual: 36GB in 26 min = 25MB/s = 200 Mbps, or about 11h - possibly limited by the HDD; go directly to NVMe next time. (See the transfer-time sketch after this list.)

  • checked 0845 - done
  • copy test HDD to HDD (no RAID): 0849 ~ 1330 = 4.5h
  • HDD to NVMe: 1400-1455 at ~250MB/s, ~1h
  • copy test NVMe to NVMe: 1456, 4-8 min at 3.4-1.4 GB/s with thermal throttling (990 Pro at 50% of its 8GB/s max)
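A transfer-time estimator for checks like the above (a sketch; the 800GB size and the rates are this run's observations):

def transfer_hours(size_gb, rate_mbits):
  """Sustained megabits/s to hours for size_gb gigabytes."""
  return size_gb * 8 * 1000 / rate_mbits / 3600

for label, mbits in [("GCS egress avg 300 Mbps", 300),
                     ("HDD-bound 200 Mbps", 200),
                     ("HDD to NVMe 250 MB/s", 250 * 8)]:
  print(f"{label:24s} {transfer_hours(800, mbits):4.1f} h")
# -> ~5.9 h, ~8.9 h, ~0.9 h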

$93 US for GCS egress.

Resources

Richard Ansorge of Cambridge

Deep Learning on NVidia and M2 GPUs

News

LLM Requirements

Appendix: TPU ASIC Chips

Add TPUv5 capability

Appendix: GPU Chips

Data Center:

Professional: RTX

Pascal, Volta, Turing, Ampere, Ada Lovelace

Consumer: Geforce
