CUDA based ‐ High Performance Computing ‐ LLM Training ‐ Ground to GCP Cloud Hybrid
Tracking - https://github.com/ObrienlabsDev/blog/issues/1 and https://github.com/ObrienlabsDev/blog/issues/6
I am planning a talk around CUDA on Google Cloud (GCP) for late 2023. If you would like to attend, let me know at fmichaelobrien at google.com. I will be posting the official 1-2h meet details very late 2023 - after Google Next 23.
Meta LLaMA 2 70B model on Apple M2 Ultra 64G via https://github.com/ggerganov/llama.cpp https://github.com/ObrienlabsDev/machine-learning/issues/7
HPC and GPU computing with the streaming processors in NVidia GPUs can be done on custom equipment like a dual MSI RTX 4090 setup at 58 x 2 TFlops, drawing over 1000W, with access to 2 sets of 24G VRAM and 32k (32768) cores. If additional capacity or scaling is required, then using cloud-based GPUs like the Ada Lovelace L4 or L40 - until the Grace Hopper generation H100 (public preview on GCP as of 1 Sept 2023) is broadly available - is advisable. This article describes GPU onboarding to GCP from a development and throughput perspective, working from the ground up through C++ based CUDA and the layers of ML/deep-learning libraries, towards use cases such as LLM training or real-time video entity extraction.
We will review the use of CUDA executables both in NVidia workstation VMs on GCP and via Kubernetes containers in Kubeflow.
Nvidia represents the new supercomputer manufacturer. When CUDA was introduced by Nvidia - essentially opening up an exponential multiplier of streaming GPU processors - we could not have imagined we would be at 16k (16384) cores per chip in 2023. A dual 4090 represents 32k (32768) parallel processors (close to the 64k processors of the original Connection Machine). Nvidia has flipped the traditional arrangement where the GPU sits beside the CPU as a co-processor, rendering the CPU partially irrelevant (even a 13900KS with 32 threads). The HPC system is now effectively a single GPU (whether over NVLink for data center GPUs or PCIe 5.0 for consumer GPUs) - where the CPU is the coprocessor that feeds the GPU.
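To see these SM and core counts for whatever GPUs are attached (local 4090s or cloud L4s), here is a minimal CUDA device-query sketch. The 128-cores-per-SM constant is an assumption that holds for Ada Lovelace (compute capability 8.9) but not for every architecture:

```cpp
// device_query.cu - sketch: enumerate GPUs and estimate CUDA core counts.
// Assumes 128 FP32 cores per SM (Ada Lovelace, CC 8.9, e.g. RTX 4090 / L4).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  for (int d = 0; d < count; ++d) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, d);
    printf("GPU %d: %s, CC %d.%d, %d SMs, ~%d cores, %.1f GB\n",
           d, p.name, p.major, p.minor, p.multiProcessorCount,
           p.multiProcessorCount * 128,          // Ada-only assumption
           p.totalGlobalMem / 1073741824.0);
  }
  return 0;
}
```

Compile with nvcc device_query.cu -o device_query; on a dual 4090 it should report 128 SMs (~16384 cores) per device.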
Disclaimer: I am new to CUDA - I last worked with GPUs under DirectX 5.0 in 1999 - so I am going directly to Compute Capability 8.9 under Ada Lovelace.
- Turn off the organization policies compute.vmExternalIpAccess and compute.requireShieldedVm first - see https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/issues/426 and https://github.com/GoogleCloudPlatform/pbmm-on-gcp-onboarding/issues/252 for details
- Performance is expected to be lower than an Nvidia RTX 4090 (16384 cores) OC liquid cooled: 7700 ms vs 2200 ms on a pi calculation (see the gpusum runs below)
- No need for a manual driver install (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=11&target_type=deb_local) - but follow https://cloud.google.com/compute/docs/gpus/install-drivers-gpu
- supplied drivers : | NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
michael@cloudshell:~ (clouddeploy-ol)$ gcloud config set project cuda-old
Updated property [core/project].
michael@cloudshell:~ (cuda-old)$ gcloud compute instances create l4-4-2 --project=cuda-old --zone=us-east4-c --machine-type=g2-standard-24 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=196717963363-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --accelerator=count=2,type=nvidia-l4 --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=l4-4-2,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310,mode=rw,size=50,type=projects/cuda-old/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
Created [https://www.googleapis.com/compute/v1/projects/cuda-old/zones/us-east4-c/instances/l4-4-2].
NAME: l4-4-2
ZONE: us-east4-c
MACHINE_TYPE: g2-standard-24
PREEMPTIBLE:
INTERNAL_IP: 10.150.0.10
EXTERNAL_IP: 34.
STATUS: RUNNING
SSH in:
======================================
Welcome to the Google Deep Learning VM
======================================
Version: common-gpu.m113
Resources:
* Google Deep Learning Platform StackOverflow: https://stackoverflow.com/questions/tagged/google-dl-platform
* Google Cloud Documentation: https://cloud.google.com/deep-learning-vm
* Google Group: https://groups.google.com/forum/#!forum/google-dl-platform
To reinstall Nvidia driver (if needed) run:
sudo /opt/deeplearning/install-driver.sh
Linux l4-4-2 5.10.0-26-cloud-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
This VM requires Nvidia drivers to function correctly. Installation takes ~1 minute.
Would you like to install the Nvidia driver? [y/n]
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17......
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with
this installation of the NVIDIA driver.
ok
Running on the Debian 11 / Python 3.10 deep learning VM:
(base) michael@l4-4-2:~$ nvidia-smi
Thu Nov 30 19:51:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA L4 Off | 00000000:00:03.0 Off | 0 |
| N/A 60C P0 32W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 Off | 00000000:00:04.0 Off | 0 |
| N/A 57C P0 31W / 72W | 0MiB / 23034MiB | 7% Default |
| | | N/A |
Run a standard concurrent-saturation TensorFlow/Keras ML job on the CIFAR-100 dataset from the University of Toronto to check batch size optimums; 30 epochs gets close to 1.0 training accuracy - 25 avoids overfitting.
https://github.com/ObrienlabsDev/machine-learning
(base) michael@l4-4-2:~$ git clone https://github.com/ObrienlabsDev/machine-learning.git
(base) michael@l4-4-2:~/machine-learning$ vi environments/windows/src/tflow.py
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()

with strategy.scope():
    # https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
    # https://keras.io/api/models/model/
    parallel_model = tf.keras.applications.ResNet50(
        include_top=True,
        weights=None,
        input_shape=(32, 32, 3),
        classes=100)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    # https://keras.io/api/models/model_training_apis/
    parallel_model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])

parallel_model.fit(x_train, y_train, epochs=30, batch_size=2048)  # also tried 5120 and 7168
(base) michael@l4-4-2:~/machine-learning$ cat environments/windows/Dockerfile
FROM tensorflow/tensorflow:latest-gpu
WORKDIR /src
COPY /src/tflow.py .
CMD ["python", "tflow.py"]
(base) michael@l4-4-2:~/machine-learning$ ./build.sh
Sending build context to Docker daemon 6.656kB
Step 1/4 : FROM tensorflow/tensorflow:latest-gpu
latest-gpu: Pulling from tensorflow/tensorflow
successfully tagged ml-tensorflow-win:latest
2023-11-30 20:29:26.443809: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-11-30 20:29:26.497571: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-30 20:29:26.497614: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-30 20:29:26.499104: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-30 20:29:26.506731: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-30 20:29:31.435829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20795 MB memory: -> device: 0, name: NVIDIA L4, pci bus id: 0000:00:03.0, compute capability: 8.9
2023-11-30 20:29:31.437782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20795 MB memory: -> device: 1, name: NVIDIA L4, pci bus id: 0000:00:04.0, compute capability: 8.9
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz
169001437/169001437 [==============================] - 3s 0us/step
Epoch 1/30
2023-11-30 20:30:19.985861: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:20.001134: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2023-11-30 20:30:29.957119: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f9c6bf3a4f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-30 20:30:29.957184: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.957192: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (1): NVIDIA L4, Compute Capability 8.9
2023-11-30 20:30:29.965061: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1701376230.063893 80 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
25/25 [==============================] - 71s 317ms/step - loss: 4.9465 - accuracy: 0.0418
Epoch 2/30
25/25 [==============================] - 4s 142ms/step - loss: 3.8430 - accuracy: 0.1214
Epoch 3/30
25/25 [==============================] - 4s 142ms/step - loss: 3.3694 - accuracy: 0.1967
Epoch 4/30
25/25 [==============================] - 4s 143ms/step - loss: 3.0832 - accuracy: 0.2544
Epoch 5/30
25/25 [==============================] - 4s 143ms/step - loss: 2.7049 - accuracy: 0.3326
Epoch 6/30
25/25 [==============================] - 4s 143ms/step - loss: 2.3329 - accuracy: 0.4119
Epoch 7/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9781 - accuracy: 0.4824
Epoch 8/30
25/25 [==============================] - 4s 143ms/step - loss: 1.9177 - accuracy: 0.4948
Epoch 9/30
25/25 [==============================] - 4s 142ms/step - loss: 1.4980 - accuracy: 0.5937
Epoch 10/30
25/25 [==============================] - 4s 144ms/step - loss: 1.3247 - accuracy: 0.6322
Epoch 11/30
25/25 [==============================] - 4s 142ms/step - loss: 1.0408 - accuracy: 0.7063
Epoch 12/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9150 - accuracy: 0.7439
Epoch 13/30
25/25 [==============================] - 4s 143ms/step - loss: 0.8210 - accuracy: 0.7648
Epoch 14/30
25/25 [==============================] - 4s 142ms/step - loss: 0.5581 - accuracy: 0.8424
Epoch 15/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4635 - accuracy: 0.8709
Epoch 16/30
25/25 [==============================] - 4s 142ms/step - loss: 0.4771 - accuracy: 0.8610
Epoch 17/30
25/25 [==============================] - 4s 142ms/step - loss: 0.9404 - accuracy: 0.7228
Epoch 18/30
25/25 [==============================] - 4s 143ms/step - loss: 0.5478 - accuracy: 0.8385
Epoch 19/30
25/25 [==============================] - 4s 143ms/step - loss: 0.4107 - accuracy: 0.8867
Epoch 20/30
25/25 [==============================] - 4s 143ms/step - loss: 0.2424 - accuracy: 0.9345
Epoch 21/30
25/25 [==============================] - 4s 146ms/step - loss: 0.1677 - accuracy: 0.9587
Epoch 22/30
25/25 [==============================] - 4s 142ms/step - loss: 0.1419 - accuracy: 0.9659
Epoch 23/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1861 - accuracy: 0.9510
Epoch 24/30
25/25 [==============================] - 4s 141ms/step - loss: 0.2771 - accuracy: 0.9264
Epoch 25/30
25/25 [==============================] - 4s 142ms/step - loss: 0.2663 - accuracy: 0.9326
Epoch 26/30
25/25 [==============================] - 4s 141ms/step - loss: 0.1710 - accuracy: 0.9600
Epoch 27/30
25/25 [==============================] - 4s 141ms/step - loss: 0.4977 - accuracy: 0.8626
Epoch 28/30
25/25 [==============================] - 4s 141ms/step - loss: 0.6559 - accuracy: 0.8100
Epoch 29/30
25/25 [==============================] - 4s 143ms/step - loss: 0.3074 - accuracy: 0.9105
Epoch 30/30
25/25 [==============================] - 4s 143ms/step - loss: 0.1834 - accuracy: 0.9515
(base) michael@l4-4-2:~/machine-learning$
Batch = 2048, epochs = 25
Epoch 24/25
25/25 [==============================] - 4s 144ms/step - loss: 0.2537 - accuracy: 0.9221
Epoch 25/25
25/25 [==============================] - 4s 145ms/step - loss: 0.2258 - accuracy: 0.9300
Use L4s (without NVLink) at US$2/h instead of A100s for now: https://console.cloud.google.com/marketplace/product/nvidia/nvidia-rtx-virtual-workstation-windows-server-2022
Connect via RDP (change the password first) on the default open port - us-central1 (see also the service account) - on a g2-standard-8 with an Nvidia L4 GPU.
- https://pages.awscloud.com/ec2-virtual-workstations-media-entertainment.html
- https://aws.amazon.com/solutions/case-studies/untold-studios-case-study/
Get yourself a good machine for local development before offloading to GCP for compute tasks. An example from 2015 is an Intel i7-5820K 6-core Haswell-E with an NVIDIA GTX 970, or a newer RTX 2070. My current machine build is based on the Intel i9-13900K 8/16 core and an NVIDIA RTX 4090 MSI Suprim Liquid-X - see Appendix A: Build your local HPC.
- The start of transformers - Google Brain's Transformer paper "Attention is all you need" https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
- https://www.width.ai/post/train-and-deploy-vicuna-and-fastchat
- https://www.projectpro.io/projects/data-science-projects/pytorch
- https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/
- Infiniband https://store.nvidia.com/en-us/networking/store/?page=1&limit=9&locale=en-us
- 65 billion parameter LLM on 2 x 4090 https://github.com/kuleshov-group/llmtune
- https://www.linkedin.com/pulse/google-io-updates-training-65b-model-single-gpu-hf-agent-cherukuri
- https://developers.google.com/machine-learning/crash-course/prereqs-and-prework
- https://cloud.google.com/blog/products/compute/introducing-g2-vms-with-nvidia-l4-gpus and https://www.nvidia.com/en-us/data-center/l4/
- Kubernetes and GPUs - https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#overview
- LLM from scratch using JAX - https://jaykmody.com/blog/gpt-from-scratch/
- https://github.com/ObrienlabsDev/blog/wiki/Drone-Streaming-Extraction
- Tracking Objects in Video - https://arxiv.org/pdf/2304.11968v2.pdf from https://www.linkedin.com/pulse/google-io-updates-training-65b-model-single-gpu-hf-agent-cherukuri
- https://cloud.google.com/video-intelligence
- https://cloud.google.com/blog/products/data-analytics/pubsub-cloud-storage-subscriptions-for-streaming-data-ingestion
- https://cloud.google.com/compute/docs/eol/k80-eol
- https://cloud.google.com/compute/docs/gpus#nvidia_t4_gpus
if you get "Failed to start nvidia-rtx-virtual-workstation-window-4-vm: Operation type [start] failed with message "Quota 'NVIDIA_L4_GPUS' exceeded. Limit: 1.0 in region us-central1.""
Turnaround was about 2 min.
You only need to ask for NVIDIA_T4_VWS_GPUS and the bot will auto add GPUS_ALL_REGIONS
Your quota request for eventstream-dev has been approved and your project quota has been adjusted according to the following requested limits:
+--------------------+--------------------+-------------+-----------------+----------------+
| NAME | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+--------------------+--------------------+-------------+-----------------+----------------+
| GPUS_ALL_REGIONS | | GLOBAL | 2 | 2 |
| | | | | |
| NVIDIA_T4_VWS_GPUS | region=us-central1 | us-central1 | 2 | 2 |
+--------------------+--------------------+-------------+-----------------+----------------+
and NVIDIA_L4_GPUS
- https://cloud.google.com/compute/docs/virtual-workstation/windows-gpu
- pricing https://cloud.google.com/compute/gpus-pricing#gpus
- see https://console.cloud.google.com/marketplace/product/nvidia-ngc-public/nvidia-gpu-optimized-vmi
- using the Compute Engine API, Cloud Deployment Manager V2 API, and Cloud Runtime Configuration API
Increase Quota:
- Change NVIDIA A100 GPUs - us-central1 from 0 to 1
- Change GPUs (all regions) from 0 to 1
2 of 3 quotas were approved.
On the 5th attempt - after switching to a billing account with payment history - the request to change NVIDIA A100 GPUs - us-central1 from 0 to 1 returned:
"Unfortunately, we are unable to grant you additional quota at this time. If this is a new project please wait 48h until you resubmit the request or until your Billing account has additional history."
Switching to another org with 1+ year of payment history - same thing - an issue with A100 availability
Switching GPU to V100 or T4
port 22 only
nvidia-gpu-cloud-hpc-sdk-image has resource level errors
nvidia-gpu-cloud-hpc-sdk-image-1-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"ZONE_RESOURCE_POOL_EXHAUSTED","ResourceErrorMessage":"The zone 'projects/cuda-obs/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}
The same in Singapore:
nvidia-gpu-cloud-hpc-sdk-image-1-vm: {"ResourceType":"compute.v1.instance","ResourceErrorCode":"ZONE_RESOURCE_POOL_EXHAUSTED","ResourceErrorMessage":"The zone 'projects/cuda-obs/zones/asia-southeast1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later."}
southamerica-east1-c has capacity - deployed OK.
Copy the CUDA executable example to the GCP VM and compare timings:
GCP
PS C:\_cuda> .\gpusum.exe 1000000000 10000
gpu sum = 1.9999998123, steps 1000000000 terms 10000 threads 512 time 7769.056 ms
PS C:\_cuda>
RTX 4090
micha@13900a MINGW64 /c/wse_github_vs/RichardAns/CUDA-Programs/Chapter01/gpusum/x64/Release (main)
$ ./gpusum.exe 1000000000 10000
gpu sum = 1.9999996113, steps 1000000000 terms 10000 threads 512 time 2233.573 ms
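For context, gpusum numerically integrates sin(x) over [0, pi] (exact answer 2.0). The sketch below is not Ansorge's implementation - just a minimal midpoint-rule version of the same workload, with a grid-stride loop and a block-level shared-memory reduction:

```cpp
// pisum.cu - hedged sketch of a gpusum-style benchmark: integrate sin(x)
// on [0, pi] with the midpoint rule; the exact answer is 2.0.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sinsum(double *partial, long steps, double h) {
  extern __shared__ double s[];
  long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
  long stride = (long)gridDim.x * blockDim.x;
  double acc = 0.0;
  for (long k = idx; k < steps; k += stride)      // grid-stride loop over strips
    acc += sin((k + 0.5) * h);                    // midpoint-rule sample
  s[threadIdx.x] = acc;
  __syncthreads();
  for (int w = blockDim.x / 2; w > 0; w >>= 1) {  // block-level reduction
    if (threadIdx.x < w) s[threadIdx.x] += s[threadIdx.x + w];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = s[0] * h;
}

int main() {
  const long steps = 1000000000L;   // matches the 1e9-step runs above
  const int threads = 512, blocks = 256;
  const double pi = 3.141592653589793;
  double h = pi / steps, host[blocks], *dev;
  cudaMalloc(&dev, blocks * sizeof(double));
  sinsum<<<blocks, threads, threads * sizeof(double)>>>(dev, steps, h);
  cudaMemcpy(host, dev, blocks * sizeof(double), cudaMemcpyDeviceToHost);
  double sum = 0.0;
  for (int b = 0; b < blocks; ++b) sum += host[b];
  printf("gpu sum = %.10f\n", sum);  // expect ~2.0
  cudaFree(dev);
  return 0;
}
```

Built with nvcc -O3 pisum.cu -o pisum it reproduces the ~2.0 sum; absolute timings will differ from the book's Taylor-series version quoted above.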
Image the VM
gcloud beta compute machine-images create nvidia-rtx-virtual-workstation-window-4-vm --project=cuda-old --description=nvidia-rtx-virtual-workstation-window-4-vm --source-instance=nvidia-rtx-virtual-workstation-window-4-vm --source-instance-zone=us-central1-a --storage-location=us
Note: you will not always be able to get quota above 1 GPU per region without involving your field sales rep. In that case, spread your GPUs over separate regions (assuming you are not moving traffic between the VMs hosting the GPUs).
Try alternative quota requests - batching an entire set of regions may get approved where specific per-region requests are not. As you can see, only us-east1 has capacity limits - the rest (us-east4/5, us-south1 and us-west1-4) are OK.
Your quota request for cuda-old has been partially approved and your project quota has been adjusted according to the following requested limits:
+----------------+------------------+-----------+-----------------+----------------+
| NAME | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+------------------+-----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4 | us-east4 | 2 | 2 |
| | | | | |
| NVIDIA_L4_GPUS | region=us-east5 | us-east5 | 2 | 2 |
| | | | | |
| NVIDIA_L4_GPUS | region=us-south1 | us-south1 | 2 | 2 |
| | | | | |
| NVIDIA_L4_GPUS | region=us-west1 | us-west1 | 2 | 2 |
| | | | | |
| NVIDIA_L4_GPUS | region=us-west2 | us-west2 | 2 | 2 |
| | | | | |
| NVIDIA_L4_GPUS | region=us-west3 | us-west3 | 2 | 2 |
| | | | | |
| NVIDIA_L4_GPUS | region=us-west4 | us-west4 | 2 | 2 |
+----------------+------------------+-----------+-----------------+----------------+
Unfortunately, we were unable to grant your below quota request(s):
+----------------+-----------------+----------+
| NAME | DIMENSIONS | REGION |
+----------------+-----------------+----------+
| NVIDIA_L4_GPUS | region=us-east1 | us-east1 |
Ask for 4 (note there is a lag on the quota screen - it will show 1 instead of the approved 2 for up to 10 min.)
| NAME | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+-----------------+----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4 | us-east4 | 4 | 4 |
in this case you will need to manually increase the global quota as well from the previous 2
| NAME | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+------------+--------+-----------------+----------------+
| GPUS_ALL_REGIONS | | GLOBAL | 6 | 4 |
you may run into VM capacity issues in recently working regions
A g2-standard-8 VM instance is currently unavailable in the us-east1-b zone
We can end up with the following after several quota request turnarounds.
An alternative is to sign up for support (Basic is fine) and raise a P3 support ticket.
Another option is to switch billing accounts to one of your organizations that has more spend history - approvals will be easier - then switch the Billing ID back.
| NAME | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+----------------+-----------------+----------+-----------------+----------------+
| NVIDIA_L4_GPUS | region=us-east4 | us-east4 | 4 | 4 |
| NAME | DIMENSIONS | REGION | REQUESTED LIMIT | APPROVED LIMIT |
+------------------+------------+--------+-----------------+----------------+
| GPUS_ALL_REGIONS | | GLOBAL | 4 | 4 |
- default is less than 1m per directory; use NVMe drives on one or both sides of the copy, or 10 Gbit ethernet
- https://colab.research.google.com/github/christianmerkwirth/colabs/blob/master/Understanding_Randomization_in_TF_Datasets.ipynb
0827-1157 = 3.5h single GPU (getting cost tomorrow); 1739-1842 dual GPU; total $4.75/h
i7-5820K + GTX 970 (2015), i7-5820K + RTX 2070 (2019), i9-13900K + RTX 4090 (2023)
Check GPU status
PS C:\Users\michael> nvidia-smi
Sun Aug 20 12:40:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.25 Driver Version: 536.25 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 WDDM | 00000000:00:03.0 Off | 0 |
| N/A 43C P8 13W / 72W | 126MiB / 23034MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 WDDM | 00000000:00:04.0 Off | 0 |
| N/A 50C P8 13W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
4 GPUs
PS C:\Users\michael> nvidia-smi
Sun Aug 20 13:17:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.25 Driver Version: 536.25 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L4 WDDM | 00000000:00:03.0 Off | 0 |
| N/A 52C P8 14W / 72W | 143MiB / 23034MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L4 WDDM | 00000000:00:04.0 Off | 0 |
| N/A 48C P8 12W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA L4 WDDM | 00000000:00:05.0 Off | 0 |
| N/A 46C P8 12W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA L4 WDDM | 00000000:00:06.0 Off | 0 |
| N/A 50C P8 13W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
No NVLink on L4s, but we can still run CUDA jobs across all GPUs using the PCIe bus.
PS C:\Users\michael> nvidia-smi nvlink --status -i 0
PS C:\Users\michael> nvidia-smi nvlink -g 0 -i 0
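The empty nvlink output above confirms there is no NVLink topology. Whether two GPUs can still address each other's memory directly (peer-to-peer over PCIe is platform-dependent) can be checked with cudaDeviceCanAccessPeer - a short sketch:

```cpp
// p2p_check.cu - sketch: report peer-to-peer accessibility between GPU pairs.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int a = 0; a < n; ++a)
    for (int b = 0; b < n; ++b) {
      if (a == b) continue;
      int ok = 0;
      cudaDeviceCanAccessPeer(&ok, a, b);  // 1 if GPU a can map GPU b's memory
      printf("GPU %d -> GPU %d : peer access %s\n", a, b, ok ? "YES" : "no");
    }
  return 0;
}
```

Even when peer access is unavailable, cudaMemcpyPeer still works - the runtime stages the transfer through host memory.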
CSPs like GCP provide scalable on-demand GPUs like the A100. A properly configured local system can also be used as an adjacent dev system to the cloud. The current top consumer card from NVIDIA as of mid-2023 is the 4090; for optimal performance purchase the MSI Suprim Liquid-X, which is factory overclocked and includes an integrated 240mm AIO liquid cooler.
The NVidia Quadro (now RTX) family of GPUs benefits from the following additional capabilities over consumer-grade GeForce GPUs - at a higher cost:
- Stability and durability under sustained load, due to additional binning of select chips that have undergone extra ISV testing for TDP profiles
- Improved performance for double-precision floating point operations (for cases where deep learning benefits from not reducing precision)
- ECC memory to reduce errors and crashes
- Commercial scientific/design software mandates ECC memory
- Physical profiles that allow closer single/dual-slot placement
- Multi-GPU support up to the Ampere generation via 112 GBps NVLink - for shared memory without using the PCIe x16 bus
The RTX A4000 does not include the 112 GBps NVLink that the A4500 and above have. Note: the Ada generation RTX 4000/4500 and above no longer have NVLink and rely on the PCIe bus, which runs at roughly 2 GBps per lane - up to 32 GBps in each direction for PCIe 4.0 x16, doubled to ~63 GBps by PCIe 5.0. That is half the speed of NVLink, and only if we run the RTX cards at full x16, which generally means Xeon boards. On Z790 boards the two cards split into x8/x8 - yielding 16 GBps theoretical and closer to 8 GBps in practice (a bandwidth sketch follows the links below).
- https://www.nvidia.com/en-us/design-visualization/rtx-a4500/
- https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/rtx/nvidia-rtx-a4500-datasheet.pdf
- https://resources.nvidia.com/en-us-design-viz-stories-ep/rtx-4000-ada-datashe?lx=CCKW39&contentType=data-sheet
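To put a number on the PCIe path between two cards, here is a rough sketch that times a 1 GiB device-to-device copy with CUDA events (cudaMemcpyPeer falls back to staging through host memory when peer access is unavailable, which is what this measures on a dual-card Z790 setup):

```cpp
// peer_bw.cu - rough sketch: time a GPU0 -> GPU1 copy to see effective
// PCIe (or NVLink, where present) bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 1UL << 30;  // 1 GiB payload
  void *src, *dst;
  cudaSetDevice(0); cudaMalloc(&src, bytes);
  cudaSetDevice(1); cudaMalloc(&dst, bytes);
  cudaSetDevice(0);
  cudaEvent_t t0, t1;
  cudaEventCreate(&t0); cudaEventCreate(&t1);
  cudaEventRecord(t0);
  cudaMemcpyPeer(dst, 1, src, 0, bytes);  // staged via host if no P2P access
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, t0, t1);
  printf("GPU0 -> GPU1: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));
  return 0;
}
```

A single timed copy like this understates sustained bandwidth slightly; averaging several iterations gives a steadier figure.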
RTX A4000 Workstation Single Slot Placement
RTX A4000 Ampere generation - GA104 under GPU benchmark load = 11 TFlops (roughly 20% of an RTX 4090) - https://github.com/ObrienlabsDev/blog/wiki/High-Performance-Computing#appendix-g-benchmark-quatro-rtx-a4000-workstation-gpu---2021---ampere---ga104
The RTX A4500 includes 112 GBps NVLink
- https://www.nvidia.com/en-us/design-visualization/rtx-a4500/
- https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/rtx/nvidia-rtx-a4500-datasheet.pdf
2023: NVIDIA RTX 4090 MSI Suprim Liquid-X 24GB with 16384 streaming cores, AIO liquid cooled, on an Intel i9-13900KS 8/16 p/e core with 128GB RAM
Running a dual 4090 Ada Lovelace architecture GPU setup requires the highest-end components:
- The power supply must be 1600W for two reasons. Each 4090 card requires 4 separate PCIe power connectors; combine this with 2 CPU power connectors to the motherboard and you require 10 8-pin power outputs - most 1500W supplies only have 9. Second, a power supply is most efficient running at around 50% load, and as you will be drawing up to 11A or close to 1400W you need headroom. Running a 3rd 4090 is infeasible as it would overload the 1800W (15A at 120V) limit of a single line.
- Cost: $10100 ($1000 CPU + $6000 for 2 x GPU + $800 motherboard + $800 192GB RAM + $500 AIO + $600 for 2 x NVMe SSD + $200 7000-series case to fit the two 4090 radiators + $200 OS) - amortize this over 4 years to get roughly $210/month
- Electricity cost: $0.03 to $0.15/kWh; double this for debt retirement... averaging $0.15/kWh, or about $110/month for a max-load dual 4090
- Total cost of the dual 4090 on-prem workstation is about $320/month. Compare this to a dual GCP L4, which has the performance of an RTX 4060 Ti but the memory (24G) of an RTX 4090
- CPU Power: Separate 340 watts peak
- GPU Power: Separate 450 watts peak
- CPU+GPU Power: Combined (CPU under 80% utilization) 720 watts (6A x 120V) peak
Power check using a clamp meter on the magnetic field around the split AC line: W = V x A, so watts = (120V)(12A) = 1400+ watts. This is getting close to the 1800W limit of the 15A fused line, with everything else (like monitors) on a separate 15A line.
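The same headroom math written out (assuming a 120V/15A branch circuit):

$$P = V \times I = 120\,\mathrm{V} \times 12\,\mathrm{A} = 1440\,\mathrm{W}, \qquad \frac{1440\,\mathrm{W}}{120\,\mathrm{V} \times 15\,\mathrm{A}} = 80\%\ \text{of the circuit limit}$$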
Thermals
The i9-13900KS with its stock overclock to 6.0 GHz has been on back order in Canada for 6 weeks so far - the non-binned i9-13900K with AI overclocking on the Z790 board (reaching 6.1 GHz) is OK for now.
Under Intel Extreme Tuning and Furmark Stress Testing
FLIR Heatmap (CPU incoming radiator on top, GPU outgoing radiator on the side)
GPU running at 22% at 450 frames/sec, CPU at 100%
- Cost $7000 ($1000 CPU + $3000 GPU)
- CPU Power: Separate 340 watts peak
- GPU Power: Separate 450 watts peak
- CPU+GPU Power: Combined (CPU under 80% utilization) 720 watts (6A x 120V) peak
- Compute Capability: 8.9 (the current highest level in 2023)
2020: Mobile NVIDIA Quadro RTX-5000 on Intel Xeon W 10855M 128Gb https://thinkstation-specs.com/thinkpad/p17/
- https://www.run.ai/guides/slurm/understanding-slurm-gpu-management
- https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/
- https://store.nvidia.com/en-us/networking/store/?page=1&limit=9&locale=en-us
- download VisualStudioSetup.exe - https://visualstudio.microsoft.com/downloads/
- https://developer.nvidia.com/downloads and https://developer.nvidia.com/nvidia-developer-program
- https://github.com/NVIDIA
- install CUDA after installing Microsoft Visual Studio - to get integration working
- cuda_12.2.0_536.25_windows.exe from https://developer.nvidia.com/cuda-downloads?target_os=Windows&target_arch=x86_64&target_version=Server2022&target_type=exe_local
- GeForce_Experience_v3.27.0.112.exe
- nsight_visual_studio_edition-windows-x86_64-2023.1.1.23100_32709422.msi
Fix: in VS verify that Project | Build Customizations is set to the newer CUDA version when switching machines (12.1 to 12.2)
Error | MSB4019 | The imported project "C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.1.props" was not found. Confirm that the expression in the Import declaration "C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\\BuildCustomizations\CUDA 12.1.props" is correct, and that the file exists on disk. | gpusum | C:\wse_github_vs\RichardAns\CUDA-Programs\Chapter01\gpusum\gpusum_vs2022.vcxproj | 35 |
see https://github.com/obrienlabs/CUDA-Programs/tree/main/Chapter01/gpusum as part of the book from Richard Ansorge of University of Cambridge https://www.cambridge.org/core/books/programming-in-parallel-with-cuda/C43652A69033C25AD6933368CDBE084C
Warning | MSB8003 | The WindowsSDKDir property is not defined. Some build tools may not be found. | gpusum | C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\Microsoft.CppBuild.targets | 513 |
Fix: VS | tools | get tools... | individual components | windows 11 SDK (latest)
#include <cstdio>
#include <cuda_runtime.h>
int main() {
  int ngpu = 0;
  cudaGetDeviceCount(&ngpu);  // count the CUDA devices visible to this process
  printf("Number of GPUs on this PC is %d\n", ngpu);
}
michael@13900b MINGW64 /c/wse_github_vs/RichardAns/CUDA-Programs/Chapter01/gpusum/x64/Release (main)
$ ./gpusum 1000000000 1000
Number of GPUs on this PC is 2
Running the two MSI 4090 Suprim Liquid-X GPUs at max RAM (22/24G x 2) and cores (16384 x 2) consumes 8A or 960 watts above the idle power of 140 watts. Running a fully saturated 13900KS system at 80% CPU and 200% GPU consumes 12.2A or 1460 watts - hence the need for the 10 CPU+PCIe connectors on a 1600 watt PSU, which is well above its efficiency peak of 800W.
The actual performance of the system is 58 x 2 = 116 TFlops even with the lack of NVLink and the default 16x/4x PCIe lanes - for a non-RAM-bound test.
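The 58 TFlops-per-card figure from gpu-burn below can be sanity-checked against the theoretical FP32 peak, roughly SMs x 128 cores x 2 ops (FMA) x clock. A sketch (the 128 cores/SM constant again assumes Ada Lovelace; cudaDeviceProp::clockRate is the reported kHz boost clock, so a 4090 computes to 80+ theoretical TFLOPS vs the ~58 sustained below):

```cpp
// peak_flops.cu - sketch: estimate FP32 peak TFLOPS from device properties.
// Assumes 128 FP32 cores per SM (Ada Lovelace, CC 8.9); an FMA counts as 2 ops.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int n = 0;
  cudaGetDeviceCount(&n);
  for (int d = 0; d < n; ++d) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, d);
    double ghz = p.clockRate / 1e6;  // clockRate is reported in kHz
    double tflops = p.multiProcessorCount * 128 * 2 * ghz / 1e3;
    printf("GPU %d %s: %d SMs @ %.2f GHz ~ %.1f FP32 TFLOPS peak\n",
           d, p.name, p.multiProcessorCount, ghz, tflops);
  }
  return 0;
}
```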
Use the following multi-GPU capable stress library from Ville Timonen instead of Furmark:
https://github.com/wilicc/gpu-burn
PS C:\wse_github> git clone https://github.com/wilicc/gpu-burn
PS C:\wse_github\gpu-burn> docker build -t gpu_burn .
PS C:\wse_github\gpu-burn> docker run --rm --gpus all gpu_burn
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-8c7dc11e-6825-08c1-f05d-5cff6d4ad6db)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-511a0768-717f-2b3b-0133-b49b7d315929)
Using compare file: compare.ptx
Burning for 60 seconds.
11.7% proc'd: 231 (58146 Gflop/s) - 308 (58428 Gflop/s) errors: 0 - 0 temps: 58 C - 59 C
Summary at: Sat Jul 29 03:02:10 UTC 2023
23.3% proc'd: 616 (58592 Gflop/s) - 693 (58540 Gflop/s) errors: 0 - 0 temps: 58 C - 60 C
Summary at: Sat Jul 29 03:02:17 UTC 2023
36.7% proc'd: 1001 (58460 Gflop/s) - 1078 (58479 Gflop/s) errors: 0 - 0 temps: 59 C - 53 C
Summary at: Sat Jul 29 03:02:25 UTC 2023
48.3% proc'd: 1386 (58381 Gflop/s) - 1463 (58600 Gflop/s) errors: 0 - 0 temps: 59 C - 60 C
Summary at: Sat Jul 29 03:02:32 UTC 2023
60.0% proc'd: 1771 (58153 Gflop/s) - 1848 (58534 Gflop/s) errors: 0 - 0 temps: 61 C - 60 C
Summary at: Sat Jul 29 03:02:39 UTC 2023
71.7% proc'd: 2156 (58299 Gflop/s) - 2233 (58364 Gflop/s) errors: 0 - 0 temps: 60 C - 61 C
Summary at: Sat Jul 29 03:02:46 UTC 2023
83.3% proc'd: 2541 (58267 Gflop/s) - 2541 (58366 Gflop/s) errors: 0 - 0 temps: 61 C - 61 C
Summary at: Sat Jul 29 03:02:53 UTC 2023
93.3% proc'd: 2849 (58368 Gflop/s) - 2926 (58350 Gflop/s) errors: 0 - 0 temps: 62 C - 61 C
Summary at: Sat Jul 29 03:02:59 UTC 2023
100.0% proc'd: 3080 (58489 Gflop/s) - 3157 (58364 Gflop/s) errors: 0 - 0 temps: 62 C - 61 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 0
Uninitted cublas
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 1 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 1
Uninitted cublas
done
Tested 2 GPUs:
GPU 0: OK
GPU 1: OK
Second run - with an overclock on GPU 0:
PS C:\wse_github\gpu-burn> docker run --rm --gpus all gpu_burn
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-8c7dc11e-6825-08c1-f05d-5cff6d4ad6db)
GPU 1: NVIDIA GeForce RTX 4090 (UUID: GPU-511a0768-717f-2b3b-0133-b49b7d315929)
Using compare file: compare.ptx
Burning for 60 seconds.
11.7% proc'd: 308 (60443 Gflop/s) - 231 (57900 Gflop/s) errors: 0 - 0 temps: 57 C - 61 C
Summary at: Sun Jul 30 23:27:20 UTC 2023
23.3% proc'd: 693 (60566 Gflop/s) - 616 (57938 Gflop/s) errors: 0 - 0 temps: 58 C - 61 C
Summary at: Sun Jul 30 23:27:27 UTC 2023
35.0% proc'd: 1001 (60602 Gflop/s) - 1001 (57692 Gflop/s) errors: 0 - 0 temps: 58 C - 61 C
Summary at: Sun Jul 30 23:27:34 UTC 2023
46.7% proc'd: 1386 (60485 Gflop/s) - 1386 (57818 Gflop/s) errors: 0 - 0 temps: 59 C - 61 C
Summary at: Sun Jul 30 23:27:41 UTC 2023
58.3% proc'd: 1848 (60455 Gflop/s) - 1694 (57734 Gflop/s) errors: 0 - 0 temps: 60 C - 61 C
Summary at: Sun Jul 30 23:27:48 UTC 2023
68.3% proc'd: 2156 (60404 Gflop/s) - 2002 (57703 Gflop/s) errors: 0 - 0 temps: 61 C - 61 C
Summary at: Sun Jul 30 23:27:54 UTC 2023
80.0% proc'd: 2541 (60479 Gflop/s) - 2387 (57735 Gflop/s) errors: 0 - 0 temps: 61 C - 61 C
Summary at: Sun Jul 30 23:28:01 UTC 2023
91.7% proc'd: 2926 (60374 Gflop/s) - 2772 (57690 Gflop/s) errors: 0 - 0 temps: 60 C - 61 C
Summary at: Sun Jul 30 23:28:08 UTC 2023
100.0% proc'd: 3234 (60315 Gflop/s) - 3157 (57895 Gflop/s) errors: 0 - 0 temps: 62 C - 55 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 0
Uninitted cublas
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 1 with 24563 MB of memory (22646 MB available, using 20381 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 77 iterations
Freed memory for dev 1
Uninitted cublas
done
Tested 2 GPUs:
GPU 0: OK
GPU 1: OK
micha@13900a MINGW64 /c/wse_github/gpu-burn (master)
$ docker run --rm --gpus all gpu_burn
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
GPU 0: NVIDIA RTX A4000 (UUID: GPU-c585959e-3209-a20e-2522-e9420f268bc8)
Using compare file: compare.ptx
Burning for 60 seconds.
16.7% proc'd: 50 (8805 Gflop/s) errors: 0 temps: 63 C
Summary at: Sat Aug 19 19:49:53 UTC 2023
33.3% proc'd: 150 (10864 Gflop/s) errors: 0 temps: 70 C
Summary at: Sat Aug 19 19:50:03 UTC 2023
45.0% proc'd: 250 (10477 Gflop/s) errors: 0 temps: 75 C
Summary at: Sat Aug 19 19:50:10 UTC 2023
58.3% proc'd: 300 (10344 Gflop/s) errors: 0 temps: 77 C
Summary at: Sat Aug 19 19:50:18 UTC 2023
71.7% proc'd: 400 (10158 Gflop/s) errors: 0 temps: 79 C
Summary at: Sat Aug 19 19:50:26 UTC 2023
83.3% proc'd: 450 (10092 Gflop/s) errors: 0 temps: 80 C
Summary at: Sat Aug 19 19:50:33 UTC 2023
98.3% proc'd: 550 (10000 Gflop/s) errors: 0 temps: 81 C
Summary at: Sat Aug 19 19:50:42 UTC 2023
100.0% proc'd: 600 (9948 Gflop/s) errors: 0 temps: 81 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 16375 MB of memory (14897 MB available, using 13407 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 50 iterations
Freed memory for dev 0
Uninitted cublas
done
Tested 1 GPUs:
GPU 0: OK
- Mobile Xeon W-10855M 2.8-4.7GHz version of https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/quadro-rtx-5000-data-sheet-us-nvidia-704120-r4-web.pdf
Benchmark
PS C:\Users\micha> nvidia-smi
Sun Aug 20 20:46:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67 Driver Version: 536.67 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro RTX 5000 WDDM | 00000000:01:00.0 On | N/A |
| N/A 65C P2 107W / 110W | 14998MiB / 16384MiB | 100% Default |
| | | N/A |
micha@LAPTOP-M4VQDR8K MINGW64 /c/_dev/gpu-burn (master)
$ docker run --rm --gpus all gpu_burn
==========
== CUDA ==
==========
CUDA Version 11.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
GPU 0: Quadro RTX 5000 (UUID: GPU-e180b2a0-1a44-e625-8ac1-551fe7b1ee35)
Using compare file: compare.ptx
Burning for 60 seconds.
18.3% proc'd: 51 (5247 Gflop/s) errors: 0 temps: 61 C
Summary at: Mon Aug 21 00:39:19 UTC 2023
31.7% proc'd: 102 (6796 Gflop/s) errors: 0 temps: 62 C
Summary at: Mon Aug 21 00:39:27 UTC 2023
45.0% proc'd: 153 (6737 Gflop/s) errors: 0 temps: 65 C
Summary at: Mon Aug 21 00:39:35 UTC 2023
58.3% proc'd: 153 (6737 Gflop/s) errors: 0 temps: 66 C
Summary at: Mon Aug 21 00:39:43 UTC 2023
73.3% proc'd: 255 (6646 Gflop/s) errors: 0 temps: 67 C
Summary at: Mon Aug 21 00:39:52 UTC 2023
88.3% proc'd: 306 (6616 Gflop/s) errors: 0 temps: 67 C
Summary at: Mon Aug 21 00:40:01 UTC 2023
100.0% proc'd: 306 (6616 Gflop/s) errors: 0 temps: 68 C
Summary at: Mon Aug 21 00:40:08 UTC 2023
100.0% proc'd: 357 (6607 Gflop/s) errors: 0 temps: 68 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 16383 MB of memory (15085 MB available, using 13576 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 51 iterations
Freed memory for dev 0
Uninitted cublas
done
Tested 1 GPUs:
GPU 0: OK
NVIDIA RTX 3500 Ada laptop (P1 Gen 6) gpu-burn run:
GPU 0: NVIDIA RTX 3500 Ada Generation Laptop GPU (UUID: GPU-25326b0f-ad93-c319-7027-b0029d4aee8e)
Using compare file: compare.ptx
Burning for 60 seconds.
13.3% proc'd: 74 (11539 Gflop/s) errors: 0 temps: 77 C
Summary at: Mon Oct 30 23:44:52 UTC 2023
25.0% proc'd: 111 (11484 Gflop/s) errors: 0 temps: 80 C
Summary at: Mon Oct 30 23:44:59 UTC 2023
36.7% proc'd: 222 (11651 Gflop/s) errors: 0 temps: 81 C
Summary at: Mon Oct 30 23:45:06 UTC 2023
48.3% proc'd: 296 (11519 Gflop/s) errors: 0 temps: 81 C
Summary at: Mon Oct 30 23:45:13 UTC 2023
61.7% proc'd: 370 (11379 Gflop/s) errors: 0 temps: 83 C
Summary at: Mon Oct 30 23:45:21 UTC 2023
75.0% proc'd: 444 (12958 Gflop/s) errors: 0 temps: 85 C
Summary at: Mon Oct 30 23:45:29 UTC 2023
86.7% proc'd: 555 (12996 Gflop/s) errors: 0 temps: 85 C
Summary at: Mon Oct 30 23:45:36 UTC 2023
100.0% proc'd: 629 (13163 Gflop/s) errors: 0 temps: 85 C
Summary at: Mon Oct 30 23:45:44 UTC 2023
100.0% proc'd: 666 (13124 Gflop/s) errors: 0 temps: 85 C
Killing processes with SIGTERM (soft kill)
Using compare file: compare.ptx
Burning for 60 seconds.
Initialized device 0 with 12281 MB of memory (11119 MB available, using 10007 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 37 iterations
Freed memory for dev 0
Uninitted cublas
done
Tested 1 GPUs:
GPU 0: OK
micha@p1gen6 MINGW64 /c/wse_github/gpu-burn (master)
$ nvidia-smi
Mon Oct 30 19:48:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.84 Driver Version: 545.84 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX 3500 Ada Gene... WDDM | 00000000:01:00.0 Off | Off |
| N/A 51C P8 7W / 102W | 122MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
C:\_dev>nvidia-smi
Mon Aug 21 20:33:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 536.67 Driver Version: 536.67 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 970 WDDM | 00000000:01:00.0 On | N/A |
| 0% 41C P8 18W / 163W | 546MiB / 4096MiB | 13% Default |
- https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units
- https://askgeek.io/en/gpus/vs/NVIDIA_L4-vs-NVIDIA_NVIDIA-GeForce-RTX-4090
- https://cloud.google.com/nvidia
- https://aws.amazon.com/nvidia/
- Full 400 GBps NVLink
- https://www.coreweave.com/products/hgx-h100
- https://developer.nvidia.com/blog/breaking-mlperf-training-records-with-nvidia-h100-gpus/
- https://nvidianews.nvidia.com/news/nvidia-announces-dgx-gh200-ai-supercomputer
- VMs with H100 GPUs (the A3 series) are available in public preview in select regions (us-central1-a, europe-west4-b/c)
gcloud compute instances create gpu-h100-usc1a --project=cuda-obs --zone=us-central1-a --machine-type=h3-standard-88 --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY,subnet=default --maintenance-policy=TERMINATE --provisioning-model=STANDARD --service-account=816469424864-compute@developer.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform --tags=http-server,https-server --create-disk=auto-delete=yes,boot=yes,device-name=gpu-h100-usc1a,image=projects/debian-cloud/global/images/debian-11-bullseye-v20230814,mode=rw,size=10,type=projects/cuda-obs/zones/us-central1-a/diskTypes/pd-balanced --no-shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring --labels=goog-ec-src=vm_add-gcloud --reservation-affinity=any
- https://nvidianews.nvidia.com/news/nvidia-global-workstation-manufacturers-to-launch-powerful-systems-for-generative-ai-and-llm-development-content-creation-data-science
- as of Ada generation - no more 112GBps NVLink - https://forums.developer.nvidia.com/t/rtx-a6000-ada-no-more-nv-link-even-on-pro-gpus/230874 and https://aecmag.com/workstations/nvidia-rtx-4000-4500-5000-ada-generation-gpus-launch/
- https://www.nvidia.com/en-us/design-visualization/rtx-a4000/
- as of 40xx no more NVLink
- GCP HPC vs On-Prem HPC - LLM from scratch as part of preparing for Generative AI
- alternate: GCP HPC vs On-Prem HPC - for live drone AI Entity Extraction
Start with the following site - read it in its entirety https://jaykmody.com/blog/gpt-from-scratch/ or https://github.com/lm-sys/FastChat
FROM nvidia/cuda:12.2.0-devel-ubi8
CMD nvidia-smi
docker build -t nvidia-smi .
docker run --rm --gpus all nvidia-smi
Mon Oct 9 20:23:27 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.01 Driver Version: 536.67 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro P1000 On | 00000000:01:00.0 On | N/A |
| N/A 45C P8 N/A / 20W | 543MiB / 4096MiB | 4% Default |
The key to GPU passthrough into Docker is the --gpus flag - if you don't set it you will get the following:
$ docker run --rm nvidia-smi
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
/bin/sh: nvidia-smi: command not found
See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html
For the runtime image:
docker run -dit --name cuda-runtime nvidia/cuda:11.8.0-runtime-ubi8
or the devel image:
docker run -dit --name cuda-devel nvidia/cuda:11.8.0-devel-ubi8
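The -runtime image can execute CUDA binaries, while -devel also ships nvcc, so you can compile inside it. A minimal check - hello.cu is a hypothetical file name:

```cpp
// hello.cu - compile inside the cuda-devel container with:
//   nvcc hello.cu -o hello && ./hello
#include <cstdio>

__global__ void hello() {
  printf("Hello from block %d thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
  hello<<<2, 4>>>();        // launch 2 blocks x 4 threads
  cudaDeviceSynchronize();  // flush device-side printf before exit
  return 0;
}
```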
Ubuntu image
FROM ubuntu:22.04
RUN apt-get update && apt-get -y install sudo
RUN useradd -m docker && echo "docker:docker" | chpasswd && adduser docker sudo
RUN apt-get install curl -y
RUN apt-get install gpg -y
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && sudo apt-get update
RUN sudo apt-get install -y nvidia-container-toolkit
USER docker
CMD /bin/bash
sudo nvidia-ctk runtime configure --runtime=docker
- https://course.fast.ai/
- Start with Introduction to Generative AI Learning Path at GCP - https://www.cloudskillsboost.google/journeys/118
- Next - Generative AI for Developers Learning Path at GCP - https://www.cloudskillsboost.google/journeys/183
P.256 of Generative Deep Learning 2nd Edition - David Foster https://towardsdatascience.com/how-to-build-an-llm-from-scratch-8c477768f1f9 https://github.com/allenai/allennlp/discussions/5056 https://support.terra.bio/hc/en-us/community/posts/4787320149915-Requester-Pays-Google-buckets-not-asking-for-project-to-bill
- Google Project Gemini https://blog.google/technology/ai/google-io-2023-keynote-sundar-pichai/#ai-products
C4 = Colossal Clean Crawled Corpus. Download started 20231203:0021 - estimated $100 US for GCS egress. An average of 300 Mbps with peaks of 900 Mbps from the GCP bucket means 800GB x 8 bits = 6400 Gbits at 0.3 Gbps = 6 hours. Actual: 36GB in 26 min = 25 MB/s = 200 Mbps, so roughly 11h (possibly limited by the HDD - go directly to NVMe next time).
- checked 0845 - done
- copy test HDD to HDD (no RAID): 0849 to ~1330 = 4.5h
- HDD to NVMe: 1400-1455 at ~250 MB/s, about 1h
- copy test NVMe to NVMe: 1456 + 4-8 min at 3.4-1.4 GB/s (thermal throttling; the 990 Pro at 50% of its 8 GB/s max)
$93 US for GCS egress
- https://www.neuroscience.cam.ac.uk/directory/profile.php?rea1
- https://www.amazon.com/Programming-Parallel-CUDA-Practical-Guide/dp/1108479537/ref=sr_1_1?qid=1690555406&refinements=p_27%3ARichard+Ansorge&s=books&sr=1-1
- https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus
- https://news.ycombinator.com/item?id=34431056
- https://wandb.ai/tcapelle/apple_m1_pro/reports/Deep-Learning-on-the-M1-Pro-with-Apple-Silicon---VmlldzoxMjQ0NjY3
- https://github.com/cloud-quickstart/nanoGPT
- 65 billion parameter LLM on 2 x 4090 https://github.com/kuleshov-group/llmtune
- https://www.linkedin.com/pulse/google-io-updates-training-65b-model-single-gpu-hf-agent-cherukuri
- https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lecs.pdf
- https://github.com/vectara/hallucination-leaderboard
- 20231122 from AlphaSignal - GPT web crawler - https://github.com/BuilderIO/gpt-crawler
- Google Project Gemini - https://blog.google/technology/ai/google-io-2023-keynote-sundar-pichai/#ai-products is multimodal https://www.techopedia.com/definition/multimodal-ai-multimodal-artificial-intelligence - https://medium.com/@bedros-p/gemini-is-coming-to-makersuite-so-are-stubbs-32248f3924aa - https://developers.googleblog.com/2023/09/make-with-makersuite-part1-introduction.html
- Google C4 dataset (800G) of processed "common crawl" - https://github.com/allenai/allennlp/discussions/5056
- https://github.com/ggerganov/llama.cpp
Add TPUv5 capability
- https://cloud.google.com/blog/products/compute/announcing-cloud-tpu-v5e-and-a3-gpus-in-ga
- https://cloud.google.com/blog/products/containers-kubernetes/whats-new-with-gke-at-google-cloud-next
- TPUv5 added to PyTorch https://github.com/pytorch/xla/releases/tag/v2.1.0
- Multislice for multi pod https://cloud.google.com/blog/products/compute/using-cloud-tpu-multislice-to-scale-ai-workloads
- https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html
- Hopper: H100 (the Grace Hopper GH200 superchip pairs it with a Grace CPU)
- Ada Lovelace: L40, L40s - https://github.com/GoogleCloudPlatform/pubsec-declarative-toolkit/issues/747
- Ampere
- Earlier generations: Tesla, Pascal, Volta