

## Getting info about available GPUs
### Using the nvidia-smi command
~~~
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 >& gpu.log &
~~~
{:.language-bash}
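
With the logger running in the background you can launch your program, stop the logger, and inspect the log. A minimal sketch (the program name is illustrative, and *%1* assumes the logger is the only background job):

~~~
./vadd_gpu     # run the code you want to monitor (illustrative name)
kill %1        # stop the background nvidia-smi logger
tail gpu.log   # inspect the recorded GPU and memory utilization
~~~
{:.language-bash}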


### Using OpenMP functions
~~~
#include <stdio.h>
#include <omp.h>

/* Reconstructed sketch (the full listing is not shown here): print the
   number of available devices from the host, then report the team/thread
   configuration from inside a target region. */
int main() {
  printf("Number of devices: %d\n", omp_get_num_devices());

#pragma omp target teams
#pragma omp parallel
  if (omp_get_team_num() == 0 && omp_get_thread_num() == 0)
    printf("%d teams x %d threads\n",
           omp_get_num_teams(), omp_get_num_threads());
  return 0;
}
~~~
{:.language-c}
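
Compile the program for GPU offload and run it on a GPU node. Assuming the source file is saved as *gpuinfo.c*, the nvhpc option *-mp=gpu* requests OpenMP offload to the GPU:

~~~
nvc gpuinfo.c -O3 -fopenmp -mp=gpu -o gpuinfo
srun --gpus-per-node=1 --mem=1000 ./gpuinfo
~~~
{:.language-bash}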

With nvhpc you'll get 72 teams x 992 threads on the same GPU.

### Parallelizing with OpenACC

~~~
cp vadd_gpu_template.c vadd_gpu_acc.c
~~~
{:.language-bash}

#### The *acc kernels* Construct
Let’s add a single, simple OpenACC directive before the block containing the *for* loop nest.
~~~
#pragma acc kernels
~~~
{:.language-c}

This pragma tells the compiler to generate parallel accelerator kernels (CUDA kernels in our case).
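
Applied to our vector addition, the region might look like the following sketch (the repetition loop, *NTIMES*, *N*, and the *sum* accumulator are assumptions inferred from the compiler report shown further below):

~~~
#pragma acc kernels
{
  for (int k = 0; k < NTIMES; k++)   /* repeat the operation for stable timings */
    for (int i = 0; i < N; i++) {
      C[i] = A[i] + B[i];            /* the vector addition itself */
      sum += C[i];                   /* accumulate a checksum */
    }
}
~~~
{:.language-c}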

#### Compiling with Nvidia HPC SDK
Load the Nvidia HPC module:
~~~
module load StdEnv/2020 nvhpc/22
~~~
{:.language-bash}

To compile the code with OpenACC we need to add the *-acc* flag:

~~~
nvc vadd_gpu_acc.c -O3 -acc
~~~
{:.language-bash}

Let’s run the program. It runs slowly, even compared to our serial CPU code.

Were the loops parallelized?
- To find out, enable printing of information about the parallelization (add the option *-Minfo=accel*) and recompile:

~~~
nvc vadd_gpu_acc.c -O3 -acc -Minfo=accel
~~~
{:.language-bash}


~~~
33, Generating implicit copyin(A[:100000000],B[:100000000]) [if not already present]
    Generating implicit copy(sum) [if not already present]
    Generating implicit copyout(C[:100000000]) [if not already present]
35, Complex loop carried dependence of B->,A-> prevents parallelization
    Loop carried dependence due to exposed use of C prevents parallelization
    Complex loop carried dependence of C-> prevents parallelization
    Accelerator serial kernel generated
    Generating NVIDIA GPU code
    35, #pragma acc loop seq
    36, #pragma acc loop seq
35, Complex loop carried dependence of C-> prevents parallelization
36, Complex loop carried dependence of B->,A->,C-> prevents parallelization
~~~

#### The *__restrict* Type Qualifier
Our data arrays (A, B, C) are dynamically allocated and accessed via pointers. Pointers may create apparent data dependencies because:

- More than one pointer can access the same chunk of memory in the C language.
- Arithmetic operations on pointers are allowed.

Adding the *__restrict* qualifier to a pointer declaration promises the compiler that only this pointer (or values derived from it) will be used to access the memory it points to. This rules out aliasing between the arrays and enables better optimization.

~~~
float * __restrict A;
float * __restrict B;
float * __restrict C;
~~~
{:.language-c}
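
Since the arrays are dynamically allocated, the qualified pointers are initialized with *malloc* as usual. A small sketch (assuming size *N* and `#include <stdlib.h>`):

~~~
float * __restrict A = (float *)malloc(N * sizeof(float));
float * __restrict B = (float *)malloc(N * sizeof(float));
float * __restrict C = (float *)malloc(N * sizeof(float));
~~~
{:.language-c}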

After this modification, the compiler parallelizes the loop and even automatically determines what data should be copied to the device and what reduction operations are needed.

- It is important to examine optimization reports.
- It is best to add explicit data transfer directives and reduction clauses to ensure your program works correctly when compiled with different compilers, as sketched below.
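
A sketch of what explicit clauses could look like for the inner loop (the clause choices here are illustrative, not the lesson's reference solution):

~~~
#pragma acc parallel loop copyin(A[0:N], B[0:N]) copyout(C[0:N]) reduction(+:sum)
for (int i = 0; i < N; i++) {
  C[i] = A[i] + B[i];   /* explicit copies replace the compiler's implicit ones */
  sum += C[i];          /* the reduction clause makes the accumulation explicit */
}
~~~
{:.language-c}

With these clauses the program no longer relies on any one compiler's implicit analysis, so it behaves the same across compilers.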
