diff --git a/_episodes/08-openMP-GPU.md b/_episodes/08-openMP-GPU.md
index 2bccfc6..400ddd6 100644
--- a/_episodes/08-openMP-GPU.md
+++ b/_episodes/08-openMP-GPU.md
@@ -434,14 +434,13 @@ nvc vadd_gpu.c -O3 -fopenmp -mp=multicore -Minfo=mp
 
 ## Getting info about available GPUs
 
-### Nvidia-smi
+### Using the nvidia-smi command
 ~~~
 nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 >& gpu.log &
 ~~~
 {:.language-bash}
-
-### Getting info about available GPUs programmatically
+### Using OpenMP functions
 ~~~
 #include <stdio.h>
 #include <omp.h>
@@ -497,3 +496,78 @@ srun --gpus-per-node=1 --mem=1000 ./gpuinfo
 
 With nvhpc you'll get 72 teams x 992 threads on the same GPU.
 
+### Parallelizing with OpenACC
+
+~~~
+cp vadd_gpu_template.c vadd_gpu_acc.c
+~~~
+{:.language-bash}
+
+#### The *acc kernels* Construct
+Let's add a single, simple OpenACC directive before the block containing the *for* loop nest:
+~~~
+#pragma acc kernels
+~~~
+{:.language-c}
+
+This pragma tells the compiler to generate parallel accelerator kernels (CUDA kernels in our case).
+
+#### Compiling with the Nvidia HPC SDK
+Load the Nvidia HPC module:
+~~~
+module load StdEnv/2020 nvhpc/22
+~~~
+{:.language-bash}
+
+To compile the code with OpenACC, we need to add the *-acc* flag:
+
+~~~
+nvc vadd_gpu_acc.c -O3 -acc
+~~~
+{:.language-bash}
+
+Let's run the program. It runs slowly, even compared to our serial CPU code.
+
+Were the loops parallelized?
+- enable printing of information about the parallelization (add the option *-Minfo=accel*) and recompile:
+
+~~~
+nvc vadd_gpu_acc.c -O3 -acc -Minfo=accel
+~~~
+{:.language-bash}
+
+~~~
+ 33, Generating implicit copyin(A[:100000000],B[:100000000]) [if not already present]
+     Generating implicit copy(sum) [if not already present]
+     Generating implicit copyout(C[:100000000]) [if not already present]
+ 35, Complex loop carried dependence of B->,A-> prevents parallelization
+     Loop carried dependence due to exposed use of C prevents parallelization
+     Complex loop carried dependence of C-> prevents parallelization
+     Accelerator serial kernel generated
+     Generating NVIDIA GPU code
+     35, #pragma acc loop seq
+     36, #pragma acc loop seq
+ 35, Complex loop carried dependence of C-> prevents parallelization
+ 36, Complex loop carried dependence of B->,A->,C-> prevents parallelization
+~~~
+
+#### The *__restrict* Type Qualifier
+Our data arrays (A, B, and C) are dynamically allocated and accessed via pointers. Pointers can create apparent data dependencies because:
+
+- in C, more than one pointer is allowed to access the same chunk of memory;
+- arithmetic operations on pointers are allowed.
+
+Applying the *__restrict* qualifier to a pointer promises the compiler that the memory it points to is accessed only through that pointer, which makes parallelization safe and enables better optimization.
+
+~~~
+float * __restrict A;
+float * __restrict B;
+float * __restrict C;
+~~~
+{:.language-c}
+
+After this modification, the compiler parallelizes the loop and even automatically determines what data should be copied to the device and what reduction operations are needed.
+
+- It is important to examine the optimization reports.
+- It is best to add explicit data transfer directives and reduction clauses to ensure that your program works correctly when compiled with different compilers (see the sketches below).
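+
+To put the pieces together, here is a minimal sketch of what *vadd_gpu_acc.c* could look like after these changes. The array size matches the compiler report above; the initialization values, variable names, and checksum loop are illustrative assumptions, since the template file is not reproduced here.
+
+~~~
+#include <stdio.h>
+#include <stdlib.h>
+
+#define N 100000000  /* matches the array size in the -Minfo report above */
+
+int main() {
+    /* __restrict promises the compiler that A, B, and C do not alias */
+    float * __restrict A = malloc(N * sizeof(float));
+    float * __restrict B = malloc(N * sizeof(float));
+    float * __restrict C = malloc(N * sizeof(float));
+    float sum = 0.0f;
+
+    /* Illustrative initialization; the template's actual values may differ */
+    for (int i = 0; i < N; i++) {
+        A[i] = 1.0f;
+        B[i] = 2.0f;
+    }
+
+    /* The kernels construct lets the compiler decide how to parallelize
+       the enclosed loops and which data to move to and from the GPU */
+    #pragma acc kernels
+    {
+        for (int i = 0; i < N; i++)
+            C[i] = A[i] + B[i];
+        for (int i = 0; i < N; i++)
+            sum += C[i];
+    }
+
+    printf("sum = %f\n", sum);
+    free(A); free(B); free(C);
+    return 0;
+}
+~~~
+{:.language-c}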
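+
+If we prefer not to rely on the compiler's implicit analysis, the same region can be written with explicit data-transfer and reduction clauses. This is a sketch under the same assumptions as above:
+
+~~~
+/* Explicitly state what moves to and from the device */
+#pragma acc data copyin(A[0:N], B[0:N]) copyout(C[0:N])
+{
+    #pragma acc parallel loop
+    for (int i = 0; i < N; i++)
+        C[i] = A[i] + B[i];
+
+    /* The reduction clause makes the summation explicit */
+    #pragma acc parallel loop reduction(+:sum)
+    for (int i = 0; i < N; i++)
+        sum += C[i];
+}
+~~~
+{:.language-c}
+
+Spelling the clauses out this way documents the intended data movement and keeps the behaviour consistent across compilers that make different implicit choices.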