

## Getting info about available GPUs
### Using the nvidia-smi command
~~~
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory --format=csv -l 1 >& gpu.log &
~~~
{:.language-bash}
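
With the logger running in the background you can launch your program, stop the logger, and inspect the log. A minimal sketch (the program name is illustrative, and *%1* assumes the logger is the only background job):

~~~
./vadd_gpu     # run the code you want to monitor (illustrative name)
kill %1        # stop the background nvidia-smi logger
tail gpu.log   # inspect the recorded GPU and memory utilization
~~~
{:.language-bash}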


### Using OpenMP functions
~~~
#include <stdio.h>
#include <omp.h>

/* Reconstructed sketch (the full listing is not shown here): print the
   number of available devices from the host, then report the team/thread
   configuration from inside a target region. */
int main() {
  printf("Number of devices: %d\n", omp_get_num_devices());

#pragma omp target teams
#pragma omp parallel
  if (omp_get_team_num() == 0 && omp_get_thread_num() == 0)
    printf("%d teams x %d threads\n",
           omp_get_num_teams(), omp_get_num_threads());
  return 0;
}
~~~
{:.language-c}
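
Compile the program for GPU offload and run it on a GPU node. Assuming the source file is saved as *gpuinfo.c*, the nvhpc option *-mp=gpu* requests OpenMP offload to the GPU:

~~~
nvc gpuinfo.c -O3 -fopenmp -mp=gpu -o gpuinfo
srun --gpus-per-node=1 --mem=1000 ./gpuinfo
~~~
{:.language-bash}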

With nvhpc you'll get 72 teams x 992 threads on the same GPU.

### Parallelizing with OpenACC

~~~
cp vadd_gpu_template.c vadd_gpu_acc.c
~~~
{:.language-bash}

#### The *acc kernels* Construct
Let’s add a single, simple OpenACC directive before the block containing the *for* loop nest.
~~~
#pragma acc kernels
~~~
{:.language-c}

This pragma tells the compiler to generate parallel accelerator kernels (CUDA kernels in our case).
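
Applied to our vector addition, the region might look like the following sketch (the repetition loop, *NTIMES*, *N*, and the *sum* accumulator are assumptions inferred from the compiler report shown further below):

~~~
#pragma acc kernels
{
  for (int k = 0; k < NTIMES; k++)   /* repeat the operation for stable timings */
    for (int i = 0; i < N; i++) {
      C[i] = A[i] + B[i];            /* the vector addition itself */
      sum += C[i];                   /* accumulate a checksum */
    }
}
~~~
{:.language-c}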

#### Compiling with Nvidia HPC SDK
Load the Nvidia HPC module:
~~~
module load StdEnv/2020 nvhpc/22
~~~
{:.language-bash}

To compile the code with OpenACC we need to add the *-acc* flag:

~~~
nvc vadd_gpu_acc.c -O3 -acc
~~~
{:.language-bash}

Let’s run the program. It runs slowly, even compared to our serial CPU code.

Were the loops parallelized?
- To find out, enable printing of information about the parallelization (add the option *-Minfo=accel*) and recompile:

~~~
nvc vadd_gpu_acc.c -O3 -acc -Minfo=accel
~~~
{:.language-bash}


~~~
33, Generating implicit copyin(A[:100000000],B[:100000000]) [if not already present]
    Generating implicit copy(sum) [if not already present]
    Generating implicit copyout(C[:100000000]) [if not already present]
35, Complex loop carried dependence of B->,A-> prevents parallelization
    Loop carried dependence due to exposed use of C prevents parallelization
    Complex loop carried dependence of C-> prevents parallelization
    Accelerator serial kernel generated
    Generating NVIDIA GPU code
    35, #pragma acc loop seq
    36, #pragma acc loop seq
35, Complex loop carried dependence of C-> prevents parallelization
36, Complex loop carried dependence of B->,A->,C-> prevents parallelization
~~~

#### The *__restrict* Type Qualifier
Our data arrays (A, B, C) are dynamically allocated and accessed via pointers. Pointers may create apparent data dependencies because:

- More than one pointer can access the same chunk of memory in the C language.
- Arithmetic operations on pointers are allowed.

Adding the *__restrict* qualifier to a pointer declaration promises the compiler that only this pointer (or values derived from it) will be used to access the memory it points to. This rules out aliasing between the arrays and enables better optimization.

~~~
float * __restrict A;
float * __restrict B;
float * __restrict C;
~~~
{:.language-c}
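
Since the arrays are dynamically allocated, the qualified pointers are initialized with *malloc* as usual. A small sketch (assuming size *N* and `#include <stdlib.h>`):

~~~
float * __restrict A = (float *)malloc(N * sizeof(float));
float * __restrict B = (float *)malloc(N * sizeof(float));
float * __restrict C = (float *)malloc(N * sizeof(float));
~~~
{:.language-c}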

After this modification, the compiler parallelizes the loop and even automatically determines what data should be copied to the device and what reduction operations are needed.

- It is important to examine optimization reports.
- It is best to add explicit data transfer directives and reduction clauses to ensure your program works correctly when compiled with different compilers, as sketched below.
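
A sketch of what explicit clauses could look like for the inner loop (the clause choices here are illustrative, not the lesson's reference solution):

~~~
#pragma acc parallel loop copyin(A[0:N], B[0:N]) copyout(C[0:N]) reduction(+:sum)
for (int i = 0; i < N; i++) {
  C[i] = A[i] + B[i];   /* explicit copies replace the compiler's implicit ones */
  sum += C[i];          /* the reduction clause makes the accumulation explicit */
}
~~~
{:.language-c}

With these clauses the program no longer relies on any one compiler's implicit analysis, so it behaves the same across compilers.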
