Skip to content

Commit

Permalink
Added python script changes necessary to run DeepCam on CPU.
Browse files Browse the repository at this point in the history
  • Loading branch information
Michael Bareford authored and Michael Bareford committed Dec 3, 2024
1 parent 84e2628 commit e4fe159
Showing 1 changed file with 57 additions and 1 deletion.
58 changes: 57 additions & 1 deletion docs/user-guide/machine-learning.md
Original file line number Diff line number Diff line change
Expand Up @@ -122,7 +122,7 @@ could be replaced by `module -q load pytorch/1.13.1-gpu` if you are not running

In the script above, we specify four tasks per node, one for each GPU. These tasks are evenly spaced across the node so as to maximise the communications
bandwidth between the host and the GPU devices. Note, PyTorch is not using Cray MPICH for inter-task communications, which is instead being handled by the
ROCm Collective Communications Library (RCCL), hence the `--wireup_method nccl-slurm` option (`nccl-slurm` works as an alias for `rccl-slurm in this context).
ROCm Collective Communications Library (RCCL), hence the `--wireup_method nccl-slurm` option (`nccl-slurm` works as an alias for `rccl-slurm` in this context).

The above job should achieve convergence — an Intersection over Union (IoU) of 0.82 — after 35 epochs or so. Runtime should be around 20-30 minutes.

Expand Down Expand Up @@ -240,6 +240,62 @@ python -m pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.
deactivate
```

In order to run a DeepCam training job, you must first clone the [MLCommons HPC github repo](https://github.com/mlcommons/hpc/tree/main).

```
mkdir ${HOME/home/work}/tests
cd ${HOME/home/work}/tests
git clone https://github.com/mlcommons/hpc.git mlperf-hpc
cd ./mlperf-hpc/deepcam/src/deepCam
```


Next, we need to edit some parts of the DeepCam Python source such that DeepCam is properly integrated with Cray MPICH.

The `init` function defined in `./utils/comm.py` contains an `if` statement that initialises the DeepCam job according
to the selected communications method. You will need to edit the `mpi` branch of this `if` statement as shown below.

```python
...

def init(method, batchnorm_group_size=1):

if method == "nccl-openmpi":

...

elif method == "mpi":
rank = int(os.getenv("SLURM_PROCID"))
world_size = int(os.getenv("SLURM_NTASKS"))
dist.init_process_group(backend = "mpi",
rank = rank,
world_size = world_size)

else:
raise NotImplementedError()

...
```

Second, as we're not running on a GPU platform, we'll need to comment out a statement that calls a GPU-based
synchronisation method, see the `synchronize` method within `./utils/bnstats.py`.

```python
...

def synchronize(self:

if dist.is_initialized():
# sync the device before
#torch.cuda.synchronize()

with torch.no_grad():
...
```


DeepCam can now be run on the CPU nodes using a submission script like the one below.

```bash
Expand Down

0 comments on commit e4fe159

Please sign in to comment.