Use of MIOPEN_USER_DB_PATH for training speedup in sequential jobs settings #3322

Open
formiel opened this issue Oct 18, 2024 · 14 comments
@formiel

formiel commented Oct 18, 2024

Hello,

I would like to ask whether we can use MIOPEN_USER_DB_PATH to accelerate model training in a sequential job setting, where each job starts after the previous one has finished. The documentation states:

System FindDb can be cached into memory and may dramatically increase performance.

In my experiments, I observed a gradual speedup during the first run of model training as follows:

  • Step 0 - 5k: took 545 minutes
  • Step 5k - 10k: took 373 minutes
  • Step 10k - 15k: took 300 minutes
  • Step 15k - 20k: took 162 minutes
  • Step 20k - 25k: took 137 minutes
  • Step 25k - 30k: took 127 minutes
  • Step 30k - 35k: took 110 minutes.

However, I need to set up the jobs sequentially due to time limits imposed by SLURM. During the second run, the model went through the same phases as in the first run, with steps 35k - 40k taking 545 minutes and so on.

After reading a previous comment and the documentation, I wonder if setting the MIOPEN_USER_DB_PATH specific to each job (based on the experiment name) and SLURM process ID as below could help leverage the optimized convolutional kernels found in previous runs to make training faster:

export MIOPEN_USER_DB_PATH="$SCRATCH/tmp/miopen-cache/${CONFIG}_ngpus${NGPUs}/$SLURM_PROCID"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
mkdir -p "${MIOPEN_USER_DB_PATH}"

If not, is there any way to sustain the performance observed in the previous run, so that the first 5k steps of the next job take 110 minutes? The same training on A100 takes 60 minutes per 5k steps, whereas the average on MI250x as shown above is around 250 minutes, which is more than 4 times longer.
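For reference, the average quoted above can be reproduced from the seven per-block timings with plain shell arithmetic. This is only a sketch of the calculation; the variable names are illustrative, and integer division truncates the results.

```shell
# Minutes per 5k-step block, copied from the list in this post.
total=0
n=0
for t in 545 373 300 162 137 127 110; do
  total=$((total + t))
  n=$((n + 1))
done

avg=$((total / n))               # integer division, truncates
a100=60                          # minutes per 5k steps on A100, as reported
ratio_x10=$((avg * 10 / a100))   # ratio scaled by 10 to keep one decimal

echo "total=${total} min over ${n} blocks, average=${avg} min"
echo "slowdown vs A100: $((ratio_x10 / 10)).$((ratio_x10 % 10))x"
```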

Many thanks in advance for your response!

@formiel
Author

formiel commented Oct 18, 2024

FYI, my experiments run on 32 MI250x GPUs, and my run.slurm is as follows:

#!/bin/bash


module purge && module load cpe/23.12 craype-accel-amd-gfx90a craype-x86-trento PrgEnv-gnu amd-mixed/6.0.0 cray-python/3.10.10 aws-ofi-rccl && module list

source ~/env/py310_rocm60_torch241_main/bin/activate

export PYTHONPATH=$PYTHONPATH:$HOME/code/fairspeech_py310_rocm60_torch241_main
export FAIRSEQ=$HOME/code/fairspeech_py310_rocm60_torch241_main

export ROCBLAS_INTERNAL_FP16_ALT_IMPL=1
export MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL=1

export MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
export MIOPEN_CUSTOM_CACHE_DIR="${MIOPEN_USER_DB_PATH}"
mkdir -p "${MIOPEN_USER_DB_PATH}"

export MIOPEN_DEBUG_AMD_ROCM_PRECOMPILED_BINARIES=0 # disable use of precompiled binaries for certain operations, including the Winograd convolution algorithm
export MIOPEN_DEBUG_CONV_FFT=0 # disable FFT convolution algorithm
export MIOPEN_DEBUG_CONV_DIRECT=0 # disable Direct convolution algorithm
# export MIOPEN_DEBUG_GCN_ASM_KERNELS=0 # disable hand-tuned asm. kernels for Direct convolution algorithm. Fall-back to kernels written in high-level language.

# echo the commands being run
set -x

NCPU_PER_PROC=$((NCPUS_PER_TASK / NGPUS_PER_NODE))
export OMP_NUM_THREADS=${NCPU_PER_PROC}
# export PYTHONPATH=$HOME/code/fairspeech # comment out to use load_torchxx.sh in Adastra

echo "Total ${NGPUs} GPUs requested, ${NGPUS_PER_NODE} GPUs per node"

MASTER_PORT=$(shuf -i 20000-65000 -n 1)

# export LOGLEVEL="INFO"
# eval "TMPDIR=$SCRATCH/tmp OMP_NUM_THREADS=${NCPUS_PER_TASK} srun ${COMMAND}"
eval 'srun \
--cpu-bind=none \
--mem-bind=none \
--label \
-- torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="${NGPUS_PER_NODE}" \
--rdzv-id="${SLURM_JOBID}" \
--rdzv-backend=c10d \
--rdzv-endpoint="$(scontrol show hostname ${SLURM_NODELIST} | head -n 1):${MASTER_PORT}" \
--max-restarts="1" \
${COMMAND}'

@ppanchad-amd

Hi @formiel. An internal ticket has been created to assist with your issue. Thanks!

@huanrwan-amd
Contributor

Hi @formiel, are you able to run your run.slurm, and can you post the results?
According to https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/perfdb.html, you can try the System PerfDb (which is "A system-wide storage that holds pre-run values for the most applicable configurations."). Note that "User PerfDb always takes precedence over System PerfDb."

@formiel
Author

formiel commented Nov 5, 2024

Hello @huanrwan-amd,

Thank you very much for your reply!! I encountered an error when setting MIOPEN_USER_DB_PATH to a local disk space in order to reuse the optimized kernels for subsequent runs. My colleague @etiennemlb suggested a solution: running on a single GPU to save the cached values to local disk space, then using these saved outputs for a full training run on multiple GPUs. However, we're uncertain whether the subsequent job will only read from this directory or potentially overwrite it.
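One way to sidestep the read-versus-overwrite question is to never point MIOpen at the saved directory directly. The sketch below copies a saved "seed" cache into a fresh working directory and exports the two variables used in run.slurm, so anything written during the job lands in the copy. The function name `stage_miopen_cache` and both directory arguments are hypothetical, and whether a cache produced on a single GPU is actually picked up by a multi-GPU run remains untested here.

```shell
# Copy a saved seed cache into a per-job working directory and point
# MIOpen at the copy, leaving the seed untouched.
#   $1: seed directory (hypothetical, e.g. produced by a warm-up run)
#   $2: working directory for this job
stage_miopen_cache() {
  mkdir -p "$2" || return 1
  if [ -d "$1" ]; then
    # "dir/." copies the directory contents, including dotfiles.
    cp -R "$1/." "$2/" || return 1
  fi
  export MIOPEN_USER_DB_PATH="$2"
  export MIOPEN_CUSTOM_CACHE_DIR="$2"
}
```

A call such as `stage_miopen_cache "$SCRATCH/miopen-seed" "/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"` would go before the `srun` line. If the working directory is node-local, the copy presumably has to happen on every node, which this sketch does not handle.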

Due to the time and resource constraints, I’m unable to try this solution at the moment, but I’ll test it when possible and share the results with you later.

@huanrwan-amd
Contributor

Hi @formiel, thank you for your response. I will close the ticket for now.

@huanrwan-amd closed this as not planned on Nov 5, 2024
@netw0rkf10w

@huanrwan-amd Why close the issue? Isn't it a big issue if kernel caches cannot be used across sequential jobs?

@huanrwan-amd
Contributor

Hi @netw0rkf10w, this ticket is to address a specific issue for the originator. If you want to know more about the kernel cache database, please refer to https://rocm.docs.amd.com/projects/MIOpen/en/latest/conceptual/cache.html . Thanks.

@etiennemlb

I agree, you can't just close an issue like that. There is a significant performance issue, and that is not fine. I would guess that AMD wants its platform to perform well on MI250X. If @formiel can't use MI250X for now, you could at least ask for a reproducer and work on it on your side.

Just to be clear, @huanrwan-amd, this is a discussion on the behavior of the cache db, and the doc you gave is sparse.

As @formiel said:

running on a single GPU to save the cached values to local disk space, then using these saved outputs for a full training run on multiple GPUs. However, we're uncertain whether the subsequent job will only read from this directory or potentially overwrite it.

Is that sound or wishful thinking?

@huanrwan-amd reopened this on Nov 5, 2024
@huanrwan-amd
Contributor

huanrwan-amd commented Nov 5, 2024

Hi @etiennemlb,

Thanks for your comments. I’ve reopened the ticket as requested.

  1. Could you please provide more information on the software, specifically the OS version and ROCm version? The latest ROCm release is 6.2.2 (see ROCm Releases).
  2. Please enable logs as described in Enable Logs.

As mentioned in the documentation, the cache database has two types: system PerfDb (.kdb) and user PerfDb (.ukdb), located under $HOME/.cache/miopen/ or another location set by the user. When a kernel is needed, MIOpen first checks if it exists in the database. If it does, the built kernel is reused. If not, MIOpen builds the kernel at runtime using hiprtc and adds it to the database. In this context, you can reuse those database files.
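The check-then-build behavior described here can be illustrated with a toy stand-in. This is not MIOpen code: `get_or_build` is a hypothetical helper that uses one plain file per key, only to show why a cache directory that changes between jobs starts cold each time.

```shell
# Toy get-or-build cache: one file per key inside a cache directory.
#   $1: cache directory
#   $2: key (stands in for a kernel identifier)
get_or_build() {
  if [ -f "$1/$2" ]; then
    echo "hit"                      # entry exists, reuse it
  else
    mkdir -p "$1"
    echo "built $2" > "$1/$2"       # "build" the entry and store it
    echo "miss"
  fi
}
```

Calling it twice with the same directory and key reports a miss and then a hit, while the same key in a new directory is a miss again. That mirrors the pattern in the timings reported at the top of the thread.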

@etiennemlb

Thanks, @huanrwan-amd.

The ROCm version is 6.0.0. But @formiel tested using pytorch+rocm6.1 and pytorch+rocm6.2. AFAIK, the problem was always present.
The OS is RHEL 8.9.

When a kernel is needed, MIOpen first checks if it exists in the database. If it does, the built kernel is reused. If not, MIOpen builds the kernel at runtime using hiprtc and adds it to the database. In this context, you can reuse those database files.

@formiel, from that quote, I'd say our guess could be right.
@huanrwan-amd, is there a way to hard fail/stop the program if a kernel is not found in the database? That way, if you could ensure that all the kernels are "precompiled" for a given workload, then for production runs you could guarantee that kernels are never compiled.
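Short of a built-in hard-fail switch, a run can at least be checked after the fact. The sketch below fingerprints the cache directory before and after a run. `cache_fingerprint` is a hypothetical helper built only on `find`, `cksum` and `sort`. A changed fingerprint only shows that something under the directory was written, which may or may not mean a kernel was compiled.

```shell
# Print a single checksum line summarising every file under a directory.
#   $1: cache directory (e.g. the value of MIOPEN_USER_DB_PATH)
cache_fingerprint() {
  # Checksum each file, sort for a stable order, checksum the list.
  find "$1" -type f -exec cksum {} + | sort | cksum
}

# Intended use around a training run:
#   before=$(cache_fingerprint "$MIOPEN_USER_DB_PATH")
#   ... run the job ...
#   after=$(cache_fingerprint "$MIOPEN_USER_DB_PATH")
#   [ "$before" = "$after" ] || echo "cache directory changed during the run"
```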

@huanrwan-amd, is the "enable logs" you mention based on the following?

export MIOPEN_ENABLE_LOGGING=1
export MIOPEN_ENABLE_LOGGING_CMD=1
export MIOPEN_LOG_LEVEL=6

@huanrwan-amd
Contributor

huanrwan-amd commented Nov 5, 2024

Hi @etiennemlb,

is there a way to hard fail/stop the program if a kernel is not found in the database?

I would suggest updating to ROCm 6.2.2 and recording the logs first.
Yes, you can get more details by adding MIOPEN_ENABLE_SQLITE for the db:

export MIOPEN_ENABLE_SQLITE=1
export MIOPEN_LOG_LEVEL=7

@huanrwan-amd
Contributor

Hi @etiennemlb and @formiel ,
For the cache db path as in the run.slurm mentioned by @formiel :
MIOPEN_USER_DB_PATH="/tmp/${USER}-miopen-cache-${SLURM_JOB_ID}"
Please note that each SLURM_JOB_ID will generate a *.ukdb file.
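The effect of embedding the job ID in the path can be seen by expanding both patterns from this thread side by side. The helper names, job IDs and experiment name below are made up for illustration. The second pattern follows the one proposed in the first post, with `/tmp` assumed as a fallback when `SCRATCH` is unset.

```shell
# Pattern from run.slurm: the job ID is part of the path.
#   $1: job ID
per_job_path() {
  echo "/tmp/${USER:-user}-miopen-cache-${1}"
}

# Pattern from the first post: experiment name, GPU count and rank only.
#   $1: experiment name, $2: number of GPUs, $3: rank
stable_path() {
  echo "${SCRATCH:-/tmp}/tmp/miopen-cache/${1}_ngpus${2}/${3}"
}

echo "job 1001: $(per_job_path 1001)"
echo "job 1002: $(per_job_path 1002)"
echo "any job:  $(stable_path myexp 32 0)"
```

With the first pattern, each job in a sequence gets a different directory and so starts with no user database. With the second, the directory is the same for every job of the same experiment.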

@huanrwan-amd
Contributor

Hi @formiel and @etiennemlb , any update from your side? Thanks.

@etiennemlb

As soon as I find the time to deep dive into this issue again, I'll publish my results.

In the meantime, you should be able to reproduce the issue using the script given in this issue: #3310 (comment)
