Merge pull request #40 from SeisSol/zihua/testing
remote Jupyter Lab on Frontera
Thomas-Ulrich authored May 22, 2024
2 parents 6dcb7c6 + b1ba9b0 commit e31b2b2
Showing 12 changed files with 335 additions and 79 deletions.
69 changes: 63 additions & 6 deletions frontera.md
@@ -12,9 +12,18 @@ Then execute:

```
module load tacc-apptainer
-singularity pull -F docker://seissol/training:latest
-singularity build -f my-training.sif singularity.def
-singularity run my-training.sif
+apptainer pull -F docker://seissol/training:hps-2024-frontera
+apptainer build -f my-training.sif singularity.def
+apptainer run my-training.sif
```

You can also use the container image that `apptainer pull` generates automatically from the Docker image:

```
module load tacc-apptainer
apptainer pull -F docker://seissol/training:hps-2024-frontera
apptainer run training_hps-2024-frontera.sif
ln -s /absolute/path/to/training_hps-2024-frontera.sif ~/my-training.sif
```

You can abort the Jupyter Lab with Ctrl-C and confirm with `y`.
@@ -27,10 +36,22 @@ To run the TPV13 scenario, you should:

```
cd seissol-training/tpv13
-mpirun singularity run ~/my-training.sif gmsh -3 tpv13_training.geo
-mpirun singularity run ~/my-training.sif pumgen -s msh2 tpv13_training.msh
-OMP_NUM_THREADS=28 mpirun -n 2 singularity run ~/my-training.sif seissol parameters.par
+mpirun apptainer run ~/my-training.sif gmsh -3 tpv13_training.geo
+mpirun apptainer run ~/my-training.sif pumgen -s msh2 tpv13_training.msh
+OMP_NUM_THREADS=26 mpirun -n 2 apptainer run ~/my-training.sif seissol parameters.par
```

To run the Northridge scenario, you should:

```
cd seissol-training/northridge
mpirun apptainer run ~/my-training.sif pumgen -s msh2 mesh_northridge.msh
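# rconv converts the SRF kinematic rupture model into SeisSol's NRF format; -m sets the map projection (proj.4 string)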
apptainer run ~/my-training.sif rconv -i northridge_resampled.srf -o northridge_resampled.nrf -x visualization.xdmf -m "+proj=tmerc +datum=WGS84 +k=0.9996 +lon_0=-118.5150 +lat_0=34.3440 +axis=enu"
OMP_NUM_THREADS=26 mpirun -n 2 apptainer run ~/my-training.sif seissol parameters.par
```

You can change `seissol` to `SeisSol_Release_dhsw_4_viscoelastic2` if you want to account for attenuation (https://seissol.readthedocs.io/en/latest/attenuation.html) instead of assuming a fully elastic rheology.
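
For example, the SeisSol command of the Northridge run above would then become (a sketch — only the executable name changes, the MPI and OpenMP settings stay the same):

```
OMP_NUM_THREADS=26 mpirun -n 2 apptainer run ~/my-training.sif SeisSol_Release_dhsw_4_viscoelastic2 parameters.par
```

Note that the parameter file may also need the attenuation-related settings described in the linked documentation.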

In the section `Interacting with Frontera from local machine` below, we show how to interact with Frontera from your local machine through Jupyter Lab.

## Expected runtimes

@@ -46,6 +67,42 @@
Sulawesi LSW | 6 min
Sulawesi RS | 6 min
TPV13 | 12 s

## Interacting with Frontera from local machine
We present a workflow for running Jupyter Lab remotely on Frontera while interacting with it from your local machine.

You can take the following steps:

Step 1: In line 75 of `job.jupyter`, change `SHARED_PATH="/your/path/to/container/"` to the path where your Apptainer container is stored.
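
For example, if the container sits in a shared work directory, the edited line would look like this (the path below is purely illustrative — substitute the directory that actually holds your `.sif` file):

```
SHARED_PATH="/work2/01234/username/containers/"   # hypothetical path; replace with your own container directory
```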

Step 2: Run
```
sbatch -A <your_project> job.jupyter
```

Step 3: Check the job status with
```
squeue -u $USER
```

Step 4: Once the job status changes from `PD` to `R`, you will find the job output in the generated file `jupyter.out`.

Step 5: Check the last few lines with
```
tail -f jupyter.out
```
Wait a few seconds until `jupyter.out` contains something like:
```
TACC: got login node jupyter port 60320
TACC: created reverse ports on Frontera logins
TACC: Your jupyter notebook server is now running at https://frontera.tacc.utexas.edu:60320/?token=2e0fade1f8b1ce00b303a7e97dd962c5cd10c17f03a245e8c761ca7e1d5e1597
```
(Then press Ctrl-C to stop monitoring the contents of `jupyter.out`.)

Step 6: Paste the link into your local browser; you then have access to the Frontera environment from your local machine.
```
https://frontera.tacc.utexas.edu:60320/?token=2e0fade1f8b1ce00b303a7e97dd962c5cd10c17f03a245e8c761ca7e1d5e1597
```

## Visualization

You can directly visualize the results on Frontera:
238 changes: 238 additions & 0 deletions job.jupyter
@@ -0,0 +1,238 @@
#!/bin/bash
#
#-----------------------------------------------------------------------------
# This script was generated automatically by the TACC Analytic Portal (TAP)
#
# This TAP job script is designed to create a jupyter notebook session on
# remote nodes through the SLURM batch system. Once the job
# is scheduled, check the output of your job (which by default is
# stored in your home directory in a file named jupyter.out)
# and it will tell you the port number that has been setup for you so
# that you can attach via a separate web browser to any remote login node
#
# Note: you can fine tune the SLURM submission variables below as
# needed. Typical items to change are the runtime limit, location of
# the job output, and the allocation project to submit against (it is
# commented out for now, but is required if you have multiple
# allocations).
#
#-----------------------------------------------------------------------------
#
#SBATCH -J tap_jupyter # Job name
#SBATCH -o jupyter.out # Name of stdout output file (%j expands to jobId)
#SBATCH -p development # Queue name
#SBATCH -N 1 # Total number of nodes requested
#SBATCH -n 2 # Total number of mpi tasks requested
#SBATCH -t 02:00:00 # Run time (hh:mm:ss)
#
#
#--------------------------------------------------------------------------

#--------------------------------------------------------------------------
# ---- You normally should not need to edit anything below this point -----
#--------------------------------------------------------------------------
#
# last update: pnav 20221013

echo "TACC: job ${SLURM_JOB_ID} execution at: $(date)"

TAP_FUNCTIONS="/share/doc/slurm/tap_functions"
if [ -f ${TAP_FUNCTIONS} ]; then
. ${TAP_FUNCTIONS}
else
echo "TACC:"
echo "TACC: ERROR - could not find TAP functions file: ${TAP_FUNCTIONS}"
echo "TACC: ERROR - Please submit a consulting ticket at the TACC user portal"
echo "TACC: ERROR - https://portal.tacc.utexas.edu/tacc-consulting/-/consult/tickets/create"
echo "TACC:"
echo "TACC: job $SLURM_JOB_ID execution finished at: `date`"
exit 1
fi

# our node name
NODE_HOSTNAME=$(hostname -s)
echo "TACC: running on node ${NODE_HOSTNAME}"

echo "TACC: unloading xalt"
module unload xalt

echo "MNMN: install python libraries"
module load python3/3.9.2
export PATH="$PATH:$HOME/.local/bin"  # pip --user installs executables into ~/.local/bin
# pip install --user obspy cartopy

# urllib compatibility
pip uninstall -y urllib3
pip install --user 'urllib3<2.0'
pip install vtk pyvista

echo "MNMN: load appatainer module"
module load tacc-apptainer

echo "MNMN: prepare the custom image"

#
SHARED_PATH="/your/path/to/container/"
SIF_NAME="training_latest.sif"

if [ ! -f $SIF_NAME ]; then
if [ ! -f $SHARED_PATH/$SIF_NAME ]; then
# load the image if no image exists in the shared directory
echo "MNMN: pull the appatainer image"
apptainer pull -F docker://seissol/training:latest
else
# create symlink to the shared directory
echo "MNMN: create symlink to the shared directory"
ln -s $SHARED_PATH/$SIF_NAME $SIF_NAME
fi
fi

# use jupyter-lab if it exists, otherwise jupyter-notebook
JUPYTER_BIN=$(which jupyter-lab 2> /dev/null)

if [ -z "${JUPYTER_BIN}" ]; then
JUPYTER_BIN=$(which jupyter-notebook 2> /dev/null)
if [ -z "${JUPYTER_BIN}" ]; then
echo "TACC: ERROR - could not find jupyter install"
echo "TACC: loaded modules below"
module list
echo "TACC: job ${SLURM_JOB_ID} execution finished at: $(date)"
exit 1
else
JUPYTER_SERVER_APP="NotebookApp"
fi
else
JUPYTER_SERVER_VERSION=$(${JUPYTER_BIN} --version)
if [ ${JUPYTER_SERVER_VERSION%%.*} -lt 3 ]; then
JUPYTER_SERVER_APP="NotebookApp"
else
JUPYTER_SERVER_APP="ServerApp"
fi
fi
echo "TACC: using jupyter binary ${JUPYTER_BIN}"


if $(echo ${JUPYTER_BIN} | grep -qve '^/opt') ; then
echo "TACC: WARNING - non-system python detected. Script may not behave as expected"
fi

NB_SERVERDIR=${HOME}/.jupyter
IP_CONFIG=${NB_SERVERDIR}/jupyter_notebook_config.py

# make .jupyter dir for logs
mkdir -p ${NB_SERVERDIR}

mkdir -p ${HOME}/.tap # this should exist at this point, but just in case...
TAP_LOCKFILE=${HOME}/.tap/.${SLURM_JOB_ID}.lock
TAP_CERTFILE=${HOME}/.tap/.${SLURM_JOB_ID}

# bail if we cannot create a secure session
if [ ! -f ${TAP_CERTFILE} ]; then
echo "TACC: ERROR - could not find TLS cert for secure session"
echo "TACC: job ${SLURM_JOB_ID} execution finished at: $(date)"
exit 1
fi

# bail if we cannot create a token for the session
TAP_TOKEN=$(tap_get_token)
if [ -z "${TAP_TOKEN}" ]; then
echo "TACC: ERROR - could not generate token for notebook"
echo "TACC: job ${SLURM_JOB_ID} execution finished at: $(date)"
exit 1
fi
echo "TACC: using token ${TAP_TOKEN}"

# create the tap jupyter config if needed
TAP_JUPYTER_CONFIG="${HOME}/.tap/jupyter_config.py"
if [ ${JUPYTER_SERVER_APP} == "NotebookApp" ]; then
cat <<- EOF > ${TAP_JUPYTER_CONFIG}
# Configuration file for TAP jupyter-notebook
import ssl
c = get_config()
c.IPKernelApp.pylab = "inline" # if you want plotting support always
c.NotebookApp.ip = "0.0.0.0"
c.NotebookApp.port = 5902
c.NotebookApp.open_browser = False
c.NotebookApp.allow_origin = u"*"
c.NotebookApp.ssl_options={"ssl_version": ssl.PROTOCOL_TLSv1_2}
c.NotebookApp.mathjax_url = u"https://cdn.mathjax.org/mathjax/latest/MathJax.js"
EOF
else
cat <<- EOF > ${TAP_JUPYTER_CONFIG}
# Configuration file for TAP jupyter-notebook
import ssl
c = get_config()
c.IPKernelApp.pylab = "inline" # if you want plotting support always
c.ServerApp.ip = "0.0.0.0"
c.ServerApp.port = 5902
c.ServerApp.open_browser = False
c.ServerApp.allow_origin = u"*"
c.ServerApp.ssl_options={"ssl_version": ssl.PROTOCOL_TLSv1_2}
c.NotebookApp.mathjax_url = u"https://cdn.mathjax.org/mathjax/latest/MathJax.js"
EOF
fi

# launch jupyter
JUPYTER_LOGFILE=${NB_SERVERDIR}/${NODE_HOSTNAME}.log
JUPYTER_ARGS="--certfile=$(cat ${TAP_CERTFILE}) --config=${TAP_JUPYTER_CONFIG} --${JUPYTER_SERVER_APP}.token=${TAP_TOKEN}"
echo "TACC: using jupyter command: ${JUPYTER_BIN} ${JUPYTER_ARGS}"
nohup ${JUPYTER_BIN} ${JUPYTER_ARGS} &> ${JUPYTER_LOGFILE} && rm ${TAP_LOCKFILE} &
#sleep 120 && rm -f $(cat ${TAP_CERTFILE}) && rm -f ${TAP_CERTFILE} &
JUPYTER_PID=$!
LOCAL_PORT=5902

LOGIN_PORT=$(tap_get_port)
echo "TACC: got login node jupyter port ${LOGIN_PORT}"

JUPYTER_URL="https://frontera.tacc.utexas.edu:${LOGIN_PORT}/?token=${TAP_TOKEN}"

# verify jupyter is up. if not, give one more try, then bail
if ! $(ps -fu ${USER} | grep ${JUPYTER_BIN} | grep -qv grep) ; then
# sometimes jupyter has a bad day. give it another chance to be awesome.
echo "TACC: first jupyter launch failed. Retrying..."
nohup ${JUPYTER_BIN} ${JUPYTER_ARGS} &> ${JUPYTER_LOGFILE} && rm ${TAP_LOCKFILE} &
fi
if ! $(ps -fu ${USER} | grep ${JUPYTER_BIN} | grep -qv grep) ; then
# jupyter will not be working today. sadness.
echo "TACC: ERROR - jupyter failed to launch"
echo "TACC: ERROR - this is often due to an issue in your python or conda environment"
echo "TACC: ERROR - jupyter logfile contents:"
cat ${JUPYTER_LOGFILE}
echo "TACC: job ${SLURM_JOB_ID} execution finished at: $(date)"
exit 1
fi

# create reverse tunnel port to login nodes. Make one tunnel for each login so the user can just
# connect to frontera.tacc.utexas.edu
NUM_LOGINS=4
for i in $(seq ${NUM_LOGINS}); do
ssh -q -f -g -N -R ${LOGIN_PORT}:${NODE_HOSTNAME}:${LOCAL_PORT} login${i}
done
if [ $(ps -fu ${USER} | grep ssh | grep login | grep -vc grep) != ${NUM_LOGINS} ]; then
# jupyter will not be working today. sadness.
echo "TACC: ERROR - ssh tunnels failed to launch"
echo "TACC: ERROR - this is often due to an issue with your ssh keys"
echo "TACC: ERROR - undo any recent mods in ${HOME}/.ssh"
echo "TACC: ERROR - or submit a TACC consulting ticket with this error"
echo "TACC: job ${SLURM_JOB_ID} execution finished at: $(date)"
exit 1
fi
echo "TACC: created reverse ports on Frontera logins"

echo "TACC: Your jupyter notebook server is now running at ${JUPYTER_URL}"

# spin on lock until file is removed
TAP_CONNECTION=${HOME}/.tap/.${SLURM_JOB_ID}.url
echo ${JUPYTER_URL} > ${TAP_CONNECTION}
echo $(date) > ${TAP_LOCKFILE}
while [ -f ${TAP_LOCKFILE} ]; do
sleep 1
done

# job is done!
echo "TACC: release port returned $(tap_release_port ${LOGIN_PORT})"

# wait a brief moment so jupyter can clean up after itself
sleep 1

echo "TACC: job ${SLURM_JOB_ID} execution finished at: $(date)"
4 changes: 3 additions & 1 deletion kaikoura/Kaikoura.ipynb
@@ -78,7 +78,9 @@
"metadata": {},
"outputs": [],
"source": [
"!OMP_NUM_THREADS=4 mpirun -n 1 SeisSol_Release_dhsw_4_elastic parametersLSW.par"
"!OMP_NUM_THREADS=4 mpirun -n 1 SeisSol_Release_dhsw_4_elastic parametersLSW.par\n",
"# on Frontera with apptainer, replace with:\n",
"# !SEISSOL_COMMTHREAD=0 OMP_NUM_THREADS=28 mpirun -n 2 apptainer run {\"~/my-training.sif\"} SeisSol_Release_dhsw_4_elastic parametersLSW.par"
]
},
{