Slurm Management

General

So, as a researcher at WPI, I've gotten the chance to use a High-Performance Computing cluster for stress-testing my thesis code (or more commonly, deep learning!).

However, compared to the researchers of this paper, I haven't been lucky enough to use 27,600 GPUs or the Summit supercomputer :/ In any case, I'm hoping to document my SLURM scripts that don't cause the cluster admin to have a meltdown. :D

Getting Started

              #!/usr/bin/env bash
              # Example Slurm script, submitted with sbatch

              #SBATCH -J `job`
              #SBATCH -N 1
              #SBATCH -n 12
              #SBATCH -A `account`
              #SBATCH -p `partition`
              #SBATCH --mem 50G
              #SBATCH --array=0-59%10
              #SBATCH --output=slurm_messages/output/`job`%A_%a.out
              #SBATCH --error=slurm_messages/error/`job`%A_%a.err
              #SBATCH --mail-type=END
              #SBATCH --mail-user=`email_address`

              set -e

              # Clean up the working directory upon exit
              function cleanup() {
                rm -rf "$WORKDIR"
              }

              trap cleanup EXIT SIGINT SIGTERM

              # Directory setup
              STARTDIR=$(pwd)
              SCRIPTEDDATASET="$STARTDIR/dataset/`location`.pkl"
              IMAGEDATASET_IMAGES="$STARTDIR/dataset/pickled/`image_location`.pkl"
              OTHERDATASETDIR="$STARTDIR/dataset/pickled/`location`/*.pkl"
              DATAOUTDIR="$STARTDIR/`output_location`"
              MYUSER=$(whoami)
              LOCALDIR=/local
              # THISJOB="job_${SLURM_JOB_NAME}"
              # WORKDIR=$LOCALDIR/$MYUSER/$THISJOB

              # Create an array of data file names
              folder_directory=() # Empty array
              for f in $OTHERDATASETDIR  # Append each data file path to the array
              do
                folder_directory+=($f)
              done

              # Directory setup continued
              THISFILE="${folder_directory[$SLURM_ARRAY_TASK_ID]}" # Input file for this array task
              THISNAME="$(cut -d'/' -f8 <<<"$THISFILE")" # File name (the field number depends on the path depth)
              THISUSER="$(cut -d'.' -f1 <<<"$THISNAME")" # File name without the .pkl extension
              echo ${THISFILE}
              echo ${THISNAME}
              echo ${THISUSER}
              THISJOB="`job_identifier`_${SLURM_ARRAY_TASK_ID}" # Unique per array task so work directories do not collide
              WORKDIR=$LOCALDIR/$MYUSER/$THISJOB

              rm -rf "$WORKDIR" && mkdir -p "$WORKDIR" && cd "$WORKDIR"

              echo "Working on : "${folder_directory[$SLURM_ARRAY_TASK_ID]}
              cp -a $SCRIPTEDDATASET $WORKDIR # Copy scripted dataset as well
              cp -a $IMAGE_DATASET $WORKDIR # And scripted image dataset

              cp -a ${folder_directory[$SLURM_ARRAY_TASK_ID]} $WORKDIR
              echo "Copied data file to work dir."
              python3.6 $STARTDIR/`code`.py
              echo "Completed"
              cp -a $WORKDIR/*.pkl $DATAOUTDIR
              echo "Copied output file"
              cp -a $WORKDIR/*.pt $DATAOUTDIR
              echo "Copied model files"
              rm -rfv $WORKDIR
              echo "Removed work directory"
            



Commentary

The code can be broken into three main parts:


                #SBATCH -J `job`
                #SBATCH -N 1
                #SBATCH -n 12
                #SBATCH -A `account`
                #SBATCH -p `partition`
                #SBATCH --mem 50G
                #SBATCH --array=0-59%10
                #SBATCH --output=slurm_messages/output/`job`%A_%a.out
                #SBATCH --error=slurm_messages/error/`job`%A_%a.err
                #SBATCH --mail-type=END
                #SBATCH --mail-user=`email_address`
            

This job uses a Slurm job array to create 60 tasks, with at most 10 active at any moment. Each task uses 1 node and 12 cores/CPUs on that node (to request GPUs, use the --gres=gpu:N flag), as well as 50 GB of memory per node. Sticking to 1 node is usually recommended unless the running script has some way of passing information across nodes. Finally, Slurm pipes standard output (usually print()/debug()) to the specified .out files and any errors to the .err files; %A is replaced by the (array master) job ID and %a by the array task ID. When the job is complete, a status update is emailed.
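For example, a single-GPU variant of the header might look like the sketch below. The partition name is a placeholder, and whether a GPU type can be appended (e.g. --gres=gpu:v100:1) depends on how the cluster is configured:

                #SBATCH -J `job`
                #SBATCH -N 1
                #SBATCH -n 12
                #SBATCH -p `gpu_partition`
                #SBATCH --gres=gpu:1
                #SBATCH --mem 50G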


              set -e

              #Cleanup functions upon exit
              function cleanup() {
                rm -rf $WORKDIR
              }

              trap cleanup EXIT SIGINT SIGTERM
            

The above snippet handles garbage collection: set -e stops execution after an error, and the trap runs cleanup, which removes the working directory, whenever the script exits (including exits caused by an error).
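The same pattern in a more generic, stand-alone form (using mktemp to create the scratch directory is just one option; the script above builds its path by hand under /local instead):

              #!/usr/bin/env bash
              set -e

              # Create a private scratch directory and guarantee it is removed on exit,
              # whether the script finishes normally, fails, or is cancelled
              WORKDIR=$(mktemp -d)
              trap 'rm -rf "$WORKDIR"' EXIT SIGINT SIGTERM

              cd "$WORKDIR"
              # ... actual work goes here ...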

The last chunk of code deals with the job-specific logic: specifying the input and output data directories, the home/work directories, and building an array of N input files so that each array task picks one of them (by indexing with $SLURM_ARRAY_TASK_ID) and runs the Python script on it. Each task is given a unique name so that a unique working directory can be created.
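For reference, submitting the script from the head node might look like this (the file name is a placeholder):

              # Submit the job array; sbatch replies with "Submitted batch job <jobid>"
              sbatch `job`.sh
              # The 60 tasks then appear in the queue as <jobid>_0 ... <jobid>_59,
              # with at most 10 of them running at any one time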

If running the code requires more than plain Python (i.e. extra software needs to be installed):

  • Commonly used packages are pre-installed on the cluster and can be found and loaded with a combination of module avail and module add (see the sketch after this list).

  • Python packages can often be installed using pip3 with the --user flag, without superuser rights.

  • If a custom piece of software (e.g. a simulator) needs to be installed, it gets a little trickier, as you have to compile and build with the correct source paths (and also ensure that your ~/.bashrc has the needed linker library/pkg-config/cmake paths). Using an interactive shell session might help with this.
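A rough sketch of the first two routes (the module and package names are only examples; check what module avail actually lists on your cluster):

              # See which software modules exist, then load one (names are examples)
              module avail
              module add python/3.6

              # Install a Python package into your user site-packages, no sudo needed
              pip3 install --user torch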



Things to Note

  • The cluster is set up with a home node connected to compute nodes. Do not work in the home node! Rather, work in a temporary directory such as /local or /tmp on the compute node.

  • Remember to clean up the working directory and move useful data back to the home directory.

  • Use screen (or tmux) for interactive remote sessions, so that your work is not killed if the SSH connection drops or you log out.

  • Make sure you actually need the resources you are allocating: 1) this is difficult to judge for memory, but easier for the number of nodes or CPUs; 2) depending on how time-sharing/user priority works on the cluster, requesting fewer resources can mean you get bumped up the queue.

  • Depending on how the cluster is set up (e.g. with different processor generations), code that works on some processors may not work on others. Slurm allows you to request processors/GPUs of particular types (e.g. via the --constraint flag).

  • In addition to watching the progress of a job with watch -n5 squeue -u `user`, sacct and its associated flags let you track stats such as CPUTime or maximum memory usage (MaxRSS); a short sketch of both follows this list.

  • Do not annoy the cluster admin!
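A minimal monitoring sketch (the user name and job ID are placeholders):

              # Refresh the queue view for your own jobs every 5 seconds
              watch -n5 squeue -u `user`

              # Pull accounting stats for a running or finished job
              sacct -j `jobid` --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State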