Getting Started
#!/usr/bin/env bash
# Example Slurm script, submitted with sbatch
#SBATCH -J `job`
#SBATCH -N 1
#SBATCH -n 12
#SBATCH -A `account`
#SBATCH -p `partition`                # e.g. long
#SBATCH --mem 50G
#SBATCH --array=0-59%10
#SBATCH --output=slurm_messages/output/`job`%A_%a.out
#SBATCH --error=slurm_messages/error/`job`%A_%a.err
#SBATCH --mail-type=END
#SBATCH --mail-user=`email_address`
set -e
# Clean up the working directory on exit
function cleanup() {
    rm -rf "$WORKDIR"
}
trap cleanup EXIT SIGINT SIGTERM
# Directory setup
STARTDIR=$(pwd)
SCRIPTEDDATASET="$STARTDIR/dataset/`location`.pkl"
IMAGEDATASET_IMAGES="$STARTDIR/dataset/pickled/`image_location`.pkl"
OTHERDATASETDIR="$STARTDIR/dataset/pickled/`location`/*.pkl"
DATAOUTDIR="$STARTDIR/`output_location`"
MYUSER=$(whoami)
LOCALDIR=/local
# THISJOB="job_${SLURM_JOB_NAME}"
# WORKDIR=$LOCALDIR/$MYUSER/$THISJOB
# Create an array of data file names
folder_directory=() # Empty array
for f in $OTHERDATASETDIR # Append data folder names to the empty array
do
folder_directory+=($f)
done
# Directory setup continued
THISFILE="${folder_directory[$SLURM_ARRAY_TASK_ID]}"   # data file for this array task
THISNAME="$(basename "$THISFILE")"                     # file name without the path
THISUSER="$(cut -d'.' -f1 <<<"$THISNAME")"             # file name without the extension
echo "${THISFILE}"
echo "${THISNAME}"
echo "${THISUSER}"
THISJOB="`job_identifier`_${SLURM_ARRAY_TASK_ID}"      # unique per array task
WORKDIR=$LOCALDIR/$MYUSER/$THISJOB
rm -rf "$WORKDIR" && mkdir -p "$WORKDIR" && cd "$WORKDIR"
echo "Working on: ${folder_directory[$SLURM_ARRAY_TASK_ID]}"
cp -a "$SCRIPTEDDATASET" "$WORKDIR" # Copy scripted dataset as well
cp -a "$IMAGEDATASET_IMAGES" "$WORKDIR" # And the scripted image dataset
cp -a "${folder_directory[$SLURM_ARRAY_TASK_ID]}" "$WORKDIR"
echo "Copied data files to work dir."
python3.6 $STARTDIR/`code`.py
echo "Completed"
cp -a $WORKDIR/*.pkl $DATAOUTDIR
echo "Copied output file"
cp -a $WORKDIR/*.pt $DATAOUTDIR
echo "Copied model files"
rm -rfv $WORKDIR
echo "Removed work directory"
Commentary
The code can be broken into three main parts:
#SBATCH -J `job`
#SBATCH -N 1
#SBATCH -n 12
#SBATCH -A `account`
#SBATCH -p `partition`                # e.g. long
#SBATCH --mem 50G
#SBATCH --array=0-59%10
#SBATCH --output=slurm_messages/output/`job`%A_%a.out
#SBATCH --error=slurm_messages/error/`job`%A_%a.err
#SBATCH --mail-type=END
#SBATCH --mail-user=`email_address`
This job uses a Slurm job array to create 60 tasks, with at most 10 active at any moment.
Each task uses 1 node and 12 cores/CPUs on that node, as well as 50 GB of memory per node (to request GPUs, add the --gres=gpu:N flag).
Using a single node is usually recommended unless the running script has some way of passing information across nodes.
Finally, Slurm pipes standard output (usually print()/debug() messages) to the specified .out files,
and records any errors in the .err files. %A is replaced by the job ID, and %a is replaced by the array task ID.
When the job completes, a status update is emailed.
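As an illustration, a GPU variant of the header might look like the following; this is a minimal sketch, and the partition name and GPU count are assumptions that depend on your cluster:

#SBATCH -p `gpu_partition`        # a GPU-enabled partition (name is cluster-specific)
#SBATCH --gres=gpu:1              # request one GPU per node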
set -e
# Clean up the working directory on exit
function cleanup() {
    rm -rf "$WORKDIR"
}
trap cleanup EXIT SIGINT SIGTERM
The above snippet handles garbage collection: set -e stops execution after an error,
and trap cleanup removes the working directory on exit (including exits caused by an error or an interrupt/termination signal).
The last chunk of code deals with the job-specific logic: it specifies the input, output, and home/work directories, builds an array with one entry per data file, and runs the Python script on the entry selected by the array task ID. Each task is given a unique name so that a unique working directory can be created.
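As a sketch of how the array size could be kept in sync with the data (the dataset and script names are placeholders), the --array range can also be set at submission time instead of hard-coding 0-59 in the header:

NFILES=$(ls dataset/pickled/`location`/*.pkl | wc -l)   # number of data files
sbatch --array=0-$((NFILES - 1))%10 `script_name`.sh    # one task per file, 10 at a time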
If the code to be run requires more than just Python (and needs to be installed):
- Commonly used packages are pre-installed on the cluster and can be found with a combination of module avail and module add.
- Python packages can often be installed using pip3 with the --user flag, without superuser rights (see the sketch after this list).
- If a custom piece of software (e.g. a simulator) needs to be installed, it gets a little trickier: you have to compile and build with the correct source paths, and ensure that your ~/.bashrc has the needed linker/package-config/cmake paths. Using an interactive shell session might help with this.
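A minimal sketch of the first two options; the module name/version and package name are assumptions, so check module avail for what your cluster actually provides:

module avail                          # list software installed on the cluster
module add python/3.6                 # load a pre-installed module (name/version are assumptions)
pip3 install --user numpy             # install a Python package into ~/.local, no superuser needed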
Things to Note
The cluster is set up as a home node connected to compute nodes. Do not do your work on the home node! Instead, work in a temporary directory on the compute node, such as /local or /tmp.
Remember to clean up the working directory and move useful data back to the home directory.
Use screen on the remote session, so that your work is not killed if you log out of ssh or the connection drops.
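A minimal screen workflow might look like this; the session name is a placeholder:

screen -S `session_name`       # start a named session on the login node
# ... run interactive work or a monitoring loop here ...
# detach with Ctrl-a d; the session keeps running after you log out
screen -ls                     # list your sessions
screen -r `session_name`       # reattach later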
Make sure you actually need the resources you are allocating: 1) this is difficult to judge for memory, but easier for the number of nodes or CPUs; 2) depending on how the time-share/user priority works on the cluster, requesting fewer resources could mean that you get bumped up the queue.
Depending on how the cluster is set up (with different processor generations), code that works on some processors may not work on others. Slurm allows you to request processors/GPUs of particular types.
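For instance, hardware types are usually exposed as node features or GRES types; the names below are placeholders that depend on what the admins have defined:

#SBATCH --constraint=`cpu_feature`    # request nodes with a specific feature tag
#SBATCH --gres=gpu:`gpu_type`:1       # request one GPU of a specific type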
In addition to watching the progress of a job with watch -n5 squeue -u `user`, sacct and its associated flags allow tracking stats such as CPU time or maximum memory usage.
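For example, a sacct query along these lines (the job ID is a placeholder):

sacct -j `job_id` --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State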
Do not annoy the cluster admin!