Slurm Workload Manager

Tutorial

The clusters use Slurm as their workload manager. Slurm provides a large set of commands to submit jobs, report their states, attach to running programs, or cancel submissions. This section demonstrates their usage by example to get you started.

Compute jobs can be submitted and controlled from a central login node (iffslurm) using ssh:

ssh iffslurm.iff.kfa-juelich.de

iffslurm is reachable from the internet, so there is no need for a VPN or an ssh gateway when connecting from external networks.

If the authenticity of iffslurm cannot be established, you will be asked to accept the ssh host key fingerprint. These are the currently valid fingerprints:

Type     Fingerprint
RSA      SHA256:yLLipu5Ti8z6B9CPtJRNm0G9tBswQav5LigPUFHqrjo
ECDSA    SHA256:ehgBFD/aeTJPcC+0sosJBGOG1ef1d8oDtzxuZ7wabso
ED25519  SHA256:W0XVo86s4Wm2vhFAMwsOQd+M45P63ojFRZJQuOy1Gvk
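To compare the key offered by the server with the table above before accepting it, a small helper like the following could be used. This is only a sketch: the function name is_known_fingerprint is hypothetical, and the commented usage at the end assumes that ssh-keyscan and ssh-keygen from OpenSSH are installed on your machine.

```shell
# Hypothetical helper: succeeds if the given fingerprint matches one of the
# fingerprints published in the table above.
is_known_fingerprint() {
    case "$1" in
        "SHA256:yLLipu5Ti8z6B9CPtJRNm0G9tBswQav5LigPUFHqrjo") return 0 ;;  # RSA
        "SHA256:ehgBFD/aeTJPcC+0sosJBGOG1ef1d8oDtzxuZ7wabso") return 0 ;;  # ECDSA
        "SHA256:W0XVo86s4Wm2vhFAMwsOQd+M45P63ojFRZJQuOy1Gvk") return 0 ;;  # ED25519
        *) return 1 ;;
    esac
}

# Offline check with the ED25519 fingerprint from the table.
if is_known_fingerprint "SHA256:W0XVo86s4Wm2vhFAMwsOQd+M45P63ojFRZJQuOy1Gvk"; then
    fingerprint_status="known"
else
    fingerprint_status="unknown"
fi
echo "fingerprint is ${fingerprint_status}"

# On a machine with network access, the fingerprints offered by the server
# could be listed like this (field 2 of each output line is the fingerprint):
#   ssh-keyscan iffslurm.iff.kfa-juelich.de 2>/dev/null | ssh-keygen -lf -
```

Only accept the host key if the fingerprint shown by ssh matches one of the published values.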

Run the sinfo command to get a list of available queues (which are called partitions in Slurm) and free nodes:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nanofer      up   infinite     10   idle iffcluster[0601-0610]
nanofer      up   infinite      6  alloc iffcluster[0611-0616]
nanofer      up   infinite      1  down* iffcluster0617

By default, only partitions you have access to will be listed (for nanofer members in this example). In this case, 10 nodes are free for job submissions, 6 have been allocated by other nanofer users and one node does not respond.
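If you only need the number of free nodes, the sinfo table can be summarized with standard shell tools. The following sketch uses a hypothetical helper count_idle_nodes and feeds it the sample output from above, so that it can be tried without access to the cluster. It assumes the default sinfo column layout shown in this section.

```shell
# Hypothetical helper: sum the NODES column (field 4) of all rows whose
# STATE column (field 5) is "idle". Expects sinfo's default table on stdin.
count_idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { total += $4 } END { print total + 0 }'
}

# Sample copied from the sinfo output above. On the cluster you would run
# `sinfo | count_idle_nodes` instead.
sample='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nanofer      up   infinite     10   idle iffcluster[0601-0610]
nanofer      up   infinite      6  alloc iffcluster[0611-0616]
nanofer      up   infinite      1  down* iffcluster0617'

idle_nodes=$(printf '%s\n' "$sample" | count_idle_nodes)
echo "idle nodes: ${idle_nodes}"
```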

Next, try to submit a simple MPI job. You can take this modified ring communication example from mpitutorial.com to test multiple nodes and their interconnects:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

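/* Seconds each process waits before passing the token on. */
#define WAIT_BEFORE_SEND 10

int main(void) {
    int world_rank, world_size, token;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Every process except rank 0 first waits for the token from its predecessor. */
    if (world_rank != 0) {
        MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_rank - 1);
    } else {
        /* Rank 0 starts the ring with an initial token value. */
        token = 0;
    }
    /* Keep the job alive long enough to try further Slurm commands. */
    sleep(WAIT_BEFORE_SEND);
    ++token;
    /* Pass the incremented token on; the last rank wraps around to rank 0. */
    MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
    /* Rank 0 closes the ring by receiving the token from the last rank. */
    if (world_rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_size - 1);
    }
    MPI_Finalize();
    return 0;
}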

This simple example forms a ring from the MPI processes on the allocated compute nodes and passes an incremented token to the next ring neighbor until it has visited each process once. The sleep call guarantees that the job runs long enough to try further Slurm commands.

Put the source code into a file ring.c, select the Intel compiler

source compiler-select intel

and compile the source code:

mpiicc -o ring ring.c

Notice:

The compiler-select script can be used to select from different versions of GCC and the Intel compiler. Run it without any argument to get a list of all possible choices. Additionally, you can add

alias compiler-select='source compiler-select'

to your .bashrc to run compiler-select without a prefixed source.

In order to submit the ring program to the cluster, you need to provide a batch script:

cat <<'EOF' > ring_sbatch.sh
#!/bin/sh
#SBATCH -p nanofer --time=5
srun ./ring
EOF

which can then be passed to the sbatch command:

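sbatch -N4 --ntasks-per-node=12 -o "ring.out.%j" ring_sbatch.sh

The sbatch command submits a new job to the Slurm scheduler and requests 4 nodes for the job (-N4). Each node will run 12 MPI processes (--ntasks-per-node=12). The output will be written to a file named ring.out. followed by the job ID (-o "ring.out.%j"). Options for sbatch must be given before the script name; anything after the script name is passed to the script as an argument. The batch file must start with a shebang line (#!/bin/sh). The following lines starting with #SBATCH are optional configuration values for the sbatch command. In this case, the job will be limited to 5 minutes of runtime on the partition named nanofer (name taken from the sinfo output). Your actual program is then run by invoking srun (srun ./ring).

Note that the 48 processes sleep for 10 seconds one after another, so the complete ring needs about 8 minutes. Raise --time or lower WAIT_BEFORE_SEND if the job should finish instead of ending in the TO (timeout) state.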

If you prefer to run commands interactively, you can allocate resources with salloc. You will be dropped into an interactive command line after your requested nodes (e.g. -N4) have been allocated for you. Here you can invoke srun manually to distribute work on the compute nodes.

Notice:

You don't need to use mpirun or mpiexec in Slurm job files. srun implicitly creates an MPI runtime environment for you.

The parameters -N and --ntasks-per-node can also be added to the batch file if you would like to hardcode the number of nodes and processes.
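Following the note above, a batch file with hardcoded node and process counts could look like the sketch below. The file name ring_sbatch_fixed.sh is hypothetical, and the time limit of 10 minutes is an assumption chosen so that 48 processes sleeping 10 seconds each in turn (about 8 minutes) fit into it.

```shell
# Write a batch file that carries all job parameters itself.
# The quoted 'EOF' prevents the shell from expanding anything in the body.
cat > ring_sbatch_fixed.sh <<'EOF'
#!/bin/sh
#SBATCH -p nanofer --time=10
#SBATCH -N 4 --ntasks-per-node=12
#SBATCH -o ring.out.%j
srun ./ring
EOF

# Show the generated file. On the cluster it could then be submitted
# without further options:
#   sbatch ring_sbatch_fixed.sh
cat ring_sbatch_fixed.sh
```

Options given on the sbatch command line take precedence over #SBATCH lines, so the hardcoded values can still be overridden for a single submission.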

After your job has been submitted, you can execute squeue to list your job status:

squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  186   nanofer ring_sba     nano  R       0:03      4 iffcluster[0601,0603-0605]

The ST column describes the current status of your jobs. The most important status codes are:

Code  Description
CA    Cancelled (by user or administrator)
CD    Completed
CG    Completing (some nodes are still running)
F     Failed
PD    Pending (waiting for free resources)
R     Running
TO    Timeout
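In scripts, it can be handy to translate these codes back into readable text. The following sketch defines a hypothetical helper describe_state that covers only the codes listed above. The squeue call in the closing comment uses the -h (no header), -j (job ID) and -o %t (compact state) options.

```shell
# Hypothetical helper: map a compact squeue state code to its description.
describe_state() {
    case "$1" in
        CA) echo "Cancelled (by user or administrator)" ;;
        CD) echo "Completed" ;;
        CG) echo "Completing (some nodes are still running)" ;;
        F)  echo "Failed" ;;
        PD) echo "Pending (waiting for free resources)" ;;
        R)  echo "Running" ;;
        TO) echo "Timeout" ;;
        *)  echo "Unknown state code: $1" ;;
    esac
}

state_text=$(describe_state PD)
echo "PD means: ${state_text}"

# On the cluster, the compact state code of a single job could be queried
# and passed to the helper like this:
#   describe_state "$(squeue -h -j <job_id> -o %t)"
```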

More information about Slurm commands can be found on the official website.

Running interactive sessions

Demo

[Figure: sinteractive demo]

Using sinteractive

Interactive compute node sessions can be started and managed with the sinteractive convenience script. Run

sinteractive

without any arguments to enter a text-based user interface. If no other interactive sessions are running, you will directly be asked for a partition for job submission. Otherwise, you will get an overview of your interactive compute node sessions:

[Figure: sinteractive main menu]

Use the arrow keys or j/k to move the menu selection, <enter> to select an entry or <esc> to exit the program. In the main menu, you can resume a running interactive session, start new sessions or cancel old ones.

New compute node sessions are automatically secured against connection loss by running a terminal multiplexer. By default, tmux is used. If you prefer screen instead, you can set

export TERMINAL_MULTIPLEXER=screen

in one of your shell startup files (for example ~/.bashrc). If screen or tmux is already running before sinteractive is executed, new sessions are attached to the running terminal multiplexer instance.

A running (tmux) session can be detached by pressing <Ctrl-b> d. Your interactive compute job will continue to run and you can resume your ssh session on iffslurm at a later point in time.

tmux supports other useful commands, for example for creating new terminal windows or split views. Visit the official tmux guide for a very detailed introduction, or A Quick and Easy Guide to tmux for a quick introduction to the basic concepts and commands.

Notice:

By using a terminal multiplexer, your interactive jobs are also secured against accidentally closed terminal windows. However, that also implies that running sessions must be exited explicitly by the user, either with the exit command or by running the sinteractive cancel mode.

sinteractive has support for command line options. If you would like to start a new session without an interactive menu, run:

sinteractive -n -p <partition>

Or in order to enter the cancel mode immediately:

sinteractive -k

If your job needs a minimum number of CPU cores or a minimum amount of memory, use the --cpus and --mem switches:

sinteractive --cpus=12 --mem=40G

Passing -h or --help prints a list of all supported command line switches.

Migrating from Torque to Slurm

Slurm has compatibility wrappers for frequently used Torque commands such as qdel, qhold, qrls, qstat and qsub, and it recognizes most #PBS directives in batch scripts. However, the wrappers only support basic functionality, and PBS directives do not cover more advanced Slurm features such as setting node lists. Therefore, it is advisable to translate PBS batch files to Slurm batch files and to get used to the Slurm commands. Many Torque commands and directives have equivalent Slurm counterparts, which are compared in this section.

The information in this section is taken from the official Slurm comparison sheet.

User Commands

Description            Torque                 Slurm
Job submission         qsub [script_file]     sbatch [script_file]
Job deletion           qdel [job_id]          scancel [job_id]
Job status (by job)    qstat [job_id]         squeue -j [job_id]
Job status (by user)   qstat -u [user_name]   squeue -u [user_name]
Job hold               qhold [job_id]         scontrol hold [job_id]
Job release            qrls [job_id]          scontrol release [job_id]
Queue list             qstat -Q               squeue
Node list              pbsnodes -l            sinfo -N OR scontrol show nodes
Cluster status         qstat -a               sinfo
GUI                    xpbsmon                sview

Batch scripts

Description           Torque                                            Slurm
Script directive      #PBS                                              #SBATCH
Queue                 -q [queue]                                        -p [queue]
Node Count            -l nodes=[count]                                  -N [min[-max]]
CPU Count             -l ppn=[count] OR -l mppwidth=[PE_count]          -n [count]
Wall Clock Limit      -l walltime=[hh:mm:ss]                            -t [min] OR -t [days-hh:mm:ss]
Standard Output File  -o [file_name]                                    -o [file_name]
Standard Error File   -e [file_name]                                    -e [file_name]
Combine stdout/err    -j oe (both to stdout) OR -j eo (both to stderr)  (use -o without -e)
Copy Environment      -V                                                --export=[ALL / NONE / variables]
Event Notification    -m abe                                            --mail-type=[events]
Email Address         -M [address]                                      --mail-user=[address]
Job Name              -N [name]                                         --job-name=[name]
Job Restart           -r [y/n]                                          --requeue OR --no-requeue
Working Directory                                                       --chdir=[dir_name] (--workdir in older Slurm versions)
Resource Sharing      -l naccesspolicy=singlejob                        --exclusive OR --oversubscribe (--share in older Slurm versions)
Memory Size           -l mem=[MB]                                       --mem=[mem][M/G/T] OR --mem-per-cpu=[mem][M/G/T]
Account to charge     -W group_list=[account]                           --account=[account]
Tasks Per Node        -l mppnppn [PEs_per_node]                         --ntasks-per-node=[count]
CPUs Per Task                                                           --cpus-per-task=[count]
Job Dependency        -d [job_id]                                       --dependency=[state:job_id]
Job Project                                                             --wckey=[name]
Job host preference                                                     --nodelist=[nodes] AND/OR --exclude=[nodes]
Quality Of Service    -l qos=[name]                                     --qos=[name]
Job Arrays            -t [array_spec]                                   --array=[array_spec]
Generic Resources     -l other=[resource_spec]                          --gres=[resource_spec]
Licenses                                                                --licenses=[license_spec]
Begin Time            -A "YYYY-MM-DD HH:MM:SS"                          --begin=YYYY-MM-DD[THH:MM[:SS]]
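As an illustration of the directive mapping, the following sketch writes a small Torque batch file and a hand-translated Slurm counterpart for the ring example from the tutorial. Both file names (ring_torque.sh, ring_slurm.sh) and the chosen resource values are assumptions made for this example. They are not taken from an existing job on the cluster.

```shell
# A typical Torque batch file (hypothetical example).
cat > ring_torque.sh <<'EOF'
#!/bin/sh
#PBS -q nanofer
#PBS -l nodes=4:ppn=12
#PBS -l walltime=00:10:00
#PBS -N ring
#PBS -j oe
cd "$PBS_O_WORKDIR"
mpirun ./ring
EOF

# The same job expressed with Slurm directives, translated line by line.
# No "cd" is needed because Slurm starts the job in the submit directory,
# and srun replaces mpirun.
cat > ring_slurm.sh <<'EOF'
#!/bin/sh
#SBATCH -p nanofer
#SBATCH -N 4 --ntasks-per-node=12
#SBATCH -t 00:10:00
#SBATCH --job-name=ring
#SBATCH -o ring.out.%j
srun ./ring
EOF

# Count the directives in both files as a quick sanity check.
pbs_directives=$(grep -c '^#PBS' ring_torque.sh)
slurm_directives=$(grep -c '^#SBATCH' ring_slurm.sh)
echo "Torque directives: ${pbs_directives}, Slurm directives: ${slurm_directives}"
```

Since -o is used without -e in the Slurm file, stdout and stderr end up in the same file, which corresponds to -j oe in Torque.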

Environment variables

Description       Torque          Slurm
Job ID            $PBS_JOBID      $SLURM_JOBID
Submit Directory  $PBS_O_WORKDIR  $SLURM_SUBMIT_DIR
Submit Host       $PBS_O_HOST     $SLURM_SUBMIT_HOST
Node List         $PBS_NODEFILE   $SLURM_JOB_NODELIST
Job Array Index   $PBS_ARRAYID    $SLURM_ARRAY_TASK_ID
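The Slurm variables can be used inside a batch file like any other shell variable. The sketch below prints some of them and falls back to placeholder values when it is run outside of a Slurm job. The fallbacks and the variable names job_id, submit_dir, submit_host and node_list are assumptions made for this example.

```shell
#!/bin/sh
#SBATCH -p nanofer --time=5

# Fall back to placeholder values so the script also runs outside of Slurm.
job_id="${SLURM_JOBID:-none}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
submit_host="${SLURM_SUBMIT_HOST:-$(uname -n)}"
node_list="${SLURM_JOB_NODELIST:-$(uname -n)}"

echo "job ${job_id} submitted from ${submit_host}:${submit_dir}"
echo "running on: ${node_list}"

# Unlike $PBS_NODEFILE, which names a file with one host per line,
# $SLURM_JOB_NODELIST holds a compact expression such as
# iffcluster[0601,0603-0605]. Inside a job it could be expanded with:
#   scontrol show hostnames "$SLURM_JOB_NODELIST"
```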

FAQ

How can I run multiple jobs on one node?

(Currently) Slurm assigns each job to an individual node, so it is not possible to send multiple jobs to one node with repeated sbatch calls. However, you can configure your batch script to spawn several Slurm job steps by using multiple srun commands. Since srun blocks by default and waits for one job step to finish, you must ensure that srun commands are executed in parallel. There are two recommended ways for parallel srun execution:

  1. Simple: Shell background tasks / Job control:

    Since sbatch files are normal shell scripts, you can utilize the built-in job control feature. Append & to each srun command to create a parallel background task and add a wait at the end of your batch file to wait for all background tasks to complete.

    Example with two sleep commands:

    #!/bin/sh
    #SBATCH -p th1 --nodes=1 --ntasks=2
    srun --nodes=1 --ntasks=1 --exclusive bash -c 'sleep 10 && date' &
    srun --nodes=1 --ntasks=1 --exclusive bash -c 'sleep 10 && date' &
    wait
    

    The example requests one node with at least two CPU cores. Both srun commands create a sub allocation with one processor and execute sleep 10 followed by date. The trailing & creates a background task. The --exclusive option guarantees that each sub allocation will get distinct CPU resources.

  2. Advanced: Execution of many job steps with GNU parallel:

    GNU parallel is a utility to spawn parallel instances of a given command. The advantage of GNU parallel is the built-in job queue. It keeps track of submitted and completed Slurm job steps and can be used to resubmit the same sbatch file if one of the job steps exited with a failure. In this case, GNU parallel will only resubmit the Slurm job steps which were not completed successfully.

    Rewritten job control example:

    #!/bin/sh
    #SBATCH -p th1 --nodes=1 --ntasks=2
    
    # Define how GNU parallel should be executed.
    # -N 1: Number of arguments to pass to each job
    # -j ${SLURM_NTASKS}: Number of tasks GNU parallel is allowed to run simultaneously.
    #                     ${SLURM_NTASKS} contains the number of reserved cores on the node.
    # --joblog parallel_joblog: Write a GNU parallel job log file.
    # --resume: Use an existing job log file to resume the session. Useful if a job must be
    #           resubmitted. Only jobs that have not finished will be run.
    parallel_cmd="parallel -N 1 -j ${SLURM_NTASKS} --joblog parallel_joblog --resume"
    
    # Define how srun should be executed.
    # --nodes=1: Use one node for each srun call
    # --ntasks=1: Use one core for each srun call
    # --exclusive: Ensure that multiple srun calls will use distinct cpu cores
    srun_cmd="srun --nodes=1 --ntasks=1 --exclusive"
    
    # Define the command which will be executed in parallel.
    # {1} will be substituted with one given argument from the argument list.
    cmd="bash -c 'sleep {1} && date'"
    
    # Call ${cmd} with ${srun_cmd} to use allocated resources and execute both
    # with ${parallel_cmd} to avoid blocking srun calls.
    # `10 11` will be passed as separate arguments to each ${cmd} call.
    # In this example, sleep will be called two times and will sleep 10 and 11
    # seconds in parallel.
    ${parallel_cmd} "${srun_cmd} ${cmd}" ::: 10 11
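A plain wait, as used in the first variant, does not tell you whether any of the background job steps failed. The following sketch extends the job control approach by waiting for each background task individually and counting failures. It is an illustration under assumptions: the launcher variable and the fallback to direct execution outside of a Slurm job are not part of the original examples and were added so that the pattern can be tried without a cluster.

```shell
#!/bin/sh
#SBATCH -p th1 --nodes=1 --ntasks=2

# Inside a Slurm job, start each task as its own job step. Elsewhere, run
# the command directly (assumption made for testing without a cluster).
if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    launcher="srun --nodes=1 --ntasks=1 --exclusive"
else
    launcher=""
fi

pids=""
for seconds in 1 2; do
    # ${launcher} is intentionally unquoted so that it splits into words.
    ${launcher} sh -c "sleep ${seconds} && date" &
    pids="${pids} $!"
done

# Wait for every background task individually and count the failures.
failed=0
for pid in ${pids}; do
    wait "${pid}" || failed=$((failed + 1))
done
echo "failed job steps: ${failed}"
```

The batch file could end with exit "${failed}" if the job should be reported as failed whenever one of its job steps did not succeed.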