Slurm Workload Manager

Tutorial

The clusters use Slurm as their workload manager. Slurm provides a large set of commands to allocate jobs, report their states, attach to running programs or cancel submissions. This section demonstrates their usage by example to get you started.

Compute jobs can be submitted and controlled from a central login node (iffslurm) using ssh:

ssh iffslurm.iff.kfa-juelich.de

iffslurm is reachable from the internet, so there is no need to use VPN or an ssh gateway when connecting from external networks.

If the authenticity of iffslurm cannot be established, you will be asked to accept the ssh host key fingerprint. These are the currently valid fingerprints:

Type	Fingerprint
RSA	`SHA256:yLLipu5Ti8z6B9CPtJRNm0G9tBswQav5LigPUFHqrjo`
ECDSA	`SHA256:ehgBFD/aeTJPcC+0sosJBGOG1ef1d8oDtzxuZ7wabso`
ED25519	`SHA256:W0XVo86s4Wm2vhFAMwsOQd+M45P63ojFRZJQuOy1Gvk`
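
Comparing fingerprints by eye is error-prone. As a minimal sketch, the helper below checks a fingerprint string against the three values from the table above. The function name is_valid_fingerprint is only illustrative and not part of the cluster setup.

```shell
# Return success if the given fingerprint matches one of the published
# iffslurm host key fingerprints (copied from the table above).
is_valid_fingerprint() {
    case "$1" in
        "SHA256:yLLipu5Ti8z6B9CPtJRNm0G9tBswQav5LigPUFHqrjo") return 0 ;;  # RSA
        "SHA256:ehgBFD/aeTJPcC+0sosJBGOG1ef1d8oDtzxuZ7wabso") return 0 ;;  # ECDSA
        "SHA256:W0XVo86s4Wm2vhFAMwsOQd+M45P63ojFRZJQuOy1Gvk") return 0 ;;  # ED25519
        *) return 1 ;;
    esac
}

# Usage: paste the fingerprint exactly as ssh shows it in the host key prompt.
#   is_valid_fingerprint "SHA256:..." && echo "fingerprint is valid"
```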

Run the sinfo command to get a list of available queues (which are called partitions in Slurm) and free nodes:

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nanofer      up   infinite     10   idle iffcluster[0601-0610]
nanofer      up   infinite      4  alloc iffcluster[0611-0614]
nanofer      up   infinite      2    mix iffcluster[0615-0616]
nanofer      up   infinite      1  down* iffcluster0617

By default, only partitions you have access to will be listed (for nanofer members in this example). In this case, 10 nodes are free for job submissions, 6 are in use by other nanofer users (alloc + mix state) and one node does not respond.

Slurm distinguishes between alloc nodes and mix nodes since nodes can be used by multiple jobs and / or users at once. Nodes are in alloc state if all of their available resources (like CPU cores or RAM) are allocated for Slurm jobs and the node cannot accept any further job submission. A node in mix state has some free resources left and can accept further jobs if their resource requirements can be met.
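
The node counts per state can also be summed up automatically. The following is a minimal sketch that assumes the default sinfo column layout shown above (NODES in the 4th column, STATE in the 5th). The function name count_nodes_by_state is illustrative, and only the trailing * marker is stripped from state names.

```shell
# Sum the NODES column (4th field) per STATE (5th field) of sinfo's default
# output read from stdin. A trailing "*" (node does not respond) is stripped.
count_nodes_by_state() {
    awk 'NR > 1 { sub(/\*$/, "", $5); nodes[$5] += $4 }
         END { for (state in nodes) print state, nodes[state] }' | sort
}

# Usage on the login node:
#   sinfo | count_nodes_by_state
```

For the sinfo output above, this prints alloc 4, down 1, idle 10 and mix 2, one state per line.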

Next, try to submit a simple MPI job. You can take this modified ring communication example from mpitutorial.com to test multiple nodes and their interconnects:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define WAIT_BEFORE_SEND 5

int main(void) {
    int world_rank, world_size, token;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

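    if (world_rank != 0) {
        /* All ranks except 0 first wait for the token from their left neighbor. */
        MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_rank - 1);
    } else {
        /* Rank 0 creates the token and starts the ring. */
        token = 0;
    }
    /* Keep the job alive long enough to try further Slurm commands. */
    sleep(WAIT_BEFORE_SEND);
    ++token;
    /* Pass the incremented token to the right neighbor; the last rank wraps around to rank 0. */
    MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
    if (world_rank == 0) {
        /* Rank 0 receives the token from the last rank, which closes the ring. */
        MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_size - 1);
    }
    MPI_Finalize();
    return 0;
}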

This simple example forms a ring from the MPI processes on the allocated compute nodes and passes an incremented token to the next ring neighbor until it has visited each process once. The sleep call guarantees that the job will run long enough to try further Slurm commands.
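
To check a run afterwards, it helps to know what the program should print. Process r (for r > 0) receives token r from process r - 1, and process 0 finally receives token N from process N - 1, where N is the total number of processes. Each process sleeps only after it has received the token, so the sleeps add up to roughly N × 5 seconds of runtime. The sketch below prints the expected lines; the function name ring_expected_output is illustrative, and the line order in a real output file may differ because of output buffering.

```shell
# Print the lines the ring program is expected to produce for N processes.
# Argument: <total number of MPI processes>
ring_expected_output() {
    n=$1
    rank=1
    while [ "$rank" -lt "$n" ]; do
        printf 'Process %d received token %d from process %d\n' \
            "$rank" "$rank" "$((rank - 1))"
        rank=$((rank + 1))
    done
    printf 'Process 0 received token %d from process %d\n' "$n" "$((n - 1))"
}

# Usage:
#   ring_expected_output 3
```

For 3 processes, the last printed line is "Process 0 received token 3 from process 2".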

Put the source code into a file ring.c, select the Intel compiler

source compiler-select intel-oneapi

and compile the source code:

mpiicx -o ring ring.c

Notice:

The compiler-select script can be used to select from different versions of GCC and the Intel compiler. Run without any argument to get a list of all possible choices. Additionally you can add
alias compiler-select='source compiler-select'
to your .bashrc to run compiler-select without a prefixed source.

See How can I use another MPI implementation? for a list of possible compiler / MPI combinations.

In order to submit the ring program to the cluster, you need to provide a batch script:

cat <<'EOF' > ring_sbatch.sh
#!/bin/sh
#SBATCH -p nanofer
#SBATCH -N4
#SBATCH --ntasks-per-node=12
#SBATCH -o "ring.out.%j"
#SBATCH --exclusive
srun ./ring
EOF

which can then be passed to the sbatch command:

sbatch ring_sbatch.sh

The sbatch command submits a new job to the Slurm scheduler in the form of a batch script. The batch file must start with a shebang line (#!/bin/sh). The following lines starting with #SBATCH are optional configuration values for the sbatch command. In this case, the job will run on the partition named nanofer (name taken from the sinfo output) and requests 4 nodes for the whole job (-N4). Each node will run 12 MPI processes (--ntasks-per-node=12). The output will be written to a file named ring.out.<jobid>, where %j is replaced by the job ID. The --exclusive option disables node sharing, so other jobs cannot use your nodes even if some resources remain unused (for example if a node has more than the 12 requested CPU cores). Your actual program is run by invoking srun (srun ./ring). srun automatically detects that ring is an MPI executable and provides the MPI environment needed for execution. You don't need to call mpirun or mpiexec.
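
On success, sbatch prints a line of the form "Submitted batch job <id>". As a small sketch, the helper below (the name ring_output_file is illustrative) reads that message and derives the output file name that results from the -o "ring.out.%j" setting in the batch script:

```shell
# Read sbatch's "Submitted batch job <id>" message from stdin and print the
# output file name produced by the -o "ring.out.%j" pattern.
ring_output_file() {
    awk '/^Submitted batch job/ { print "ring.out." $4 }'
}

# Usage on the login node:
#   sbatch ring_sbatch.sh | ring_output_file
```

The printed name can then be passed to tools like tail -f once the job has started and the file exists.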

If you prefer to run commands interactively, you can allocate resources with salloc. The salloc command accepts all options that can be given in #SBATCH comments. You will be dropped into an interactive command line after your requested nodes (e.g. -N4) have been allocated for you. Here you can invoke srun manually to distribute work on the compute nodes.

After your job has been submitted, you can execute squeue to list your job status:

squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  186   nanofer ring_sba     nano  R       0:03      4 iffcluster[0601,0603-0605]

The ST column shows the current state of your jobs. The most important state codes are:

Code	Description
CA	Cancelled (by user or administrator)
CD	Completed
CG	Completing (some nodes are still running)
F	Failed
PD	Pending (waiting for free resources)
R	Running
TO	Timeout
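
For scripts that post-process squeue output, the table can be turned into a lookup function. This is a sketch covering only the codes listed above; the function name describe_state is illustrative.

```shell
# Translate a compact Slurm job state code into the description from the
# table above. Codes not listed there are reported as unknown.
describe_state() {
    case "$1" in
        CA) echo "Cancelled" ;;
        CD) echo "Completed" ;;
        CG) echo "Completing" ;;
        F)  echo "Failed" ;;
        PD) echo "Pending" ;;
        R)  echo "Running" ;;
        TO) echo "Timeout" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}

# Usage on the login node (-h suppresses the header, %i is the job ID and
# %t the compact state code):
#   squeue -h -o '%i %t' | while read -r id st; do
#       echo "$id: $(describe_state "$st")"
#   done
```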

More information about Slurm commands can be found on the official website.

Further examples for typical use cases, like running programs with shared memory parallelization (e.g. OpenMP) or multiple sequential programs in one Slurm job, can be found in the FAQ section.

Running interactive sessions

Using `sinteractive`

Interactive compute node sessions can be started and managed with the sinteractive convenience script. Run

sinteractive

without any arguments to enter a text-based user interface. If no other interactive sessions are running, you will directly be asked for a partition for job submission. Otherwise, you will get an overview of your interactive compute node sessions.

Use the arrow keys or j/k to move the menu selection, <enter> to select an entry or <esc> to exit the program. In the main menu, you can resume a running interactive session, start new sessions or cancel old ones.

New compute node sessions are automatically secured against connection loss by running a terminal multiplexer. By default, tmux is used. If you prefer screen instead, you can set

export TERMINAL_MULTIPLEXER=screen

in one of your shell startup files (for example ~/.bashrc). If screen or tmux are already running before sinteractive is executed, new sessions are attached to the running terminal multiplexer instance.

A running (tmux) session can be detached by pressing <Ctrl-b> d. Your interactive compute job will continue to run and you can resume your session on iffslurm at a later time.

tmux supports other useful commands, for example for creating new terminal windows or split views. Visit the official tmux guide for a very detailed introduction, or A Quick and Easy Guide to tmux for a quick introduction to the basic concepts and commands.

Notice:

By using a terminal multiplexer, your interactive jobs are also secured against accidentally closed terminal windows. However, that also implies that running sessions must be exited explicitly by the user, either with the exit command or by running the sinteractive cancel mode.

sinteractive has support for command line options. If you would like to start a new session without an interactive menu, run:

sinteractive -n -p <partition>

Or in order to enter the cancel mode immediately:

sinteractive -k

If your job needs a minimum number of CPU cores or a minimum amount of memory, use the --cpus and --mem switches:

sinteractive --cpus=12 --mem=40G

Passing -h or --help prints a list of all supported command line switches.

FAQ

How can I run multiple jobs on one node?

You can send multiple jobs to one node by repeated sbatch calls. If you request only the resources which one job needs, Slurm will manage to distribute your jobs over the cluster and the default node sharing allows jobs to be placed on the same node without wasting resources.

Example:

You would like to run multiple instances of serial_program. You define a batch script which requests only one core on one node and a specific amount of RAM which is reserved for the serial job:

#!/bin/sh
#SBATCH -p th1 --nodes=1 --ntasks=1 --mem=4G
srun serial_program "${input_file}"

This batch script (saved as serial_batch.sh) can then be submitted multiple times:

sbatch --export=input_file="file1.in" serial_batch.sh
sbatch --export=input_file="file2.in" serial_batch.sh
sbatch --export=input_file="file3.in" serial_batch.sh
sbatch --export=input_file="file4.in" serial_batch.sh

The --export option can be used to pass environment variables into the batch script and use different input files for every job submission.
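
With many input files, the repeated sbatch calls can be wrapped in a loop. The following sketch assumes the serial_batch.sh script from above. The function name submit_all and the SUBMIT_CMD variable are hypothetical helpers introduced here for illustration; SUBMIT_CMD only exists to allow a dry run.

```shell
# Submit serial_batch.sh once per input file given as argument.
# SUBMIT_CMD is a hypothetical override for dry runs; it defaults to sbatch.
submit_all() {
    for input in "$@"; do
        "${SUBMIT_CMD:-sbatch}" --export=input_file="$input" serial_batch.sh
    done
}

# Usage:
#   submit_all file1.in file2.in file3.in file4.in
#   SUBMIT_CMD=echo submit_all *.in    # print the sbatch arguments only
```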

Important:

Always specify the --mem option if you want to share a node with other jobs. If omitted, Slurm will allocate all available memory on the node, effectively preventing other jobs from using the node since no RAM is left.

If you don't know how much RAM (or CPU cores) are available on the nodes of your partition, you can run
sinfo -p <partition-name> --format "%n %c %m"
to get a list of all installed nodes within a partition and their available resources. RAM is specified in megabytes.
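
A reasonable starting value for --mem of a single-core job is the node memory divided by the number of cores. As a sketch, the helper below computes this value from the sinfo output described above. It assumes one header line followed by lines with host name, core count and memory in megabytes; the function name mem_per_core is illustrative.

```shell
# Read `sinfo --format "%n %c %m"` output from stdin and print, for every
# node, the host name and the memory per CPU core in megabytes (rounded down).
mem_per_core() {
    awk 'NR > 1 && $2 > 0 { printf "%s %d\n", $1, int($3 / $2) }'
}

# Usage on the login node:
#   sinfo -p <partition-name> --format "%n %c %m" | mem_per_core
```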

However, many users prefer to allocate one node completely and start multiple serial program runs until the node is fully loaded. This has the advantage of better resource control since only one node is allocated at a time. To accomplish this, you can configure your batch script to spawn several Slurm job steps by using multiple srun commands. Since srun blocks by default and waits for one job step to finish, you must ensure that srun commands are executed in parallel. There are two recommended ways for parallel srun execution:

Simple: Shell background tasks / Shell job control:

Since sbatch files are normal shell scripts, you can utilize the built-in job control feature. Append & to each srun command to create a parallel background task and add a wait at the end of your batch file to wait for all background tasks to complete.

Example with sleep commands:
```
#!/bin/sh
#SBATCH -p th1
#SBATCH --exclusive
#SBATCH --nodes=1
srun --ntasks=1 --exact bash -c 'sleep 10 && date' &
srun --ntasks=1 --exact bash -c 'sleep 10 && date' &
srun --ntasks=1 --exact bash -c 'sleep 10 && date' &
srun --ntasks=1 --exact bash -c 'sleep 10 && date' &
wait
```
The example requests one node in exclusive mode, so all CPU cores are assigned to your job. Each srun command creates a job step with one CPU core and executes sleep 10 followed by date. The trailing & runs the job step as a background task. The --exact option guarantees that each job step gets exactly one (distinct) CPU core. Without the --exact option, every job step would get at least one CPU core but would take as many CPU cores as are still available in the whole job allocation. As a result, the first job step would grab all available CPU cores and the remaining job steps would have to wait until the previous job step has finished, so no job steps would run in parallel at all.

It may be confusing, but in this example, no --mem option is needed for the different job steps (srun commands). If the --mem option is omitted, all job steps share the available memory of the whole job.
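
A plain wait without arguments returns success even if a background task failed, so the batch script cannot tell whether all job steps succeeded. The following sketch shows one way to collect the exit status of every background task by waiting for each process ID individually. The function name run_all and the programs ./step_a and ./step_b in the usage comment are hypothetical.

```shell
# Start every argument as a background command, then wait for each one
# individually. The return value is the number of commands that failed.
run_all() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}

# Usage inside a batch script:
#   run_all "srun --ntasks=1 --exact ./step_a" \
#           "srun --ntasks=1 --exact ./step_b" \
#       || echo "some job steps failed" >&2
```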

Advanced: Execution of many job steps with GNU parallel:

GNU parallel is a utility to spawn parallel instances of a given command. The advantage of GNU parallel is the built-in job queue. It keeps track of started and completed Slurm job steps in a job log, so the same sbatch file can be resubmitted if the job did not run to completion. With --resume, GNU parallel then only runs the job steps that are not yet recorded in the job log; job steps that exited with a failure can additionally be rerun by using --resume-failed instead.

Rewritten job control example:

#!/bin/sh
#SBATCH -p th1 --nodes=1 --ntasks=4 --exclusive

# Define how GNU parallel should be executed.
# -N 1: Number of arguments to pass to each job
# -j ${SLURM_NTASKS}: Number of tasks GNU parallel is allowed to run simultaneously.
#                     ${SLURM_NTASKS} contains the number of tasks requested for the job
#                     (--ntasks=4 in this example).
# --joblog parallel_joblog: Write a GNU parallel job log file.
# --resume: Use an existing job log file to resume the session. Useful if a job must be
#           resubmitted. Only jobs that have not finished will be run.
parallel_cmd="parallel -N 1 -j ${SLURM_NTASKS} --joblog parallel_joblog --resume"

# Define how srun should be executed.
# --nodes=1: Use one node for each srun call
# --ntasks=1: Use one core for each srun call
# --exact: Ensure that multiple srun calls will only use exactly one distinct cpu core
srun_cmd="srun --nodes=1 --ntasks=1 --exact"

# Define the command which will be executed in parallel.
# {1} will be substituted with one given argument from the argument list.
cmd="bash -c 'sleep {1} && date'"

# Call ${cmd} with ${srun_cmd} to use allocated resources and execute both
# with ${parallel_cmd} to avoid blocking srun calls.
# `10 11 12 13` will be passed as separate arguments to each ${cmd} call.
# In this example, sleep will be called four times and will sleep 10, 11, 12 and 13
# seconds in parallel.
${parallel_cmd} "${srun_cmd} ${cmd}" ::: 10 11 12 13
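
After the job has finished, the job log can be inspected for failed job steps. This sketch assumes the tab-separated layout GNU parallel writes with --joblog (columns Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command). The function name failed_steps is illustrative.

```shell
# Print sequence number, exit value and command of every job log entry with a
# non-zero Exitval (column 7) or Signal (column 8).
# Argument: <path to the GNU parallel job log>
failed_steps() {
    awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $1 "\t" $7 "\t" $9 }' "$1"
}

# Usage:
#   failed_steps parallel_joblog
```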

What should be considered when running shared memory parallelization / OpenMP jobs?

Slurm strictly distinguishes between processes (which are used in distributed memory parallelization like MPI) and threads (used in shared memory parallelization like OpenMP or POSIX pthreads). Therefore, it is important to use the correct resource request options in sbatch files and srun commands.

These are the most common cases:

Run an OpenMP / threaded program on one node using all available CPU cores:

Create a batch script that allocates one process on one node and assigns all CPU cores to this process:
```
#!/bin/bash
#SBATCH -p th1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --exclusive

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
srun ./openmp_program
```
Replace 12 with the number of cores of a single compute node in your Slurm partition. If you don't know the exact count of cores, you can run
```
sinfo -p <your-partition> --format "%n %c"
```
to get a list of available cores per node in the selected partition.

Run an MPI-only program without any threading / OpenMP:

Create a batch script and specify how many nodes and MPI processes per node you would like to use:
```
#!/bin/bash
#SBATCH -p th1
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12
#SBATCH --exclusive

srun ./mpi_program
```
This example allocates 4 nodes with 12 MPI processes each and prevents other jobs from using the nodes. The srun command creates the needed MPI execution environment implicitly.

Combine MPI and OpenMP parallelization:

Create a batch script and configure how many nodes, MPI processes and OpenMP threads per node you would like to use:
```
#!/bin/bash
#SBATCH -p th1
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=3
#SBATCH --exclusive

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
srun ./mpi_program
```
This script requests 4 nodes with 4 MPI processes per node. Every MPI process will run 3 OpenMP threads, so each node uses 4 × 3 = 12 CPU cores.
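
In all three cases, the product of --ntasks-per-node and --cpus-per-task must not exceed the number of CPU cores of a single node. As a sketch, the helper below performs this check before a job is submitted. The function name check_layout is illustrative, and a missing --cpus-per-task counts as 1.

```shell
# Check that tasks-per-node x cpus-per-task fits into the cores of one node.
# Arguments: <cores-per-node> <ntasks-per-node> [<cpus-per-task>]
check_layout() {
    cores=$1
    tasks=$2
    threads=${3:-1}
    used=$((tasks * threads))
    if [ "$used" -gt "$cores" ]; then
        echo "does not fit: $used cores requested, $cores available per node"
        return 1
    fi
    echo "ok: $used of $cores cores used per node"
}

# Usage:
#   check_layout 12 4 3     # hybrid MPI + OpenMP example
#   check_layout 12 12      # MPI-only example
```

For the hybrid example, this prints "ok: 12 of 12 cores used per node".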

How can I use another MPI implementation?

The cluster has multiple MPI installations for different compiler and MPI implementation combinations. You can choose between:

Intel compiler + Intel MPI:

Run the compiler-select script to load the Intel environment:
```
source compiler-select intel-oneapi
```
Now, you can compile your programs with mpiicx (C compiler), mpiicpx (C++ compiler) and mpiifx (Fortran compiler). Job submissions will also use the Intel MPI implementation.

Therefore, it is important to load the correct MPI environment for job submissions!

The older Intel classic compilers are also still available, but deprecated (mpiicc, mpiicpc and mpiifort).

Important:

The Intel compiler brings its own MPI implementation. Do not combine the Intel compiler with OpenMPI or MPICH since they are not compatible. The OpenMPI and MPICH installations on the cluster can only be combined with GCC.

GCC + OpenMPI:

GCC can be used with different MPI implementations but it is recommended to use OpenMPI with GCC. GCC 11 is the default compiler. If you need older or newer versions of GCC, run
```
compiler-select
```
without any arguments to get a list of available versions. For example, execute
```
source compiler-select gcc12
```
to load GCC 12.

Run
```
source compiler-select openmpi
```
to load the OpenMPI environment, set needed paths to mpicc, mpic++ and mpifort and configure the Slurm integration. Compile your program and submit your jobs within this environment.

The selection of GCC and OpenMPI can also be combined in a single call:
```
source compiler-select gcc12 openmpi
```

GCC + MPICH:

Run
```
source compiler-select mpich
```
to load the MPICH environment. Like OpenMPI, MPICH can be combined with different GCC versions.