Slurm Tutorial

The clusters use Slurm as their workload manager. Slurm provides a large set of commands to allocate jobs, report their state, attach to running programs, or cancel submissions. This section demonstrates their usage by example to get you started.

Compute jobs can be submitted and controlled from a central login node (iffslurm) using ssh:

ssh iffslurm.iff.kfa-juelich.de

Run the sinfo command to get a list of available queues (which are called partitions in Slurm) and free nodes:

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
nanofer      up   infinite     10   idle iffcluster[0601-0610]
nanofer      up   infinite      6  alloc iffcluster[0611-0616]
nanofer      up   infinite      1  down* iffcluster0617

By default, only partitions you have access to are listed (here, the partitions available to nanofer members). In this example, 10 nodes are free for job submission, 6 have been allocated by other nanofer users, and one node does not respond.
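If you need more detail, for example which node is down and why, sinfo can also print a node-oriented long listing (the exact columns shown depend on your Slurm version):

sinfo -N -l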

Next, try to submit a simple MPI job. You can take this modified ring communication example from mpitutorial.com to test multiple nodes and their interconnects:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define WAIT_BEFORE_SEND 10

int main(void) {
    int world_rank, world_size, token;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Every rank except 0 first waits for the token from its left neighbor. */
    if (world_rank != 0) {
        MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_rank - 1);
    } else {
        token = 0;
    }
    /* Keep the job alive long enough to try further Slurm commands on it. */
    sleep(WAIT_BEFORE_SEND);
    ++token;
    /* Pass the incremented token to the right neighbor; the last rank wraps around to rank 0. */
    MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
    /* Rank 0 closes the ring by receiving the token from the last rank. */
    if (world_rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_size - 1);
    }
    MPI_Finalize();
}

This simple example forms a ring across the allocated compute nodes and passes an incremented token to the next ring neighbor until the token has visited each process once. The sleep call guarantees that the job runs long enough to try further Slurm commands on it.

Put the source code into a file ring.c, select the Intel compiler

source compiler-select intel

and compile the source code:

icc -o ring ring.c -I${I_MPI_ROOT}/intel64/include -L${I_MPI_ROOT}/intel64/lib/release -lmpi
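If the Intel MPI compiler wrappers are available in your environment after compiler-select (this may depend on the cluster setup), the same build can usually be written more compactly, since the wrapper adds the MPI include and library paths for you:

mpiicc -o ring ring.c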

Notice:

The compiler-select script can be used to select from different versions of GCC and the Intel compiler. Run it without any arguments to get a list of all available choices. Additionally, you can add

alias compiler-select='source compiler-select'

to your .bashrc so you can run compiler-select without prefixing it with source.

In order to submit the ring program to the cluster, you need to provide a batch script:

cat <<-EOF > ring_sbatch.sh
    #!/bin/sh
    #SBATCH -p nanofer --time=5
    srun ./ring
EOF

which can then be passed to the sbatch command:

sbatch -o "ring.out.%j" -N4 --ntasks-per-node=12 ring_sbatch.sh

The sbatch command submits a new job to the Slurm scheduler and requests 4 nodes for the job (-N4). Each node will run 12 MPI processes (--ntasks-per-node=12). The output is written to a file named ring.out. followed by the job id (-o "ring.out.%j"). Note that all sbatch options have to be given before the script name; anything after the script name is passed to the batch script as arguments. The batch file must start with a shebang line (#!/bin/sh). The following lines starting with #SBATCH are optional configuration values for the sbatch command. In this case, the job is limited to 5 minutes of runtime on the partition named nanofer (name taken from the sinfo output). Your actual program is then run by invoking srun (srun ./ring).
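When the job is accepted, sbatch prints the assigned job id (e.g. "Submitted batch job 186"). Once the job has finished, you can inspect its output in the corresponding file, for example:

cat ring.out.186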

If you prefer to run commands interactively, you can allocate resources with salloc. You will be dropped into an interactive command line after your requested nodes (e.g. -N4) have been allocated for you. Here you can invoke srun manually to distribute work on the compute nodes.
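A minimal interactive session could look like this (partition, node count and time limit are only examples):

salloc -p nanofer -N4 --time=30
srun ./ring
exit

Leaving the interactive shell with exit releases the allocation again.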

Notice:

You don't need to use mpirun or mpiexec in Slurm job files. srun creates an MPI runtime environment for you implicitly.

The parameters -N and --ntasks-per-node can also be added to the batch file if you would like to hardcode the number of nodes and processes.
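A fully self-contained version of the batch script above could then look like this (values taken from the sbatch call above) and would be submitted with a plain sbatch ring_sbatch.sh:

#!/bin/sh
#SBATCH -p nanofer --time=5
#SBATCH -N 4
#SBATCH --ntasks-per-node=12
#SBATCH -o ring.out.%j
srun ./ring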

After your job has been submitted, you can execute squeue to list your job status:

squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  186   nanofer ring_sba     nano  R       0:03      4 iffcluster[0601,0603-0605]
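On a busy cluster you can restrict the listing to your own jobs or to a single job id, and query the full job details with scontrol (186 is the job id from the listing above):

squeue -u $USER
squeue -j 186
scontrol show job 186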

The ST column shows the current status of your jobs. The most important status codes are:

Code  Description
CA    Cancelled (by user or administrator)
CD    Completed
CG    Completing (some nodes are still running)
F     Failed
PD    Pending (waiting for free resources)
R     Running
TO    Timeout
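A pending or running job can be cancelled with scancel and its job id (taken from squeue):

scancel 186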

More information about Slurm commands can be found on the official website.