Tutorial
The clusters use Slurm as workload manager. Slurm provides a large set of commands to allocate jobs, report their states, attach to running programs or cancel submissions. This section will demonstrate their usage by example to get you started.
Compute jobs can be submitted and controlled from a central login node (iffslurm) using ssh:
ssh iffslurm.iff.kfa-juelich.de
iffslurm is reachable from the internet, so there is no need to use VPN or an ssh gateway in external networks.
If the authenticity of iffslurm cannot be established, you will be asked to accept the ssh host key fingerprint. These are the currently valid fingerprints:
Type | Fingerprint |
---|---|
RSA | SHA256:yLLipu5Ti8z6B9CPtJRNm0G9tBswQav5LigPUFHqrjo |
ECDSA | SHA256:ehgBFD/aeTJPcC+0sosJBGOG1ef1d8oDtzxuZ7wabso |
ED25519 | SHA256:W0XVo86s4Wm2vhFAMwsOQd+M45P63ojFRZJQuOy1Gvk |
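If you connect frequently, a host entry in your ssh client configuration saves typing. A minimal sketch for ~/.ssh/config (replace the user name with your own account):

Host iffslurm
    HostName iffslurm.iff.kfa-juelich.de
    User <your_username>

Afterwards, ssh iffslurm is sufficient to log in.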
Run the sinfo command to get a list of available queues (which are called partitions in Slurm) and free nodes:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
nanofer up infinite 10 idle iffcluster[0601-0610]
nanofer up infinite 6 alloc iffcluster[0611-0616]
nanofer up infinite 1 down* iffcluster0617
By default, only partitions you have access to will be listed (for nanofer members in this example). In this case, 10 nodes are free for job submissions, 6 have been allocated by other nanofer users and one node does not respond.
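If you have access to several partitions, you can also restrict the output to a single one with the -p switch:

sinfo -p nanofer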
Next, try to submit a simple MPI job. You can take this modified ring communication example from mpitutorial.com to test multiple nodes and their interconnects:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define WAIT_BEFORE_SEND 10

int main(void) {
    int world_rank, world_size, token;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    /* All processes except the ring start wait for the token from their predecessor. */
    if (world_rank != 0) {
        MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_rank - 1);
    } else {
        token = 0;
    }
    sleep(WAIT_BEFORE_SEND);
    /* Increment the token and pass it to the next neighbor in the ring. */
    ++token;
    MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
    /* Process 0 closes the ring by receiving the token from the last process. */
    if (world_rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n", world_rank, token, world_size - 1);
    }
    MPI_Finalize();
    return EXIT_SUCCESS;
}
This simple example forms a ring over the allocated compute nodes and passes an incremented token from each process to its next ring neighbor until the token has travelled around the ring once. The sleep call ensures that the job runs long enough for you to try further Slurm commands while it is active.
Put the source code into a file ring.c, select the Intel compiler
source compiler-select intel
and compile the source code:
mpiicc -o ring ring.c
Notice:
The compiler-select script can be used to select from different versions of GCC and the Intel compiler. Run it without any argument to get a list of all possible choices. Additionally, you can add
alias compiler-select='source compiler-select'
to your .bashrc to run compiler-select without a prefixed source.
In order to submit the ring program to the cluster, you need to provide a batch script:
cat <<-EOF > ring_sbatch.sh
#!/bin/sh
#SBATCH -p nanofer --time=5
srun ./ring
EOF
which can then be passed to the sbatch command:
sbatch -N4 --ntasks-per-node=12 -o "ring.out.%j" ring_sbatch.sh
The sbatch command submits a new job to the Slurm scheduler and requests 4 nodes for the job (-N4). Each node will run 12 MPI processes (--ntasks-per-node=12). The output will be written to a file named ring.out. with the job ID as suffix (-o "ring.out.%j"). Note that all sbatch options must be placed before the script name; anything given after the script is passed to the script itself as arguments. The batch file must start with a shebang line (#!/bin/sh). The following lines starting with #SBATCH are optional configuration values for the sbatch command. In this case, the job will be limited to 5 minutes of runtime on the partition named nanofer (name taken from the sinfo command). Your actual program is then run by invoking srun (srun ./ring).
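Since every process receives, increments and forwards the token in turn, process N receives token N. With 4 nodes and 12 tasks per node (48 processes), the output file should therefore contain lines such as:

Process 1 received token 1 from process 0
Process 2 received token 2 from process 1
...
Process 0 received token 48 from process 47

Keep in mind that the token travels the ring sequentially, so the complete run takes roughly 48 × 10 s = 480 s, which exceeds the 5 minute limit set above. Raise the --time limit or lower WAIT_BEFORE_SEND if you want the ring to complete within the allocated walltime.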
If you prefer to run commands interactively, you can allocate resources with salloc. You will be dropped into an interactive command line after your requested nodes (e.g. -N4) have been allocated for you. Here you can invoke srun manually to distribute work on the compute nodes.
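For example, an interactive allocation with the same resources as the sbatch call above could look like this:

salloc -N4 --ntasks-per-node=12 -p nanofer
srun ./ring
exit

Leaving the interactive shell with exit releases the allocation again.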
Notice:
You don't need to use mpirun or mpiexec in Slurm job files. srun creates an MPI runtime environment for you implicitly.
The parameters -N and --ntasks-per-node can also be added to the batch file if you would like to hardcode the number of nodes and processes.
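For example, the batch file from above with all submission parameters moved into #SBATCH directives:

#!/bin/sh
#SBATCH -p nanofer --time=5
#SBATCH -N 4
#SBATCH --ntasks-per-node=12
#SBATCH -o ring.out.%j
srun ./ring

With these directives in place, a plain sbatch ring_sbatch.sh is sufficient.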
After your job has been submitted, you can execute squeue to list your job status:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
186 nanofer ring_sba nano R 0:03 4 iffcluster[0601,0603-0605]
The ST column describes the current status of your jobs. The most important status codes are:
Code | Description |
---|---|
CA | Cancelled (by user or administrator) |
CD | Completed |
CG | Completing (some nodes are still running) |
F | Failed |
PD | Pending (waiting for free resources) |
R | Running |
TO | Timeout |
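These codes can also be used to filter the squeue output, for example to list only your pending jobs:

squeue -u $USER --states=PD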
More information about Slurm commands can be found on the official website.
Running interactive sessions
Using sinteractive
Interactive compute node sessions can be started and managed with the sinteractive convenience script. Run
sinteractive
without any arguments to enter a text-based user interface. If no other interactive sessions are running, you will be asked directly for a partition for job submission. Otherwise, you will get an overview of your interactive compute node sessions.
Use the arrow keys or j/k to move the menu selection, <enter> to select an entry or <esc> to exit the program. In the main menu, you can resume a running interactive session, start new sessions or cancel old ones.
New compute node sessions are automatically secured against connection loss by running a terminal multiplexer. By default, tmux is used. If you prefer screen instead, you can set
export TERMINAL_MULTIPLEXER=screen
in one of your shell startup files (for example ~/.bashrc). If screen or tmux is already running before sinteractive is executed, new sessions are attached to the running terminal multiplexer instance.
A running (tmux) session can be detached by pressing <Ctrl-b> d. Your interactive compute job will continue to run and can be resumed from a later ssh session on iffslurm.
tmux supports other useful commands, for example for creating new terminal windows or split views. Visit the official tmux guide for a very detailed introduction or A Quick and Easy Guide to tmux for a quick introduction to the basic concepts and commands.
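For reference, a few of the default tmux key bindings (each follows the <Ctrl-b> prefix):

<Ctrl-b> c    create a new terminal window
<Ctrl-b> n    switch to the next window
<Ctrl-b> %    split the current pane into a left and a right pane
<Ctrl-b> "    split the current pane into a top and a bottom pane
<Ctrl-b> d    detach from the session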
Notice:
By using a terminal multiplexer, your interactive jobs are also secured against accidentally closed terminal windows. However, that also implies that running sessions must be exited explicitly by the user, either with the exit command or by running the sinteractive cancel mode.
sinteractive has support for command line options. If you would like to start a new session without an interactive menu, run:
sinteractive -n -p <partition>
Or in order to enter the cancel mode immediately:
sinteractive -k
If your job needs a minimum number of CPU cores or a minimum amount of memory, use the --cpus and --mem switches:
sinteractive --cpus=12 --mem=40G
Passing -h or --help prints a list of all supported command line switches.
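The switches can also be combined. For example, starting a 12-core, 40 GB session on a given partition without entering the menu should work like this (a sketch built from the options above):

sinteractive -n -p <partition> --cpus=12 --mem=40G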
Migrating from Torque to Slurm
Slurm has compatibility wrappers for frequently used Torque commands like qdel, qhold, qrls, qstat and qsub and recognizes most #PBS directives in batch scripts. However, the wrappers only support basic functionality and PBS directives do not support more advanced Slurm features like setting node lists. Therefore, it is advisable to translate PBS batch files to Slurm batch files and to get used to the Slurm commands. Many Torque commands and directives have equivalent Slurm counterparts, which are compared in this section.
The information in this section is taken from the official Slurm comparison sheet.
User Commands
Description | Torque | Slurm |
---|---|---|
Job submission | qsub [script_file] | sbatch [script_file] |
Job deletion | qdel [job_id] | scancel [job_id] |
Job status (by job) | qstat [job_id] | squeue [job_id] |
Job status (by user) | qstat -u [user_name] | squeue -u [user_name] |
Job hold | qhold [job_id] | scontrol hold [job_id] |
Job release | qrls [job_id] | scontrol release [job_id] |
Queue list | qstat -Q | squeue |
Node list | pbsnodes -l | sinfo -N OR scontrol show nodes |
Cluster status | qstat -a | sinfo |
GUI | xpbsmon | sview |
Batch scripts
Description | Torque | Slurm |
---|---|---|
Script directive | #PBS | #SBATCH |
Queue | -q [queue] | -p [queue] |
Node Count | -l nodes=[count] | -N [min[-max]] |
CPU Count | -l ppn=[count] OR -l mppwidth=[PE_count] | -n [count] |
Wall Clock Limit | -l walltime=[hh:mm:ss] | -t [min] OR -t [days-hh:mm:ss] |
Standard Output File | -o [file_name] | -o [file_name] |
Standard Error File | -e [file_name] | -e [file_name] |
Combine stdout/err | -j oe (both to stdout) OR -j eo (both to stderr) | (use -o without -e) |
Copy Environment | -V | --export=[ALL / NONE / variables] |
Event Notification | -m abe | --mail-type=[events] |
Email Address | -M [address] | --mail-user=[address] |
Job Name | -N [name] | --job-name=[name] |
Job Restart | -r [y/n] | --requeue OR --no-requeue |
Working Directory | | --workdir=[dir_name] |
Resource Sharing | -l naccesspolicy=singlejob | --exclusive OR --shared |
Memory Size | -l mem=[MB] | --mem=[mem][M/G/T] OR --mem-per-cpu=[mem][M/G/T] |
Account to charge | -W group_list=[account] | --account=[account] |
Tasks Per Node | -l mppnppn [PEs_per_node] | --tasks-per-node=[count] |
CPUs Per Task | | --cpus-per-task=[count] |
Job Dependency | -d [job_id] | --depend=[state:job_id] |
Job Project | | --wckey=[name] |
Job host preference | | --nodelist=[nodes] AND/OR --exclude=[nodes] |
Quality Of Service | -l qos=[name] | --qos=[name] |
Job Arrays | -t [array_spec] | --array=[array_spec] |
Generic Resources | -l other=[resource_spec] | --gres=[resource_spec] |
Licenses | | --licenses=[license_spec] |
Begin Time | -a [date_time] | --begin=YYYY-MM-DD[THH:MM[:SS]] |
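As a concrete illustration, here is a small Torque batch script and a possible Slurm translation based on the table above (partition, limits and the my_program executable are placeholder values).

Torque version:

#!/bin/sh
#PBS -q nanofer
#PBS -l nodes=4
#PBS -l walltime=00:05:00
#PBS -N my_job
#PBS -o my_job.out
cd $PBS_O_WORKDIR
./my_program

Slurm translation:

#!/bin/sh
#SBATCH -p nanofer
#SBATCH -N 4
#SBATCH -t 5
#SBATCH --job-name=my_job
#SBATCH -o my_job.out
srun ./my_program

Note that the cd $PBS_O_WORKDIR line needs no Slurm counterpart: Slurm jobs start in the directory the job was submitted from by default.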
Environment variables
Description | Torque | Slurm |
---|---|---|
Job ID | $PBS_JOBID | $SLURM_JOBID |
Submit Directory | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
Submit Host | $PBS_O_HOST | $SLURM_SUBMIT_HOST |
Node List | $PBS_NODEFILE | $SLURM_JOB_NODELIST |
Job Array Index | $PBS_ARRAYID | $SLURM_ARRAY_TASK_ID |
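For example, a Slurm batch script can use these variables to log where and on which nodes it ran (a minimal sketch reusing the tutorial's partition):

#!/bin/sh
#SBATCH -p nanofer --time=5
echo "Job ${SLURM_JOBID} was submitted from ${SLURM_SUBMIT_DIR} on ${SLURM_SUBMIT_HOST}"
echo "Allocated nodes: ${SLURM_JOB_NODELIST}"
srun ./ring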
FAQ
How can I run multiple jobs on one node?
Currently, Slurm assigns each job to an individual node, so it is not possible to place multiple jobs on one node with repeated sbatch calls. However, you can configure your batch script to spawn several Slurm job steps by using multiple srun commands. Since srun blocks by default and waits for its job step to finish, you must ensure that the srun commands are executed in parallel. There are two recommended ways to run srun in parallel:
Simple: Shell background tasks / Job control:
Since sbatch files are normal shell scripts, you can utilize the built-in job control feature. Append & to each srun command to create a parallel background task and add a wait at the end of your batch file to wait for all background tasks to complete.
Example with two sleep commands:
#!/bin/sh
#SBATCH -p th1 --nodes=1 --ntasks=2

srun --nodes=1 --ntasks=1 --exclusive bash -c 'sleep 10 && date' &
srun --nodes=1 --ntasks=1 --exclusive bash -c 'sleep 10 && date' &
wait
The example requests one node with at least two CPU cores. Each srun command creates a sub-allocation with one processor and executes sleep 10. The trailing & creates a background task. The --exclusive option guarantees that the sub-allocations get distinct CPU resources.
Advanced: Execution of many job steps with GNU parallel:
GNU parallel is a utility to spawn parallel instances of a given command. The advantage of GNU parallel is the built-in job queue. It keeps track of submitted and completed Slurm job steps and can be used to resubmit the same sbatch file if one of the job steps exited with a failure. In this case, GNU parallel will only resubmit the Slurm job steps which were not completed successfully.
Rewritten job control example:
#!/bin/sh
#SBATCH -p th1 --nodes=1 --ntasks=2

# Define how GNU parallel should be executed.
# -N 1: Number of arguments to pass to each job
# -j ${SLURM_NTASKS}: Number of tasks GNU parallel is allowed to run simultaneously.
#                     ${SLURM_NTASKS} contains the number of reserved cores on the node.
# --joblog parallel_joblog: Write a GNU parallel job log file.
# --resume: Use an existing job log file to resume the session. Useful if a job must be
#           resubmitted. Only job steps which did not finish will be run.
parallel_cmd="parallel -N 1 -j ${SLURM_NTASKS} --joblog parallel_joblog --resume"

# Define how srun should be executed.
# --nodes=1: Use one node for each srun call
# --ntasks=1: Use one core for each srun call
# --exclusive: Ensure that multiple srun calls will use distinct CPU cores
srun_cmd="srun --nodes=1 --ntasks=1 --exclusive"

# Define the command which will be executed in parallel.
# {1} will be substituted with one argument from the argument list.
cmd="bash -c 'sleep {1} && date'"

# Call ${cmd} with ${srun_cmd} to use the allocated resources and execute both
# with ${parallel_cmd} to avoid blocking srun calls.
# `10 11` will be passed as separate arguments to each ${cmd} call.
# In this example, sleep will be called two times and will sleep 10 and 11
# seconds in parallel.
${parallel_cmd} "${srun_cmd} ${cmd}" ::: 10 11
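If one of the job steps fails or the job hits its time limit, you can simply submit the same batch file again: because of the --joblog and --resume options, GNU parallel skips the job steps that already completed successfully and only reruns the remaining ones.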