SLURM SCI User Hands-on


Usage

Slurm is the software responsible for managing and allocating the cluster resources. To submit a job you need to define its resource requirements. The most commonly used parameters are:

  • -J JobName: Job name
  • --time=DD-HH:MM:SS: The expected time the job will run for (walltime). Format: DD=days, HH=hours, MM=minutes, SS=seconds. The default depends on the partition (1 day for the batch partition)
  • --mem-per-cpu=MMMM: Memory per CPU core (in MB). Default 2048 (DefMemPerCPU=2048)
  • --ntasks=X: Number of MPI tasks. Default 1 task.
  • --cpus-per-task=Y: Number of OpenMP threads. Default 1 thread.
  • --nodes=Z: Number of nodes. Default 1 node.
  • --gpus=[type:]<number>
    --gpus-per-node=[type:]<number>
    --gpus-per-socket=[type:]<number>
    --gpus-per-task=[type:]<number>
    GPUs for the job

IMPORTANT: the more accurate the job requirements, the more efficiently the cluster can be used.

You can define those requirements as options of the sbatch command or include them as #SBATCH directives in the submission script header. Example (an equivalent command-line form is shown after the script):

#!/bin/bash
#SBATCH -J JobName
#SBATCH --time=08:00:00 # Walltime
#SBATCH --mem-per-cpu=4096 # memory/cpu (in MB)
#SBATCH --ntasks=2 # 2 tasks
#SBATCH --cpus-per-task=4 # number of cores per task
#SBATCH --nodes=1 # number of nodes
#SBATCH --gpus-per-node=nvidia_h200:2 # gpus h200 per node
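
The same requirements can also be passed directly on the command line; command-line options override the corresponding #SBATCH directives. A minimal sketch, assuming the script above is saved as job.sh (hypothetical name):

[ user@ui ~ ]$ sbatch -J JobName --time=08:00:00 --mem-per-cpu=4096 \
               --ntasks=2 --cpus-per-task=4 --nodes=1 \
               --gpus-per-node=nvidia_h200:2 job.sh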

Job states

| State | Abbreviation | Description |
| --- | --- | --- |
| BOOT_FAIL | BF | Job terminated due to a launch or boot failure, typically caused by hardware issues (e.g., unable to boot a node or block and the job cannot be requeued). |
| CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| COMPLETED | CD | Job has terminated all processes on all nodes with an exit code of zero (successful completion). |
| COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active while SLURM cleans up resources. |
| CONFIGURING | CF | Job is being configured, typically waiting for nodes to be allocated before execution begins. |
| DEADLINE | DL | Job terminated upon reaching its deadline. |
| FAILED | F | Job terminated with a non-zero exit code or experienced another failure condition. |
| NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
| OUT_OF_MEMORY | OOM | Job was terminated after exceeding its allocated memory. |
| PENDING | PD | Job is awaiting resource allocation. |
| PREEMPTED | PR | Job was terminated (or suspended) due to preemption by another job with higher priority. |
| RUNNING | R | Job currently has an allocation and is executing. |
| RESV_DEL_HOLD | RD | Job is being held because the requested reservation was deleted. |
| REQUEUE_HOLD | RH | Held job is being requeued. |
| REQUEUED | RQ | Completing or failed job has been requeued for another execution attempt. |
| RESIZING | RS | Job is changing its size (e.g., adding or releasing nodes). |
| REVOKED | RV | Job was revoked, typically due to dependency problems or system maintenance. |
| SIGNALING | SI | SLURM is signaling the job (e.g., sending SIGTERM or SIGKILL). |
| SPECIAL_EXIT | SE | Job terminated with a special exit condition defined by the system. |
| STOPPED | ST | Job has an allocation, but execution has been stopped with a SIGSTOP signal. CPUs remain allocated to the job. |
| SUSPENDED | S | Job has an allocation but execution is suspended and CPUs are released for other jobs. |
| TIMEOUT | TO | Job terminated after reaching its time limit. |
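
To check the state of a specific job you can use squeue (pending/running jobs) or sacct (finished jobs); a minimal sketch, using a hypothetical job id 12345:

[ user@ui ~ ]$ squeue -j 12345 -o "%i %T %r"   # job id, state, pending reason
[ user@ui ~ ]$ sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed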

Default priority

Slurm gives each job a priority and works to free up appropriate resources for the highest-priority job. At regular intervals, Slurm recalculates the priorities of all jobs. The priority is based on several factors, each with a different weight:

  • Fairshare: your job will be given an initial score based on your share and your historical use of the cluster, with your recent use being given more weight.
  • Partition priority: the partition priority is defined according to the job length, so the longer the job, the lower its priority.
  • Job size: jobs with a larger (number of cores) / (walltime) ratio get more priority.
  • Wait time: a job gets more priority as it waits. This wait-time bonus gradually increases over a certain period of time.

The job's priority at any given time will be a weighted sum of all the factors that have been enabled in the slurm.conf file. Job priority can be expressed as:

  Job_priority = site_factor +
                 (PriorityWeightAge) * (age_factor) +
                 (PriorityWeightAssoc) * (assoc_factor) +
                 (PriorityWeightFairshare) * (fair-share_factor) +
                 (PriorityWeightJobSize) * (job_size_factor) +
                 (PriorityWeightPartition) * (partition_factor) +
                 (PriorityWeightQOS) * (QOS_factor) +
                 SUM(TRES_weight_cpu * TRES_factor_cpu,
                     TRES_weight_<type> * TRES_factor_<type>, ...) -
                 nice_factor

Configured weights:

PriorityWeightFairShare = 10000  
PriorityWeightJobSize = 1000  
PriorityWeightPartition = 10  
PriorityWeightQOS = 1000
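
To see how these factors contribute to the priority of your own pending jobs, you can query the scheduler directly; a minimal sketch (the job id 12345 is hypothetical):

[ user@ui ~ ]$ sprio -j 12345        # per-factor priority breakdown for one job
[ user@ui ~ ]$ sprio -l              # long listing for all pending jobs
[ user@ui ~ ]$ sshare -u $USER       # fair-share usage for your associations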

Running a job

Most common Slurm commands

  • sbatch: submits a job script.
  • scancel: cancels a running or pending job.
  • srun: runs a command across the allocated nodes (used to launch job steps).
  • sbcast: transfers file(s) to the compute nodes allocated for the job.
  • sattach: connects stdin/out/err to an existing job or job step.
  • squeue: displays the job queue.

Commonly used Slurm variables

  • $SLURM_JOBID (job id)
  • $SLURM_JOB_NODELIST (nodes allocated for job)
  • $SLURM_NNODES (number of nodes)
  • $SLURM_SUBMIT_DIR (directory job was submitted from)
  • $SLURM_ARRAY_JOB_ID (job id for the array)
  • $SLURM_ARRAY_TASK_ID (job array index value)
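
A minimal sketch of a job script that prints these variables, useful to verify what Slurm actually allocated (the job name and settings are illustrative):

#!/bin/bash
#SBATCH -J debug_env
#SBATCH -n 1
#SBATCH -t 00:01:00

# Print the allocation details Slurm exposes in the job environment
echo "Job ID:        $SLURM_JOBID"
echo "Node list:     $SLURM_JOB_NODELIST"
echo "Num nodes:     $SLURM_NNODES"
echo "Submit dir:    $SLURM_SUBMIT_DIR"
# Only set for array jobs:
echo "Array job ID:  ${SLURM_ARRAY_JOB_ID:-not an array job}"
echo "Array task ID: ${SLURM_ARRAY_TASK_ID:-not an array job}"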

Temporary folders

The following temporary folders are created for each job through the job_container/tmpfs plugin, which provides job-specific, private temporary file system space. Once the job is completed, the content of those folders is removed. These temporary folders are meant for I/O-intensive operations. In order to take advantage of these high-performance file systems, you will need to stage the required files in and out, as shown in the sketch below the table.

| File System | Job Path | Real Path |
| --- | --- | --- |
| local disk | /tmp | /scratch/$SLURM_JOBID/.$SLURM_JOBID |
| local memory | /dev/shm | Newly mounted for every job |
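
A minimal stage-in/stage-out sketch using the job-private /tmp (the application and file names my_app, input.dat and output.dat are hypothetical):

#!/bin/bash
#SBATCH -J stage_io
#SBATCH -n 1
#SBATCH -t 01:00:00

# Stage the input into the job-private local disk
cp $SLURM_SUBMIT_DIR/input.dat /tmp/

# Run the application against the fast local copy
cd /tmp
$SLURM_SUBMIT_DIR/my_app input.dat > output.dat

# Stage the results back before the job ends (/tmp is removed afterwards)
cp /tmp/output.dat $SLURM_SUBMIT_DIR/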

Examples of batch jobs

Let's submit a first batch job using the sbatch command to interact with Slurm.

[ user@ui ~ ]$ sbatch --partition=ce -t 00:01:00 --wrap "sleep 30; echo hello world"
  • The -t option stands for time and sets a limit on the total run time of the job allocation.

  • If no time limit is defined, the maximum time limit available in the default partition will be applied.

  • The --wrap option means that the following string (in "") will be turned by Slurm into a simple shell script.

Examples of interactive jobs

Interactive job selecting partition (NO MPI jobs).

[ user@ui ~ ]$ export SLURM_MPI_TYPE=none
[ user@ui ~ ]$ srun --mpi=none --partition=ce --pty bash

Interactive job selecting partition and node.

[ user@ui ~ ]$ salloc --partition=ce -w ce210

Interactive job selecting partition, memory per cpu, cores and walltime.

[ user@ui ~ ]$ salloc --partition=ce --mem-per-cpu=8G -c 4 -t 06:00:00  

Interactive job selecting partition, GPU type and count, memory per CPU, cores and walltime.

[ user@ui ~ ]$ salloc --partition=ce --mem-per-cpu=16G -c 4 -t 06:00:00 --gpus=t4:1

Interactive job selecting partition, memory per CPU, cores, walltime, and running Python interactively.

[ user@ui ~ ]$ srun --mpi=none --partition=ce --mem-per-cpu=4G -c 2 -t 06:00:00 --pty python3

==💡 From an interactive session, to avoid Slurm output variables becoming Slurm input variables for nested job submissions, run:== ==unset $(compgen -v | grep "^SLURM")==

Monitoring your work on the cluster

Jobs are scheduled according to their relative priority. The default command to list batch jobs is squeue. Slurm can also estimate when a pending job will start (START_TIME).

[ user@ui ~ ]$ sbatch -t 00:01:00 --wrap "sleep 30; echo hello world"
Submitted batch job 2

[ user@ui ~ ]$ squeue
JOBID PARTITION NAME USER  ST TIME NODES NODELIST(REASON)
2     main      wrap user R  0:02 1     wn061
  • The first column is the job id; ST is the job state and TIME the elapsed run time.
  • squeue does not show memory usage. For a finished job, sacct reports MaxRSS (the memory actually used, in kilobytes) and ReqMem (the requested memory), as shown in the sketch below.
  • In the sacct output there is one row per job step, i.e. one per 'srun' command used in the script; the first rows correspond to the batch script itself and are rather irrelevant.
  • If your application crashes and MaxRSS is close to ReqMem, it might help to increase the requested memory.
  • If ReqMem is much larger than MaxRSS, you should request less.
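
A minimal sketch of how to query this information: squeue --start shows the estimated START_TIME of pending jobs, and sacct reports ReqMem and MaxRSS for a finished job (job id 2 taken from the example above):

[ user@ui ~ ]$ squeue --start -u $USER
[ user@ui ~ ]$ sacct -j 2 --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,NodeList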

Job efficiency

The seff Perl utility is available to check a specific job's efficiency:

[ user@ui ~ ]$ seff 2
Job ID: 2
Cluster: sci
User/Group: <username>/<group>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:31 core-walltime
Job Wall-clock time: 00:00:31
Memory Utilized: 860.00 KB
Memory Efficiency: 0.04% of 2.00 GB (2.00 GB/core)

Job outputs

By default, the output file is created in the folder where the job was submitted and is named slurm-$SLURM_JOBID.out (stdout and stderr are combined into it). You can define your own output and error files with the following options:
  • -o: name of the output file (example: -o test.out)
  • -e: name of the error file (example: -e test.err)

sbatch allows the filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter (e.g. %j); an example follows the list:

  • \\: Do not process any of the replacement symbols.
  • %%: The character "%"
  • %A: Job array's master job allocation number.
  • %a: Job array ID (index) number.
  • %J: jobid.stepid of the running job. (e.g. "128.0")
  • %j: jobid of the running job.
  • %N: short hostname. This will create a separate IO file per node.
  • %n: Node identifier relative to current job (e.g. "0" is the first node of the running job) This will create a separate IO file per node.
  • %s: stepid of the running job.
  • %t: task identifier (rank) relative to current job. This will create a separate IO file per task.
  • %u: User name.
  • %x: Job name.
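
For example, a header like the following (a sketch, the job name is illustrative) writes per-job output and error files named after the job name and job id:

#SBATCH -J myjob
#SBATCH -o %x-%j.out        # expands to e.g. myjob-12345.out
#SBATCH -e %x-%j.err        # expands to e.g. myjob-12345.err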

Canceling jobs

In order to cancel a pending or running job you can execute the following command:

[ user@ui ~ ]$ scancel [jobid]

If you want to cancel all your jobs, you can use the -u $USER option. Example:

[ user@ui ~ ]$ scancel -u hpcnow
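
scancel also accepts filters that can be combined with -u; a minimal sketch (the job name and partition are illustrative):

[ user@ui ~ ]$ scancel -u $USER --state=PENDING   # only your pending jobs
[ user@ui ~ ]$ scancel -u $USER -n JobName        # only your jobs with this name
[ user@ui ~ ]$ scancel -u $USER -p ce             # only your jobs in partition ce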

Slurm scripts examples

Serial example

/nfs/admin/slurm/ops/slurm-examples/serial.sh

#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job script to run serial applications
#----------------------------------------------------------------- 

#SBATCH -J mysimplejob            # Job name
#SBATCH -o mysimplejob.%j.out     # Specify stdout output file (%j expands to jobId)
# No partition specification needed
#SBATCH -n 1                      # Total number of tasks
#SBATCH -t 01:30:00               # Run time (hh:mm:ss) - 1.5 hours
#SBATCH --mem=8G                  # Total memory requested - 8 GB RAM

# Load any necessary modules
# Loading modules in the script ensures a consistent environment.
ml hdf5-1.14.5/gcc-8.5.0-q24s5

# Launch the executable named "h5perf_serial"
h5perf_serial
# check the hostname
hostname

Shared memory (OpenMP) example

/nfs/admin/slurm/ops/slurm-examples/compileOpenMP.sh

#!/bin/bash
module load intel-oneapi-compilers-2024.2.0/gcc-8.5.0-st4zi
icx -qopenmp -o omphello ./omphello.c

#module load gcc-13.3.0/gcc-8.5.0-rt6fd
#gcc -fopenmp -o omphello ./omphello.c

/nfs/admin/slurm/ops/slurm-examples/OpenMP.sh

#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run OpenMP applications
#----------------------------------------------------
#SBATCH -J openmp_job          # Job name
#SBATCH -o openmp_job.o%j      # Name of stdout output file (%j expands to jobId)
#SBATCH -e openmp_job.o%j      # Name of stderr output file (%j expands to jobId)
#SBATCH -c 8                   # Cores per task requested (1 task job)
#SBATCH -t 00:10:00            # Run time (hh:mm:ss) - 10 min
#SBATCH --mem-per-cpu=3G       # Memory per core requested (24 GB in total: 3 GB * 8 cores)

# This example will run an OpenMP application using 8 threads
module load intel-oneapi-compilers-2024.2.0/gcc-8.5.0-st4zi

# Run the OpenMP application with as many threads as allocated cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omphello

Distributed memory (MPI) examples

/nfs/admin/slurm/ops/slurm-examples/compileMPI.sh

#!/bin/bash
module load openmpi-5.0.5/gcc-8.5.0-slurm-ogwvq
mpifort -o pi ./pi3f90.f90

/nfs/admin/slurm/ops/slurm-examples/MPI.sh

#!/bin/bash
#----------------------------------------------------
# Generic SLURM script -- MPI Hello World
#
# This script requests 2 nodes and 20 cores/node
# for a total of 40 MPI tasks.
#----------------------------------------------------
#SBATCH -J mpijob              # Job name
#SBATCH -o mpijob.%j.out       # stdout; %j expands to jobid
#SBATCH -e mpijob.%j.err       # stderr; omit to combine stdout and stderr
#SBATCH -N 2                   # Number of nodes, not cores (64 cores/node)
#SBATCH -n 40                  # Total number of MPI tasks (if omitted, n=N)
#SBATCH --ntasks-per-node=20   # MPI tasks per node
#SBATCH -t 00:30:00            # max time
#SBATCH --mem-per-cpu=1G       # memory per core requested

ml openmpi-5.0.5/gcc-8.5.0-slurm-ogwvq

srun ./pi                   # Do not add "-n" or "-np" options here. SLURM infers the
                            # process count from the "-N" and "-n" directives above.

Hybrid example (MPI+OpenMP)

/nfs/admin/slurm/ops/slurm-examples/compileMPIOpenMP.sh
/nfs/admin/slurm/ops/slurm-examples/help_fortran_find_core_id.c
/nfs/admin/slurm/ops/slurm-examples/hybrid.f90
/nfs/admin/slurm/ops/slurm-examples/hybrid.c

#!/bin/bash
# FORTRAN
# INTEL
#ml iimpi/2022b
#icx -c help_fortran_find_core_id.c
#mpiifort -qopenmp -o hybrid ./hybrid.f90 help_fortran_find_core_id.o
# GNU
#ml gompi/2022b
#gcc -c help_fortran_find_core_id.c
#mpif90 -fopenmp -ffree-line-length-256 -o hybrid ./hybrid.f90 help_fortran_find_core_id.o

# C
ml iimpi/2022b
mpiicc -cc=icx -qopenmp -Wimplicit-function-declaration -o hybrid ./hybrid.c
# GNU
#ml gompi/2022b
#mpicc -fopenmp -o hybrid ./hybrid.c

/nfs/admin/slurm/ops/slurm-examples/MPIOpenMP.sh

#!/bin/bash
#SBATCH -J MPIOpenMP -o %x-%J.out
#SBATCH -t 00:20:00
#SBATCH -n 8 --ntasks-per-node=4 -c 8
ml iimpi/2022b
#ml gompi/2022b
srun --cpu_bind=verbose ./hybrid

==💡 Use the --exclusive option to get a full node allocation.==

If we run with a full node allocation (sbatch --exclusive):

[ user@ui ~ ]$ srun --cpu_bind=verbose,mask_cpu:0x000000ff,0x0000ff00,0x00ff0000,0xff000000 ./hybrid

Job array example

/nfs/admin/slurm/ops/slurm-examples/JobArray.sh

#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job array script
#-----------------------------------------------------------------
#SBATCH -J jobarray # Job name
#SBATCH -o %x-%A-%a.out # Specify stdout output file (%A expands to array jobId, %a expands to array task id)
# SBATCH -a 1,6,16-32       # Alternative (disabled): explicit list of array indices
#SBATCH --array=0-15:4
#SBATCH -n 1  # Total number of tasks
#SBATCH -t 00:10:00 # Run time (hh:mm:ss) - 10 min

# Load any necessary modules
# Loading modules in the script ensures a consistent environment.
# module load ...

# run the task:
./task.sh
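
The script task.sh is not shown here; a minimal sketch of what it could look like, using $SLURM_ARRAY_TASK_ID to select a different input file per array element (my_app and the file naming are hypothetical):

#!/bin/bash
# Each array element processes its own input, selected by the array index
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat
OUTPUT=output_${SLURM_ARRAY_TASK_ID}.dat

./my_app $INPUT > $OUTPUT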

GPU job example - ==NO GPU==

/nfs/admin/slurm/ops/slurm-examples/compileCUDA.sh

#!/bin/bash
#cuda samples repo needed:
#git clone -b v12.1 https://github.com/NVIDIA/cuda-samples.git
#Compilation should be done on a gpu node:
#salloc -c 8 --gpus 1
ml CUDA/12.1.1 GCC
nvcc -o deviceQuery_cuda12 -I./cuda-samples/Common \
cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery.cpp    

/nfs/admin/slurm/ops/slurm-examples/CUDA.sh

#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run CUDA applications
#----------------------------------------------------
#SBATCH -J gpu_job                # Job name
#SBATCH -o gpu_job.o%j            # Name of stdout output file(%j expands to jobId)
#SBATCH -e gpu_job.o%j            # Name of stderr output file(%j expands to jobId)
#SBATCH -c 32 --mem-per-cpu=2G    # Cores per task requested (1 task job) and memory per core
#SBATCH --gpus=nvidia_h200:2      # Options for requesting 2 GPUs
#SBATCH -t 01:30:00               # Run time (hh:mm:ss) - 1.5 hours

# Run the CUDA application
module load CUDA/12.1.1
./deviceQuery_cuda12

Enroot + Pyxis example

How to use Enroot

Pull the image from docker hub

[ user@ui ~ ]$ enroot import docker://ubuntu

Create the container

[ user@ui ~ ]$ enroot create library+ubuntu+latest.sqsh

Start the container

[ user@ui ~ ]$ enroot start library+ubuntu+latest

If you need to run something as root inside the container, you can use the --root option.

[ user@ui ~ ]$ enroot start --root library+ubuntu+latest

List the existing containers

[ user@ui ~ ]$ enroot list -f

Remove a container

[ user@ui ~ ]$ enroot remove library+ubuntu+latest

How to use Pyxis

Run a command on a node

[ user@ui ~ ]$ srun -p main cat /etc/os-release
…
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"

Run the same command, but now inside of a container

[ user@ui ~ ]$ srun -p main --container-image=$(pwd)/library+ubuntu+latest.sqsh --container-name=ubuntu cat /etc/os-release
…
PRETTY_NAME="Ubuntu 24.04.3 LTS"
…

Mount a file from the host and run the command on it, from inside the container

[ user@ui ~ ]$ srun -p main --container-name=ubuntu --container-mounts=/etc/os-release:/etc/os-release cat /etc/os-release
…
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"

To see more options

[ user@ui ~ ]$ srun --help | grep container

Execute the sbatch script inside a container image, with a real application example: GROMACS.

#!/bin/bash
#SBATCH -p main -t 30:00
#SBATCH --container-mounts /var/spool/slurm,/nfs/home/<group>/<username>/slurm-sci/stmv:/host_pwd
#SBATCH --container-workdir=/host_pwd
# Use either a remote registry image or a locally imported .sqsh file (keep only one):
##SBATCH --container-image nvcr.io#hpc/gromacs:2021.3
#SBATCH --container-image  /nfs/home/<group>/<username>/slurm-sci/hpc+gromacs+2021.3.sqsh
#SBATCH --container-name hpc+gromacs+2021.3
export GMX_ENABLE_DIRECT_GPU_COMM=1
/usr/local/gromacs/avx2_256/bin/gmx mdrun -ntmpi 8 -ntomp 16 -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu -nsteps 100000 -resetstep 90000 -noconfout -dlb no -nstlist 300 -pin on -v -gpu_id 0123