SLURM SCI User Hands-on
Usage
Slurm is the software responsible for managing and allocating the cluster resources when you submit a job. You need to define the job requirements in order to submit a job. The most commonly used parameters are:
- -J JobName: Job name
- --time=DD-HH:MM:SS: The expected time the job will run for (walltime). Format: DD=days, HH=hours, MM=minutes, SS=seconds (default depends on the partition; 1 day for the batch partition).
- --mem-per-cpu=MMMM: Memory per CPU core (in MB). Default 2048 (DefMemPerCPU=2048)
- --ntasks=X: Number of MPI tasks. Default 1 task.
- --cpus-per-task=Y: Number of OpenMP threads. Default 1 thread.
- --nodes=Z: Number of nodes. Default 1 node.
- --gpus=[type:]<number>: Total GPUs for the job.
- --gpus-per-node=[type:]<number>: GPUs per allocated node.
- --gpus-per-socket=[type:]<number>: GPUs per socket.
- --gpus-per-task=[type:]<number>: GPUs per task.
IMPORTANT: the more accurate the job requirements, the more efficiently the cluster resources will be used.
You can define these requirements as options of the sbatch command or include them in the submit script header. Example:
#!/bin/bash
#SBATCH -J JobName
#SBATCH --time=08:00:00 # Walltime
#SBATCH --mem-per-cpu=4096 # memory/cpu (in MB)
#SBATCH --ntasks=2 # 2 tasks
#SBATCH --cpus-per-task=4 # number of cores per task
#SBATCH --nodes=1 # number of nodes
#SBATCH --gpus-per-node=nvidia_h200:2 # gpus h200 per node
Job states
| State | Abbreviation | Description |
|---|---|---|
| BOOT_FAIL | BF | Job terminated due to a launch or boot failure, typically caused by hardware issues (e.g., unable to boot a node or block and the job cannot be requeued). |
| CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| COMPLETED | CD | Job has terminated all processes on all nodes with an exit code of zero (successful completion). |
| COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active while SLURM cleans up resources. |
| CONFIGURING | CF | Job is being configured, typically waiting for nodes to be allocated before execution begins. |
| DEADLINE | DL | Job terminated upon reaching its deadline. |
| FAILED | F | Job terminated with a non-zero exit code or experienced another failure condition. |
| NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
| OUT_OF_MEMORY | OOM | Job was terminated after exceeding its allocated memory. |
| PENDING | PD | Job is awaiting resource allocation. |
| PREEMPTED | PR | Job was terminated (or suspended) due to preemption by another job with higher priority. |
| RUNNING | R | Job currently has an allocation and is executing. |
| RESV_DEL_HOLD | RD | Job is being held because the requested reservation was deleted. |
| REQUEUE_HOLD | RH | Held job is being requeued. |
| REQUEUED | RQ | Completing or failed job has been requeued for another execution attempt. |
| RESIZING | RS | Job is changing its size (e.g., adding or releasing nodes). |
| REVOKED | RV | Job was revoked, typically due to dependency problems or system maintenance. |
| SIGNALING | SI | SLURM is signaling the job (e.g., sending SIGTERM or SIGKILL). |
| SPECIAL_EXIT | SE | Job terminated with a special exit condition defined by the system. |
| STOPPED | ST | Job has an allocation, but execution has been stopped with a SIGSTOP signal. CPUs remain allocated to the job. |
| SUSPENDED | S | Job has an allocation but execution is suspended and CPUs are released for other jobs. |
| TIMEOUT | TO | Job terminated after reaching its time limit. |
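You will mostly see these abbreviations in the ST column of squeue; for finished jobs, sacct reports the full state name (assuming job accounting is enabled on the cluster):
[ user@ui ~ ]$ squeue -u $USER -t PENDING,RUNNING
[ user@ui ~ ]$ sacct -X -j [jobid] --format=JobID,JobName,State,ExitCode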
Default priority
Slurm gives each job a priority, and works to free up appropriate resources for the highest-priority job. At regular intervals, Slurm will recalculate the priorities of all jobs. The priority could be based on different factors, each one with different weight:
- Fairshare: your job will be given an initial score based on your share and your historical use of the cluster, with your recent use being given more weight.
- Partition priority: the partition priority is defined according to the job length, so the longer the job, the lower its priority.
- Job size: jobs with a higher ratio of (number of cores) / (walltime) get more priority.
- Wait time: a job gets more priority as it waits. This wait-time bonus gradually increases over a certain period of time.
The job's priority at any given time will be a weighted sum of all the factors that have been enabled in the slurm.conf file. Job priority can be expressed as:
- Job_priority = site_factor + (PriorityWeightAge) * (age_factor) + (PriorityWeightAssoc) * (assoc_factor) + (PriorityWeightFairshare) * (fair-share_factor) + (PriorityWeightJobSize) * (job_size_factor) + (PriorityWeightPartition) * (partition_factor) + (PriorityWeightQOS) * (QOS_factor) + SUM(TRES_weight_cpu * TRES_factor_cpu, TRES_weight_<type> * TRES_factor_<type>, ...) - nice_factor
Configured weights:
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 1000
PriorityWeightPartition = 10
PriorityWeightQOS = 1000
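You can see how these weights combine for your own pending jobs with the sprio command:
[ user@ui ~ ]$ sprio -w              # show the configured weights
[ user@ui ~ ]$ sprio -l -j [jobid]   # per-factor priority breakdown for one job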
Running a job
Most common Slurm commands
- sbatch: submits a job script.
- scancel: cancels a running or pending job.
- srun: runs a command across the allocated nodes.
- sbcast: transfers file(s) to the compute nodes allocated for the job (see the sketch after this list).
- sattach: connects stdin/stdout/stderr to an existing job or job step.
- squeue: displays the job queue.
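As a quick illustration of sbcast, the sketch below copies an input file to node-local /tmp on every allocated node before launching the application; the file name input.dat and the application my_app are hypothetical:
#!/bin/bash
#SBATCH -J bcast_example -N 2 -t 00:10:00
# Copy input.dat from the submit directory to /tmp on every allocated node
sbcast input.dat /tmp/input.dat
# Launch the application; each task reads its node-local copy
srun ./my_app /tmp/input.dat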
Commonly used Slurm variables
- $SLURM_JOBID (job id)
- $SLURM_JOB_NODELIST (nodes allocated for job)
- $SLURM_NNODES (number of nodes)
- $SLURM_SUBMIT_DIR (directory job was submitted from)
- $SLURM_ARRAY_JOB_ID (job id for the array)
- $SLURM_ARRAY_TASK_ID (job array index value)
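A minimal batch script that just prints these variables can be useful to see what Slurm sets for a given submission:
#!/bin/bash
#SBATCH -J showvars
#SBATCH -t 00:01:00
echo "Job id:     $SLURM_JOBID"
echo "Node list:  $SLURM_JOB_NODELIST"
echo "Num nodes:  $SLURM_NNODES"
echo "Submit dir: $SLURM_SUBMIT_DIR"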
Temporary folders
The following temporary folders are created for each job by the job_container/tmpfs plugin, which provides job-specific, private temporary file system space. Once the job completes, the content of these folders is removed. They are meant for I/O-intensive operations; to take advantage of these high-performance file systems you need to stage the required files in and out (see the staging example after the table below).
| File System | Job Path | Real Path |
|---|---|---|
| local disk | /tmp | /scratch/$SLURM_JOBID/.$SLURM_JOBID |
| local memory | /dev/shm | Mounted fresh for every job |
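A typical stage-in/stage-out pattern using the job-private /tmp might look like the sketch below; the input/output file names and the application my_app are only illustrative:
#!/bin/bash
#SBATCH -J staged_io -n 1 -t 01:00:00
# Stage in: copy the input from shared storage to the fast, job-private /tmp
cp $SLURM_SUBMIT_DIR/input.dat /tmp/
cd /tmp
# Run the application against the local copy
$SLURM_SUBMIT_DIR/my_app input.dat > output.dat
# Stage out: copy the results back before the job ends and /tmp is removed
cp /tmp/output.dat $SLURM_SUBMIT_DIR/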
Examples of batch jobs
Let's submit the first batch job with Slurm. We will do it using the sbatch command to interact with Slurm.
[ user@ui ~ ]$ sbatch --partition=ce -t 00:01:00 --wrap "sleep 30; echo hello world"
- The -t option stands for time and sets a limit on the total run time of the job allocation.
- If no time limit is defined, the maximum time limit of the default partition will be applied.
- The --wrap option makes Slurm turn the string that follows (in "") into a simple shell script.
Examples of interactive jobs
Interactive job selecting partition (NO MPI jobs).
[ user@ui ~ ]$ export SLURM_MPI_TYPE=none
[ user@ui ~ ]$ srun --mpi=none --partition=ce --pty bash
Interactive job selecting partition and node.
[ user@ui ~ ]$ salloc --partition=ce -w ce210
Interactive job selecting partition, memory per cpu, cores and walltime.
[ user@ui ~ ]$ salloc --partition=ce --mem-per-cpu=8G -c 4 -t 06:00:00
Interactive job selecting partition, GPU type and count, memory per CPU, cores and walltime.
[ user@ui ~ ]$ salloc --partition=ce --mem-per-cpu=16G -c 4 -t 06:00:00 --gpus=t4:1
Interactive job selecting partition, memory per CPU, cores, walltime, and running Python interactively.
[ user@ui ~ ]$ srun --mpi=none --partition=ce --mem-per-cpu=4G -c 2 -t 06:00:00 --pty /bin/bash -c 'python3'
==💡 From an interactive session, to avoid the Slurm output variables becoming Slurm input variables (for example, when launching nested jobs), run:== ==unset $(compgen -v | grep "^SLURM")==
Monitoring your work on the cluster
Jobs are scheduled according to their relative priority. The default command to list batch jobs is squeue. Slurm can also estimate when a pending job is going to be scheduled (START_TIME).
[ user@ui ~ ]$ sbatch -t 00:01:00 --wrap "sleep 30; echo hello world"
Submitted batch job 2
[ user@ui ~ ]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 main wrap user R 0:02 1 wn061
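The squeue output only shows scheduling information; to inspect how much memory a job actually used, you can query the accounting database. A minimal sketch, assuming job accounting (sacct) is enabled on the cluster and using the job id 2 from above:
[ user@ui ~ ]$ sacct -j 2 --format=JobID,JobName,MaxRSS,ReqMem,State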
- The first column is the job id.
- The MaxRSS column is the total amount of memory used by the job, in kilobytes.
- The ReqMem column is the requested amount of memory, in megabytes.
- The rows reflect the number of srun commands you used in the script. The first two rows are related to the script used to submit the job and are rather irrelevant.
- If your application crashes and MaxRSS is close to ReqMem, it might help to increase the requested memory.
- If ReqMem is much larger than MaxRSS, you should request less memory.
Job efficiency
The seff Perl utility is available to check a specific job's efficiency:
[ user@ui ~ ]$ seff 2
Job ID: 2
Cluster: sci
User/Group: <username>/<group>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:31 core-walltime
Job Wall-clock time: 00:00:31
Memory Utilized: 860.00 KB
Memory Efficiency: 0.04% of 2.00 GB (2.00 GB/core)
Job outputs
By default, both stdout and stderr are written to a file named slurm-$SLURM_JOBID.out in the folder from which the job was submitted. You can define your own output files with the following options:
- -o : name of output file. (Example: -o test.out)
- -e : name of error file. (Example: -e test.err)
sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter (e.g. %j):
- \\: Do not process any of the replacement symbols.
- %%: The character "%"
- %A: Job array's master job allocation number.
- %a: Job array ID (index) number.
- %J: jobid.stepid of the running job. (e.g. "128.0")
- %j: jobid of the running job.
- %N: short hostname. This will create a separate IO file per node.
- %n: Node identifier relative to current job (e.g. "0" is the first node of the running job) This will create a separate IO file per node.
- %s: stepid of the running job.
- %t: task identifier (rank) relative to current job. This will create a separate IO file per task.
- %u: User name.
- %x: Job name.
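For example, to name the output and error files after the job name and the job id:
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err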
Canceling jobs
In order to cancel a pending or running job you can execute the following command:
[ user@ui ~ ]$ scancel [jobid]
If you want to cancel all your jobs, you can use -u $USER as an option. Example:
[ user@ui ~ ]$ scancel -u hpcnow
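scancel also accepts filters, so you can, for instance, cancel only your pending jobs, or a single task of a job array (the job id 123 below is only illustrative):
[ user@ui ~ ]$ scancel -u $USER --state=PENDING   # cancel only pending jobs
[ user@ui ~ ]$ scancel 123_4                      # cancel array task 4 of job 123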
Slurm scripts examples
Serial example
/nfs/admin/slurm/ops/slurm-examples/serial.sh
#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job script to run serial applications
#-----------------------------------------------------------------
#SBATCH -J mysimplejob # Job name
#SBATCH -o mysimplejob.%j.out # Specify stdout output file (%j expands to jobId)
# No partition specification needed
#SBATCH -n 1 # Total number of tasks
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
#SBATCH --mem=8G # Total memory demanded - 8 GB RAM
# Load any necessary modules
# Loading modules in the script ensures a consistent environment.
ml hdf5-1.14.5/gcc-8.5.0-q24s5
# Launch the executable named "h5perf_serial"
h5perf_serial
# check the hostname
hostname
Shared memory (OpenMP) example
/nfs/admin/slurm/ops/slurm-examples/compileOpenMP.sh
#!/bin/bash
module load intel-oneapi-compilers-2024.2.0/gcc-8.5.0-st4zi
icx -qopenmp -o omphello ./omphello.c
#module load gcc-13.3.0/gcc-8.5.0-rt6fd
#gcc -fopenmp -o omphello ./omphello.c
/nfs/admin/slurm/ops/slurm-examples/OpenMP.sh
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run OpenMP applications
#----------------------------------------------------
#SBATCH -J openmp_job # Job name
#SBATCH -o openmp_job.o%j # Name of stdout output file(%j expands to jobId)
#SBATCH -e openmp_job.o%j # Name of stderr output file(%j expands to jobId)
#SBATCH -c 8 # Cores per task requested (1 task job)
#SBATCH -t 00:10:00 # Run time (hh:mm:ss) - 10 min
#SBATCH --mem-per-cpu=3G # Memory per core requested (24 GB in total: 3 GB * 8 cores)
# This example will run an OpenMP application using 8 threads
module load intel-oneapi-compilers-2024.2.0/gcc-8.5.0-st4zi
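# Note (assumption, not part of the original example): OpenMP does not pick up
# the Slurm allocation automatically, so set the thread count from it explicitly
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK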
# Run the OpenMP application
./omphello
Distributed memory (MPI) examples
/nfs/admin/slurm/ops/slurm-examples/compileMPI.sh
#!/bin/bash
module load openmpi-5.0.5/gcc-8.5.0-slurm-ogwvq
mpifort -o pi ./pi3f90.f90
/nfs/admin/slurm/ops/slurm-examples/MPI.sh
#!/bin/bash
#----------------------------------------------------
# Generic SLURM script -- MPI Hello World
#
# This script requests 2 nodes and 20 cores/node
# for a total of 40 MPI tasks.
#----------------------------------------------------
#SBATCH -J mpijob # Job name
#SBATCH -o mpijob.%j.out # stdout; %j expands to jobid
#SBATCH -e mpijob.%j.err # stderr; omit this line to combine stdout and stderr
#SBATCH -N 2 # Number of nodes, not cores (64 cores/node)
#SBATCH -n 40 # Total number of MPI tasks (if omitted, n=N)
#SBATCH --ntasks-per-node=20 # MPI tasks per node
#SBATCH -t 00:30:00 # max time
#SBATCH --mem-per-cpu=1G # memory per core demanded
ml openmpi-5.0.5/gcc-8.5.0-slurm-ogwvq
srun ./pi # Do not add "-n" or "-np" options here. SLURM infers the
# process count from the "-N" and "-n" directives above.
Hybrid example (MPI+OpenMP)
/nfs/admin/slurm/ops/slurm-examples/compileMPIOpenMP.sh
/nfs/admin/slurm/ops/slurm-examples/help_fortran_find_core_id.c
/nfs/admin/slurm/ops/slurm-examples/hybrid.f90
/nfs/admin/slurm/ops/slurm-examples/hybrid.c
#!/bin/bash
# FORTRAN
# INTEL
#ml iimpi/2022b
#icx -c help_fortran_find_core_id.c
#mpiifort -qopenmp -o hybrid ./hybrid.f90 help_fortran_find_core_id.o
# GNU
#ml gompi/2022b
#gcc -c help_fortran_find_core_id.c
#mpif90 -fopenmp -ffree-line-length-256 -o hybrid ./hybrid.f90 help_fortran_find_core_id.o
# C
ml iimpi/2022b
mpiicc -cc=icx -qopenmp -Wimplicit-function-declaration -o hybrid ./hybrid.c
# GNU
#ml gompi/2022b
#mpicc -fopenmp -o hybrid ./hybrid.c
/nfs/admin/slurm/ops/slurm-examples/MPIOpenMP.sh
#!/bin/bash
#SBATCH -J MPIOpenMP -o %x-%J.out
#SBATCH -t 00:20:00
#SBATCH -n 8 --ntasks-per-node=4 -c 8
ml iimpi/2022b
#ml gompi/2022b
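# Note (assumption, not part of the original example): set the OpenMP threads
# per MPI task from the allocation (-c 8 above); srun places the MPI tasks
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK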
srun --cpu_bind=verbose ./hybrid
==💡 Use the --exclusive option to get full node allocation.==
If we run with full node allocation (sbatch --exclusive), the CPU masks can be given explicitly:
[ user@ui ~ ]$ srun --cpu_bind=verbose,mask_cpu:0x000000ff,0x0000ff00,0x00ff0000,0xff000000 ./hybrid
Job array example
/nfs/admin/slurm/ops/slurm-examples/JobArray.sh
#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job array script
#-----------------------------------------------------------------
#SBATCH -J jobarray # Job name
#SBATCH -o %x-%A-%a.out # Specify stdout output file (%A expands to array jobId, %a expands to array task id)
# SBATCH -a 1,6,16-32 # Alternative (disabled): explicit list and range of array indices
#SBATCH --array=0-15:4
#SBATCH -n 1 # Total number of tasks
#SBATCH -t 00:10:00 # Run time (hh:mm:ss) - 10 min
# Load any necessary modules
# Loading modules in the script ensures a consistent environment.
# module load ...
# run the task:
./task.sh
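The task.sh script itself is not shown; a minimal sketch of what it might look like, assuming each array task processes one input file selected by its index (the file names and my_app are hypothetical):
#!/bin/bash
# Pick this task's input based on the array index
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat
echo "Array job $SLURM_ARRAY_JOB_ID, task $SLURM_ARRAY_TASK_ID, input $INPUT"
./my_app "$INPUT" > result_${SLURM_ARRAY_TASK_ID}.out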
GPU job example - ==NO GPU==
/nfs/admin/slurm/ops/slurm-examples/compileCUDA.sh
#!/bin/bash
#cuda samples repo needed:
#git clone -b v12.1 https://github.com/NVIDIA/cuda-samples.git
#Compilation should be done on a gpu node:
#salloc -c 8 --gpus 1
ml CUDA/12.1.1 GCC
nvcc -o deviceQuery_cuda12 -I./cuda-samples/Common \
cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery.cpp
/nfs/admin/slurm/ops/slurm-examples/CUDA.sh
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run CUDA applications
#----------------------------------------------------
#SBATCH -J gpu_job # Job name
#SBATCH -o gpu_job.o%j # Name of stdout output file(%j expands to jobId)
#SBATCH -e gpu_job.o%j # Name of stderr output file(%j expands to jobId)
#SBATCH -c 32 --mem-per-cpu=2G # Cores per task (1-task job) and memory per core
#SBATCH --gpus=nvidia_h200:2 # Options for requesting 2 GPUs
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
# Run the CUDA application
module load CUDA/12.1.1
./deviceQuery_cuda12
Enroot + Pyxis example
How to use Enroot
Pull the image from Docker Hub
[ user@ui ~ ]$ enroot import docker://ubuntu
Create the container
[ user@ui ~ ]$ enroot create library+ubuntu+latest.sqsh
Start the container
[ user@ui ~ ]$ enroot start library+ubuntu+latest
If you need to run something as root inside the container, you can use the --root option.
[ user@ui ~ ]$ enroot start --root library+ubuntu+latest
List the existing containers
[ user@ui ~ ]$ enroot list -f
Remove a container
[ user@ui ~ ]$ enroot remove library+ubuntu+latest
How to use Pyxis
Run a command on a node
[ user@ui ~ ]$ srun -p main cat /etc/os-release
…
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
Run the same command, but now inside of a container
[ user@ui ~ ]$ srun -p main --container-image=$(pwd)/library+ubuntu+latest.sqsh --container-name=ubuntu cat /etc/os-release
…
PRETTY_NAME="Ubuntu 24.04.3 LTS"
…
Mount a file from the host and run the command on it, from inside the container
[ user@ui ~ ]$ srun -p main --container-name=ubuntu --container-mounts=/etc/os-release:/etc/os-release cat /etc/os-release
…
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
To see more options
[ user@ui ~ ]$ srun --help | grep container
Execute an sbatch script inside a container image - a real application example with GROMACS
#!/bin/bash
#SBATCH -p main -t 30:00
#SBATCH --container-mounts /var/spool/slurm,/nfs/home/<group>/<username>/slurm-sci/stmv:/host_pwd
#SBATCH --container-workdir=/host_pwd
#SBATCH --container-image nvcr.io#hpc/gromacs:2021.3
#SBATCH --container-image /nfs/home/<group>/<username>/slurm-sci/hpc+gromacs+2021.3.sqsh
#SBATCH --container-name hpc+gromacs+2021.3
export GMX_ENABLE_DIRECT_GPU_COMM=1
/usr/local/gromacs/avx2_256/bin/gmx mdrun -ntmpi 8 -ntomp 16 -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu -nsteps 100000 -resetstep 90000 -noconfout -dlb no -nstlist 300 -pin on -v -gpu_id 0123