SLURM SCI User Hands-on
Usage
Slurm is the software responsible for managing and allocating the cluster resources when you submit a job. You need to define the job requirements in order to submit a job. The most commonly used parameters are:
- -J JobName: Job name
- --time=DD-HH:MM:SS: The expected time the job will run for (walltime). Format: DD=days, HH=hours, MM=minutes, SS=seconds (default depends on the partition; 1 day for the batch partition).
- --mem-per-cpu=MMMM: Memory per CPU core (in MB). Default 2048 (DefMemPerCPU=2048)
- --ntasks=X: Number of MPI tasks. Default 1 task.
- --cpus-per-task=Y: Number of OpenMP threads. Default 1 thread.
- --nodes=Z: Number of nodes. Default 1 node.
- --gpus=[type:]<number>: Total GPUs for the job.
- --gpus-per-node=[type:]<number>: GPUs per allocated node.
- --gpus-per-socket=[type:]<number>: GPUs per socket.
- --gpus-per-task=[type:]<number>: GPUs per task.
IMPORTANT: the more accurate the job requirements, the more efficiently the cluster resources will be used.
You can define these requirements as options of the sbatch command or include them in the submit script header. Example:
#!/bin/bash
#SBATCH -J JobName
#SBATCH --time=08:00:00 # Walltime
#SBATCH --mem-per-cpu=4096 # memory/cpu (in MB)
#SBATCH --ntasks=2 # 2 tasks
#SBATCH --cpus-per-task=4 # number of cores per task
#SBATCH --nodes=1 # number of nodes
#SBATCH --gpus-per-node=nvidia_h200:2 # gpus h200 per node
Job states
| State | Abbreviation | Description |
|---|---|---|
| BOOT_FAIL | BF | Job terminated due to a launch or boot failure, typically caused by hardware issues (e.g., unable to boot a node or block and the job cannot be requeued). |
| CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| COMPLETED | CD | Job has terminated all processes on all nodes with an exit code of zero (successful completion). |
| COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active while SLURM cleans up resources. |
| CONFIGURING | CF | Job is being configured, typically waiting for nodes to be allocated before execution begins. |
| DEADLINE | DL | Job terminated upon reaching its deadline. |
| FAILED | F | Job terminated with a non-zero exit code or experienced another failure condition. |
| NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
| OUT_OF_MEMORY | OOM | Job was terminated after exceeding its allocated memory. |
| PENDING | PD | Job is awaiting resource allocation. |
| PREEMPTED | PR | Job was terminated (or suspended) due to preemption by another job with higher priority. |
| RUNNING | R | Job currently has an allocation and is executing. |
| RESV_DEL_HOLD | RD | Job is being held because the requested reservation was deleted. |
| REQUEUE_HOLD | RH | Held job is being requeued. |
| REQUEUED | RQ | Completing or failed job has been requeued for another execution attempt. |
| RESIZING | RS | Job is changing its size (e.g., adding or releasing nodes). |
| REVOKED | RV | Job was revoked, typically due to dependency problems or system maintenance. |
| SIGNALING | SI | SLURM is signaling the job (e.g., sending SIGTERM or SIGKILL). |
| SPECIAL_EXIT | SE | Job terminated with a special exit condition defined by the system. |
| STOPPED | ST | Job has an allocation, but execution has been stopped with a SIGSTOP signal. CPUs remain allocated to the job. |
| SUSPENDED | S | Job has an allocation but execution is suspended and CPUs are released for other jobs. |
| TIMEOUT | TO | Job terminated after reaching its time limit. |
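You will mostly see these abbreviations in the ST column of squeue; for finished jobs, sacct reports the full state name (assuming job accounting is enabled on the cluster):
[ user@ui ~ ]$ squeue -u $USER -t PENDING,RUNNING
[ user@ui ~ ]$ sacct -X -j [jobid] --format=JobID,JobName,State,ExitCode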
Default priority
Slurm gives each job a priority, and works to free up appropriate resources for the highest-priority job. At regular intervals, Slurm will recalculate the priorities of all jobs. The priority could be based on different factors, each one with different weight:
- Fairshare: your job will be given an initial score based on your share and your historical use of the cluster, with your recent use being given more weight.
- Partition priority: the partition priority is defined according to the job length, so the longer the job, the lower its priority.
- Job size: jobs with a higher ratio of (number of cores) / (walltime) get more priority.
- Wait time: a job gets more priority as it waits. This wait-time bonus gradually increases over a certain period of time.
The job's priority at any given time will be a weighted sum of all the factors that have been enabled in the slurm.conf file. Job priority can be expressed as:
- Job_priority = site_factor + (PriorityWeightAge) * (age_factor) + (PriorityWeightAssoc) * (assoc_factor) + (PriorityWeightFairshare) * (fair-share_factor) + (PriorityWeightJobSize) * (job_size_factor) + (PriorityWeightPartition) * (partition_factor) + (PriorityWeightQOS) * (QOS_factor) + SUM(TRES_weight_cpu * TRES_factor_cpu, TRES_weight_<type> * TRES_factor_<type>, ...) - nice_factor
Configured weights:
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 1000
PriorityWeightPartition = 10
PriorityWeightQOS = 1000
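You can see how these weights combine for your own pending jobs with the sprio command:
[ user@ui ~ ]$ sprio -w              # show the configured weights
[ user@ui ~ ]$ sprio -l -j [jobid]   # per-factor priority breakdown for one job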
Running a job
Most common Slurm commands
- sbatch: submits a job script.
- scancel: cancels a running or pending job.
- srun: runs a command across the allocated nodes.
- sbcast: transfers file(s) to the compute nodes allocated for the job (see the sketch after this list).
- sattach: connects stdin/stdout/stderr to an existing job or job step.
- squeue: displays the job queue.
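As a quick illustration of sbcast, the sketch below copies an input file to node-local /tmp on every allocated node before launching the application; the file name input.dat and the application my_app are hypothetical:
#!/bin/bash
#SBATCH -J bcast_example -N 2 -t 00:10:00
# Copy input.dat from the submit directory to /tmp on every allocated node
sbcast input.dat /tmp/input.dat
# Launch the application; each task reads its node-local copy
srun ./my_app /tmp/input.dat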
Commonly used Slurm variables
- $SLURM_JOBID (job id)
- $SLURM_JOB_NODELIST (nodes allocated for job)
- $SLURM_NNODES (number of nodes)
- $SLURM_SUBMIT_DIR (directory job was submitted from)
- $SLURM_ARRAY_JOB_ID (job id for the array)
- $SLURM_ARRAY_TASK_ID (job array index value)
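A minimal batch script that just prints these variables can be useful to see what Slurm sets for a given submission:
#!/bin/bash
#SBATCH -J showvars
#SBATCH -t 00:01:00
echo "Job id:     $SLURM_JOBID"
echo "Node list:  $SLURM_JOB_NODELIST"
echo "Num nodes:  $SLURM_NNODES"
echo "Submit dir: $SLURM_SUBMIT_DIR"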
Temporary folders
The following temporary folders are created for each job by the job_container/tmpfs plugin, which provides job-specific, private temporary file system space. Once the job completes, the content of these folders is removed. They are meant for I/O-intensive operations; to take advantage of these high-performance file systems you need to stage the required files in and out (see the staging example after the table below).
| File System | Job Path | Real Path |
|---|---|---|
| local disk | /tmp | /scratch/$SLURM_JOBID/.$SLURM_JOBID |
| local memory | /dev/shm | Mounted fresh for every job |
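A typical stage-in/stage-out pattern using the job-private /tmp might look like the sketch below; the input/output file names and the application my_app are only illustrative:
#!/bin/bash
#SBATCH -J staged_io -n 1 -t 01:00:00
# Stage in: copy the input from shared storage to the fast, job-private /tmp
cp $SLURM_SUBMIT_DIR/input.dat /tmp/
cd /tmp
# Run the application against the local copy
$SLURM_SUBMIT_DIR/my_app input.dat > output.dat
# Stage out: copy the results back before the job ends and /tmp is removed
cp /tmp/output.dat $SLURM_SUBMIT_DIR/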
Examples of batch jobs
Let's submit the first batch job with Slurm. We will do it using the sbatch command to interact with Slurm.
[ user@ui ~ ]$ sbatch --partition=ce -t 00:01:00 --wrap "sleep 30; echo hello world"
- The -t option stands for time and sets a limit on the total run time of the job allocation.
- If no time limit is defined, the maximum time limit of the default partition will be applied.
- The --wrap option makes Slurm turn the string that follows (in "") into a simple shell script.
Examples of interactive jobs
Interactive job selecting partition (NO MPI jobs).
[ user@ui ~ ]$ export SLURM_MPI_TYPE=none
[ user@ui ~ ]$ srun --mpi=none --partition=ce --pty bash
Interactive job selecting partition and node.
[ user@ui ~ ]$ salloc --partition=ce -w ce210
Interactive job selecting partition, memory per cpu, cores and walltime.
[ user@ui ~ ]$ salloc --partition=ce --mem-per-cpu=8G -c 4 -t 06:00:00
Interactive job selecting partition, GPU type and count, memory per CPU, cores and walltime.
[ user@ui ~ ]$ salloc --partition=ce --mem-per-cpu=16G -c 4 -t 06:00:00 --gpus=t4:1
Interactive job selecting partition, memory per CPU, cores, walltime, and running Python interactively.
[ user@ui ~ ]$ srun --mpi=none --partition=ce --mem-per-cpu=4G -c 2 -t 06:00:00 --pty /bin/bash -c 'python3'
==💡 From an interactive session, to avoid the Slurm output variables becoming Slurm input variables (for example, when launching nested jobs), run:== ==unset $(compgen -v | grep "^SLURM")==
Monitoring your work on the cluster
Jobs are scheduled according to their relative priority. The default command to list batch jobs is squeue. Slurm can also estimate when a pending job is going to be scheduled (START_TIME).
[ user@ui ~ ]$ sbatch -t 00:01:00 --wrap "sleep 30; echo hello world"
Submitted batch job 2
[ user@ui ~ ]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 main wrap user R 0:02 1 wn061
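The squeue output only shows scheduling information; to inspect how much memory a job actually used, you can query the accounting database. A minimal sketch, assuming job accounting (sacct) is enabled on the cluster and using the job id 2 from above:
[ user@ui ~ ]$ sacct -j 2 --format=JobID,JobName,MaxRSS,ReqMem,State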
- The first column is the job id.
- The MaxRSS column is the total amount of memory used by the job, in kilobytes.
- The ReqMem column is the requested amount of memory, in megabytes.
- The rows reflect the number of srun commands you used in the script. The first two rows are related to the script used to submit the job and are rather irrelevant.
- If your application crashes and MaxRSS is close to ReqMem, it might help to increase the requested memory.
- If ReqMem is much larger than MaxRSS, you should request less memory.
Job efficiency
The seff Perl utility is available to check a specific job's efficiency:
[ user@ui ~ ]$ seff 2
Job ID: 2
Cluster: sci
User/Group: <username>/<group>
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:31 core-walltime
Job Wall-clock time: 00:00:31
Memory Utilized: 860.00 KB
Memory Efficiency: 0.04% of 2.00 GB (2.00 GB/core)
Job outputs
By default, both stdout and stderr are written to a file named slurm-$SLURM_JOBID.out in the folder from which the job was submitted. You can define your own output files with the following options:
- -o : name of output file. (Example: -o test.out)
- -e : name of error file. (Example: -e test.err)
sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter (e.g. %j):
- \\: Do not process any of the replacement symbols.
- %%: The character "%"
- %A: Job array's master job allocation number.
- %a: Job array ID (index) number.
- %J: jobid.stepid of the running job. (e.g. "128.0")
- %j: jobid of the running job.
- %N: short hostname. This will create a separate IO file per node.
- %n: Node identifier relative to current job (e.g. "0" is the first node of the running job) This will create a separate IO file per node.
- %s: stepid of the running job.
- %t: task identifier (rank) relative to current job. This will create a separate IO file per task.
- %u: User name.
- %x: Job name.
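For example, to name the output and error files after the job name and the job id:
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err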
Canceling jobs
In order to cancel a pending or running job you can execute the following command:
[ user@ui ~ ]$ scancel [jobid]
If you want to cancel all your jobs, you can use -u $USER as an option. Example:
[ user@ui ~ ]$ scancel -u hpcnow
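scancel also accepts filters, so you can, for instance, cancel only your pending jobs, or a single task of a job array (the job id 123 below is only illustrative):
[ user@ui ~ ]$ scancel -u $USER --state=PENDING   # cancel only pending jobs
[ user@ui ~ ]$ scancel 123_4                      # cancel array task 4 of job 123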
Slurm scripts examples
Serial example
/nfs/admin/slurm/ops/slurm-examples/serial.sh
#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job script to run serial applications
#-----------------------------------------------------------------
#SBATCH -J mysimplejob # Job name
#SBATCH -o mysimplejob.%j.out # Specify stdout output file (%j expands to jobId)
# No partition specification needed
#SBATCH -n 1 # Total number of tasks
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
#SBATCH --mem=8G # Total memory demanded - 8 GB RAM
# Load any necessary modules
# Loading modules in the script ensures a consistent environment.
ml hdf5-1.14.5/gcc-8.5.0-q24s5
# Launch the executable named "h5perf_serial"
h5perf_serial
# check the hostname
hostname
Shared memory (OpenMP) example
/nfs/admin/slurm/ops/slurm-examples/compileOpenMP.sh
#!/bin/bash
module load intel-oneapi-compilers-2024.2.0/gcc-8.5.0-st4zi
icx -qopenmp -o omphello ./omphello.c
#module load gcc-13.3.0/gcc-8.5.0-rt6fd
#gcc -fopenmp -o omphello ./omphello.c
/nfs/admin/slurm/ops/slurm-examples/OpenMP.sh
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run OpenMP applications
#----------------------------------------------------
#SBATCH -J openmp_job # Job name
#SBATCH -o openmp_job.o%j # Name of stdout output file(%j expands to jobId)
#SBATCH -e openmp_job.o%j # Name of stderr output file(%j expands to jobId)
#SBATCH -c 8 # Cores per task requested (1 task job)
#SBATCH -t 00:10:00 # Run time (hh:mm:ss) - 10 min
#SBATCH --mem-per-cpu=3G # Memory per core requested (24 GB in total: 3 GB * 8 cores)
# This example will run an OpenMP application using 8 threads
module load intel-oneapi-compilers-2024.2.0/gcc-8.5.0-st4zi
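# Note (assumption, not part of the original example): OpenMP does not pick up
# the Slurm allocation automatically, so set the thread count from it explicitly
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK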
# Run the OpenMP application
./omphello
Distributed memory (MPI) examples
/nfs/admin/slurm/ops/slurm-examples/compileMPI.sh
#!/bin/bash
module load openmpi-5.0.5/gcc-8.5.0-slurm-ogwvq
mpifort -o pi ./pi3f90.f90
/nfs/admin/slurm/ops/slurm-examples/MPI.sh
#!/bin/bash
#----------------------------------------------------
# Generic SLURM script -- MPI Hello World
#
# This script requests 2 nodes and 20 cores/node
# for a total of 40 MPI tasks.
#----------------------------------------------------
#SBATCH -J mpijob # Job name
#SBATCH -o mpijob.%j.out # stdout; %j expands to jobid
#SBATCH -e mpijob.%j.err # stderr; omit this line to combine stdout and stderr
#SBATCH -N 2 # Number of nodes, not cores (64 cores/node)
#SBATCH -n 40 # Total number of MPI tasks (if omitted, n=N)
#SBATCH --ntasks-per-node=20 # MPI tasks per node
#SBATCH -t 00:30:00 # max time
#SBATCH --mem-per-cpu=1G # memory per core demanded
ml openmpi-5.0.5/gcc-8.5.0-slurm-ogwvq
srun ./pi # Do not add "-n" or "-np" options here. SLURM infers the
# process count from the "-N" and "-n" directives above.
Hybrid example (MPI+OpenMP)
/nfs/admin/slurm/ops/slurm-examples/compileMPIOpenMP.sh
/nfs/admin/slurm/ops/slurm-examples/help_fortran_find_core_id.c
/nfs/admin/slurm/ops/slurm-examples/hybrid.f90
/nfs/admin/slurm/ops/slurm-examples/hybrid.c
#!/bin/bash
# FORTRAN
# INTEL
#ml iimpi/2022b
#icx -c help_fortran_find_core_id.c
#mpiifort -qopenmp -o hybrid ./hybrid.f90 help_fortran_find_core_id.o
# GNU
#ml gompi/2022b
#gcc -c help_fortran_find_core_id.c
#mpif90 -fopenmp -ffree-line-length-256 -o hybrid ./hybrid.f90 help_fortran_find_core_id.o
# C
ml iimpi/2022b
mpiicc -cc=icx -qopenmp -Wimplicit-function-declaration -o hybrid ./hybrid.c
# GNU
#ml gompi/2022b
#mpicc -fopenmp -o hybrid ./hybrid.c
/nfs/admin/slurm/ops/slurm-examples/MPIOpenMP.sh
#!/bin/bash
#SBATCH -J MPIOpenMP -o %x-%J.out
#SBATCH -t 00:20:00
#SBATCH -n 8 --ntasks-per-node=4 -c 8
ml iimpi/2022b
#ml gompi/2022b
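# Note (assumption, not part of the original example): set the OpenMP threads
# per MPI task from the allocation (-c 8 above); srun places the MPI tasks
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK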
srun --cpu_bind=verbose ./hybrid
==💡 Use the --exclusive option to get full node allocation.==
If we run with full node allocation (sbatch --exclusive), the CPU masks can be given explicitly:
[ user@ui ~ ]$ srun --cpu_bind=verbose,mask_cpu:0x000000ff,0x0000ff00,0x00ff0000,0xff000000 ./hybrid
Job array example
/nfs/admin/slurm/ops/slurm-examples/JobArray.sh
#!/bin/bash
#-----------------------------------------------------------------
# Example SLURM job array script
#-----------------------------------------------------------------
#SBATCH -J jobarray # Job name
#SBATCH -o %x-%A-%a.out # Specify stdout output file (%A expands to array jobId, %a expands to array task id)
# SBATCH -a 1,6,16-32 # Alternative (disabled): explicit list and range of array indices
#SBATCH --array=0-15:4
#SBATCH -n 1 # Total number of tasks
#SBATCH -t 00:10:00 # Run time (hh:mm:ss) - 10 min
# Load any necessary modules
# Loading modules in the script ensures a consistent environment.
# module load ...
# run the task:
./task.sh
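The task.sh script itself is not shown; a minimal sketch of what it might look like, assuming each array task processes one input file selected by its index (the file names and my_app are hypothetical):
#!/bin/bash
# Pick this task's input based on the array index
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat
echo "Array job $SLURM_ARRAY_JOB_ID, task $SLURM_ARRAY_TASK_ID, input $INPUT"
./my_app "$INPUT" > result_${SLURM_ARRAY_TASK_ID}.out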
GPU job example - ==NO GPU==
/nfs/admin/slurm/ops/slurm-examples/compileCUDA.sh
#!/bin/bash
#cuda samples repo needed:
#git clone -b v12.1 https://github.com/NVIDIA/cuda-samples.git
#Compilation should be done on a gpu node:
#salloc -c 8 --gpus 1
ml CUDA/12.1.1 GCC
nvcc -o deviceQuery_cuda12 -I./cuda-samples/Common \
cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery.cpp
/nfs/admin/slurm/ops/slurm-examples/CUDA.sh
#!/bin/bash
#----------------------------------------------------
# Example SLURM job script to run CUDA applications
#----------------------------------------------------
#SBATCH -J gpu_job # Job name
#SBATCH -o gpu_job.o%j # Name of stdout output file(%j expands to jobId)
#SBATCH -e gpu_job.o%j # Name of stderr output file(%j expands to jobId)
#SBATCH -c 32 --mem-per-cpu=2G # Cores per task (1-task job) and memory per core
#SBATCH --gpus=nvidia_h200:2 # Options for requesting 2 GPUs
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
# Run the CUDA application
module load CUDA/12.1.1
./deviceQuery_cuda12
Enroot + Pyxis example
How to use Enroot
Pull the image from Docker Hub
[ user@ui ~ ]$ enroot import docker://ubuntu
Create the container
[ user@ui ~ ]$ enroot create library+ubuntu+latest.sqsh
Start the container
[ user@ui ~ ]$ enroot start library+ubuntu+latest
If you need to run something as root inside the container, you can use the --root option.
[ user@ui ~ ]$ enroot start --root library+ubuntu+latest
List the existing containers
[ user@ui ~ ]$ enroot list -f
Remove a container
[ user@ui ~ ]$ enroot remove library+ubuntu+latest
How to use Pyxis
Run a command on a node
[ user@ui ~ ]$ srun -p main cat /etc/os-release
…
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
Run the same command, but now inside of a container
[ user@ui ~ ]$ srun -p main --container-image=$(pwd)/library+ubuntu+latest.sqsh --container-name=ubuntu cat /etc/os-release
…
PRETTY_NAME="Ubuntu 24.04.3 LTS"
…
Mount a file from the host and run the command on it, from inside the container
[ user@ui ~ ]$ srun -p main --container-name=ubuntu --container-mounts=/etc/os-release:/etc/os-release cat /etc/os-release
…
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
To see more options
[ user@ui ~ ]$ srun --help | grep container
Execute an sbatch script inside a container image - a real application example with GROMACS
#!/bin/bash
#SBATCH -p main -t 30:00
#SBATCH --container-mounts /var/spool/slurm,/nfs/home/<group>/<username>/slurm-sci/stmv:/host_pwd
#SBATCH --container-workdir=/host_pwd
#SBATCH --container-image nvcr.io#hpc/gromacs:2021.3
#SBATCH --container-image /nfs/home/<group>/<username>/slurm-sci/hpc+gromacs+2021.3.sqsh
#SBATCH --container-name hpc+gromacs+2021.3
export GMX_ENABLE_DIRECT_GPU_COMM=1
/usr/local/gromacs/avx2_256/bin/gmx mdrun -ntmpi 8 -ntomp 16 -nb gpu -pme gpu -npme 1 -update gpu -bonded gpu -nsteps 100000 -resetstep 90000 -noconfout -dlb no -nstlist 300 -pin on -v -gpu_id 0123