Running MPI jobs on the cluster
User Guide
1. Overview
The cluster provides optimized OpenMPI installations for each node type. The correct
version is loaded automatically when you load the openmpi module on a worker node —
you do not need to worry about which build is used.
The system will attempt to select the best available transport automatically. On RDMA-capable nodes this results in UCX (RDMA), while on Ethernet-only nodes OpenMPI falls back to OB1 (TCP).
| Partition | Nodes | Network | MPI Transport |
|---|---|---|---|
| meteo_long | wn051-058 | 100GbE RoCE (RDMA) | UCX / mlx5 |
| geocean | geocean02-06 | 1GbE Ethernet | OB1 / TCP |
| geocean_priority | geocean01,geocean07-08 | 1GbE Ethernet | OB1 / TCP |
| gtfe_8 | wn061-64 | 1GbE Ethernet | OB1 / TCP |
| gtfe_20 | wn065-067 | 1GbE Ethernet | OB1 / TCP |
| citimac_i12 | citimac01-12 | 1GbE Ethernet | OB1 / TCP |
| citimac_i20 | citimac13-25 | 1GbE Ethernet | OB1 / TCP |
| citimac_i32 | citimac26-29 | 1GbE Ethernet | OB1 / TCP |
| apye_i512 | citimac30-31 | 1GbE Ethernet | OB1 / TCP |
RDMA-capable partitions provide significantly lower latency and higher bandwidth than the Ethernet-only partitions.
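Normally no action is needed, but if you want to pin the transport explicitly (for example, to rule out a silent fallback to TCP on an RDMA node), OpenMPI's standard MCA parameters can be exported in the job script. This is only a sketch of the generic OpenMPI mechanism; the automatic selection described above is usually what you want:
# Force the UCX PML on RDMA-capable partitions (the job aborts if UCX cannot be initialised)
export OMPI_MCA_pml=ucx
# Force the OB1 PML (TCP/shared-memory BTLs) on Ethernet-only partitions
export OMPI_MCA_pml=ob1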
2. Available Modules
Every worker node in the partitions above has its own architecture-optimized OpenMPI module available.
To see the available versions on any node/partition:
# Open an interactive session on a node
srun --partition=partition_name --pty bash
# List available OpenMPI modules
module avail openmpi
# Load OpenMPI (also loads UCX automatically)
module load openmpi-4.1.7
# Check what is loaded
module list
# Check the OpenMPI build details
ompi_info | grep -E "MCA pml|MCA btl|MCA osc|prefix"
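On RDMA-capable nodes you can additionally confirm that UCX was pulled in with the module. The ucx_info tool ships with UCX; this sketch assumes the module places it on your PATH:
# UCX library version and build configuration
ucx_info -v
# Transports UCX detects on this node (shared memory, TCP, and mlx5/RDMA on RoCE nodes)
ucx_info -d | grep -i transport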
From the login node (ui.sci.unican.es) you can see all the modules for every architecture by executing the following commands:
module use /nfs/software/sci/spack/modulefiles/*
module avail
The module naming convention is:
openmpi-<version>/<compiler>-<compiler_version>-ucx-slurm
For example: openmpi-4.1.7/gcc-13.3.0-ucx-slurm
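Once the module is loaded, the matching MPI compiler wrappers are on your PATH. A minimal sketch, assuming a C source file named my_mpi_application.c (use mpif90/mpifort for Fortran codes):
module load openmpi-4.1.7
# mpicc wraps the compiler the module was built with (gcc 13.3.0 in the example above)
mpicc -O2 -o my_mpi_application my_mpi_application.c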
3. Writing a Slurm Job Script
3.1 Basic MPI Job Template
#!/bin/bash
#SBATCH --job-name=my_mpi_job
#SBATCH --partition=partition_name # choose your partition
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks=2 # total MPI ranks
#SBATCH --ntasks-per-node=1 # ranks per node
#SBATCH --time=01:00:00 # wall time limit HH:MM:SS
#SBATCH --output=%j-%x.out # stdout: <jobid>-<jobname>.out
#SBATCH --error=%j-%x.err # stderr
# Load the architecture-optimized OpenMPI environment
source /etc/profile.d/modules.sh
module load openmpi-4.1.7
# Optional: show info about OpenMPI transport selection
# export OMPI_MCA_pml_base_verbose=20
# export OMPI_MCA_btl_base_verbose=20
# On RDMA-capable nodes/partitions
# export UCX_LOG_LEVEL=info
# Run your MPI application
srun ./my_mpi_application [arguments]
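The same template scales to larger runs by changing only the resource request. A sketch assuming 32 ranks spread over 2 nodes; adjust --ntasks-per-node to the actual core count of the nodes in your partition:
#SBATCH --nodes=2                  # number of nodes
#SBATCH --ntasks=32                # total MPI ranks
#SBATCH --ntasks-per-node=16       # ranks per node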
4. Example: OSU Micro-Benchmarks
The OSU Micro-Benchmarks are a standard suite for measuring MPI performance. They are available as a module and are useful both as a usage example and as a sanity check after loading OpenMPI.
4.1 Job Script: Point-to-Point Bandwidth
This script runs the osu_get_bw (one-sided get bandwidth) benchmark between two nodes
and saves the results with structured filenames for easy comparison across partitions.
#!/bin/bash
#SBATCH --job-name=osu_bw
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --output=%j-%x.out
source /etc/profile.d/modules.sh
module load openmpi-4.1.7
module load osu-micro-benchmarks-7.5
OSUBENCH="osu_get_bw"
OUTDIR="${SLURM_JOBID}-${SLURM_JOB_PARTITION}"
mkdir -p ${OUTDIR}
srun ${OSUBENCH} \
> ${OUTDIR}/openmpi4-${SLURM_JOB_PARTITION}-${SLURM_JOB_NODELIST}-${SLURM_JOBID}-${OSUBENCH}.txt \
2>&1
echo "Done. Results in ${OUTDIR}/"
4.2 Submit the job
sbatch --partition=partition_name run_osu.sh
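For example, to compare the RDMA partition against an Ethernet-only partition from the table in section 1:
sbatch --partition=meteo_long run_osu.sh
sbatch --partition=geocean run_osu.sh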
4.3 Available Benchmarks
Point-to-point:
osu_latency latency between 2 ranks (lower is better)
osu_bw send/receive bandwidth (higher is better)
osu_get_bw one-sided get bandwidth (RDMA-friendly)
Collective operations:
osu_allreduce MPI_Allreduce latency (key for scientific codes)
osu_barrier MPI_Barrier latency
osu_bcast MPI_Bcast latency
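Collective benchmarks are normally run with more than two ranks. A sketch of an osu_allreduce job, assuming four nodes are free in your partition (the rest of the script is identical to the one in section 4.1):
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=1
srun osu_allreduce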
5. Checking Job Results
5.1 Monitor a Running Job
# Check job status
squeue -u $USER
squeue -j <JOBID>
# Which nodes were allocated
squeue -j <JOBID> -o "%N"
# Detailed job information
scontrol show job <JOBID>
5.2 Read the Output
A successful osu_get_bw run looks like this:
# OSU MPI_Get Bandwidth Test v7.5
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 0.24
2 0.50
4 0.96
...
1048576 117.60
2097152 117.54
The benchmark completed correctly if you see a full table of message sizes and bandwidth values. An incomplete table or no output means the job failed before finishing.
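Because the script in section 4.1 writes one results file per run, a quick way to compare partitions is to print the last table row (the largest message size) of each collected file:
# Largest-message bandwidth for every collected result file
for f in */openmpi4-*-osu_get_bw.txt; do
    echo "$f: $(tail -n 1 "$f")"
done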
5.3 Check Which Transport Was Used
You can verify that OpenMPI selected the correct transport by adding verbosity to
your job script before srun:
export OMPI_MCA_pml_base_verbose=10
export OMPI_MCA_btl_base_verbose=10
In the output, look for:
# TCP nodes — expected:
selected ob1 best priority 20
select: component ob1 selected
# RDMA nodes — expected:
select: initializing pml component ucx
select: component ucx selected
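These messages go to the job's stdout/stderr files, so with the filenames from the template in section 3.1 you can check them after the run with a simple grep (replace <JOBID> with your job ID):
grep -iE "component (ucx|ob1) selected" <JOBID>-*.out <JOBID>-*.err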
5.4 Check Resource Usage After the Job
# Accounting summary (after job completes)
sacct -j <JOBID> --format=JobID,JobName,Partition,NodeList,Elapsed,CPUTime,State
# Efficiency report — CPU and memory utilisation
seff <JOBID>
seff tells you whether your job used the allocated CPUs and memory efficiently.
Low CPU efficiency may indicate your code is not scaling well across the requested ranks.
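A simple scaling check is to submit the same script with increasing rank counts and compare the elapsed times afterwards. A sketch, assuming one rank per node, that the template from section 3.1 is saved as my_mpi_job.sh, and that your partition has enough free nodes (command-line options override the #SBATCH directives in the script):
for n in 1 2 4 8; do
    sbatch --nodes=$n --ntasks=$n --ntasks-per-node=1 --job-name=scale_$n my_mpi_job.sh
done
# When the jobs have finished, compare elapsed times:
sacct --name=scale_1,scale_2,scale_4,scale_8 --format=JobName,NTasks,Elapsed,State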