Running MPI jobs on the cluster

User Guide


1. Overview

The cluster provides optimized OpenMPI installations for each node type. The correct version is loaded automatically when you load the openmpi module on a worker node — you do not need to worry about which build is used.

The system will attempt to select the best available transport automatically. On RDMA-capable nodes this results in UCX (RDMA), while on Ethernet-only nodes OpenMPI falls back to OB1 (TCP).

Partition         Nodes                   Network             MPI Transport
meteo_long        wn051-058               100GbE RoCE (RDMA)  UCX / mlx5
geocean           geocean02-06            1GbE Ethernet       OB1 / TCP
geocean_priority  geocean01,geocean07-08  1GbE Ethernet       OB1 / TCP
gtfe_8            wn061-064               1GbE Ethernet       OB1 / TCP
gtfe_20           wn065-067               1GbE Ethernet       OB1 / TCP
citimac_i12       citimac01-12            1GbE Ethernet       OB1 / TCP
citimac_i20       citimac13-25            1GbE Ethernet       OB1 / TCP
citimac_i32       citimac26-29            1GbE Ethernet       OB1 / TCP
apye_i512         citimac30-31            1GbE Ethernet       OB1 / TCP

The RDMA partition (meteo_long) provides significantly lower latency and higher bandwidth than the Ethernet-only partitions.
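You can check a node's RDMA capability yourself with ibv_devinfo from rdma-core, assuming the tool is installed on the node (it normally is on RDMA-capable ones):

# Inside an interactive session on the node (see section 2)
ibv_devinfo | grep -E "hca_id|link_layer"

RoCE nodes report an mlx5 device with link_layer: Ethernet; nodes without RDMA hardware typically print "No IB devices found".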


2. Available Modules

Every worker node in the above partitions has its own architecture-optimized OpenMPI module available.

To see the available versions on any node/partition:

# Open an interactive session on a node
srun --partition=partition_name --pty bash

# List available OpenMPI modules
module avail openmpi

# Load OpenMPI (also loads UCX automatically)
module load openmpi-4.1.7

# Check what is loaded
module list

# Check the OpenMPI build details
ompi_info | grep -E "MCA pml|MCA btl|MCA osc|prefix"
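On RDMA-capable nodes the module also pulls in UCX; if the UCX installation puts the ucx_info tool on your PATH, you can inspect it directly (a quick sanity check, not required for normal use):

# UCX version
ucx_info -v

# Transports UCX can use on this node
ucx_info -d | grep -i transport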

From the login node (ui.sci.unican.es) you can list the modules for every architecture by running:

module use /nfs/software/sci/spack/modulefiles/*
module avail

The module naming convention is:

openmpi-<version>/<compiler>-<compiler_version>-ucx-slurm

For example: openmpi-4.1.7/gcc-13.3.0-ucx-slurm
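If several compiler variants of the same version are installed, you can load one explicitly by its full name; the short form used in the examples above typically resolves to the module system's default variant:

module load openmpi-4.1.7/gcc-13.3.0-ucx-slurm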


3. Writing a Slurm Job Script

3.1 Basic MPI Job Template

#!/bin/bash
#SBATCH --job-name=my_mpi_job
#SBATCH --partition=partition_name    # choose your partition
#SBATCH --nodes=2                     # number of nodes
#SBATCH --ntasks=2                    # total MPI ranks
#SBATCH --ntasks-per-node=1           # ranks per node
#SBATCH --time=01:00:00               # wall time limit HH:MM:SS
#SBATCH --output=%j-%x.out            # stdout: <jobid>-<jobname>.out
#SBATCH --error=%j-%x.err             # stderr

# Load the architecture-optimized OpenMPI environment
source /etc/profile.d/modules.sh
module load openmpi-4.1.7

# Optional: show info about OpenMPI transport selection
# export OMPI_MCA_pml_base_verbose=20
# export OMPI_MCA_btl_base_verbose=20
# On RDMA-capable nodes/partitions
# export UCX_LOG_LEVEL=info

# Run your MPI application
srun ./my_mpi_application [arguments]
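If you do not yet have an application to run, a minimal MPI program is enough to test the template. The hello.c below is a hypothetical example; mpicc is the compiler wrapper provided by the loaded OpenMPI module:

cat > hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this rank's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    MPI_Get_processor_name(name, &len);     /* host the rank runs on */
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
EOF

# Compile on a node of the target partition so the
# architecture-optimized build is used
mpicc -O2 -o hello hello.c

Point the srun line in the template at ./hello; each rank should print its rank and host name once.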

4. Example: OSU Micro-Benchmarks

The OSU Micro-Benchmarks are a standard suite for measuring MPI performance. They are available as a module and are useful both as a usage example and as a sanity check after loading OpenMPI.

4.1 Job Script: Point-to-Point Bandwidth

This script runs the osu_get_bw (one-sided get bandwidth) benchmark between two nodes and saves the results with structured filenames for easy comparison across partitions.

#!/bin/bash
#SBATCH --job-name=osu_bw
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
#SBATCH --output=%j-%x.out

source /etc/profile.d/modules.sh
module load openmpi-4.1.7
module load osu-micro-benchmarks-7.5

OSUBENCH="osu_get_bw"
OUTDIR="${SLURM_JOBID}-${SLURM_JOB_PARTITION}"
mkdir -p ${OUTDIR}

srun ${OSUBENCH} \
  > ${OUTDIR}/openmpi4-${SLURM_JOB_PARTITION}-${SLURM_JOB_NODELIST}-${SLURM_JOBID}-${OSUBENCH}.txt \
  2>&1

echo "Done. Results in ${OUTDIR}/"

4.2 Submit the job

sbatch --partition=partition_name run_osu.sh
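For example, to compare an RDMA partition against an Ethernet-only one (partition names from the table in section 1):

sbatch --partition=meteo_long run_osu.sh
sbatch --partition=geocean run_osu.sh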

4.3 Available Benchmarks

Point-to-point:
  osu_latency      latency between 2 ranks (lower is better)
  osu_bw           send/receive bandwidth (higher is better)
  osu_get_bw       one-sided get bandwidth (RDMA-friendly)

Collective operations:
  osu_allreduce    MPI_Allreduce latency (key for scientific codes)
  osu_barrier      MPI_Barrier latency
  osu_bcast        MPI_Bcast latency
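Collective benchmarks are more meaningful with more than two ranks. A minimal sketch of the changes to the script in section 4.1 (everything else stays the same):

#SBATCH --nodes=4             # one rank per node
#SBATCH --ntasks=4

OSUBENCH="osu_allreduce"      # any name from the lists above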

5. Checking Job Results

5.1 Monitor a Running Job

# Check job status
squeue -u $USER
squeue -j <JOBID>

# Which nodes were allocated
squeue -j <JOBID> -o "%N"

# Detailed job information
scontrol show job <JOBID>

5.2 Read the Output

A successful osu_get_bw run looks like this:

# OSU MPI_Get Bandwidth Test v7.5
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       0.24
2                       0.50
4                       0.96
...
1048576               117.60
2097152               117.54

The benchmark completed correctly if you see a full table of message sizes and bandwidth values. An incomplete table or no output means the job failed before finishing.
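To compare runs at a glance, you can extract the large-message rows from the result files. A minimal sketch, assuming the directory and file names produced by the script in section 4.1:

# Print the >= 1 MB rows of every osu_get_bw result file
awk '!/^#/ && $1 >= 1048576 { print FILENAME ": " $1 " B  " $2 " MB/s" }' \
    */openmpi4-*-osu_get_bw.txt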

5.3 Check Which Transport Was Used

You can verify that OpenMPI selected the expected transport by adding these verbosity settings to your job script before the srun line:

export OMPI_MCA_pml_base_verbose=10
export OMPI_MCA_btl_base_verbose=10

In the output, look for:

# TCP nodes — expected:
selected ob1 best priority 20
select: component ob1 selected

# RDMA nodes — expected:
select: initializing pml component ucx
select: component ucx selected
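With --output and --error set as in the template of section 3.1, these messages typically land on stderr, so after the job finishes you can confirm the selection with:

grep -i select <JOBID>-*.err <JOBID>-*.out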

5.4 Check Resource Usage After the Job

# Accounting summary (after job completes)
sacct -j <JOBID> --format=JobID,JobName,Partition,NodeList,Elapsed,CPUTime,State

# Efficiency report — CPU and memory utilisation
seff <JOBID>

seff tells you whether your job used the allocated CPUs and memory efficiently. Low CPU efficiency may indicate your code is not scaling well across the requested ranks.
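A practical use of these tools is a small scaling sweep: submit the same script with increasing rank counts (options passed to sbatch override the #SBATCH lines in the script) and compare Elapsed from sacct and the efficiency reported by seff. A sketch, assuming the template from section 3.1 is saved as my_mpi_job.sh:

for n in 2 4 8; do
    sbatch --nodes=${n} --ntasks=${n} --ntasks-per-node=1 \
           --job-name=scale_${n} my_mpi_job.sh
done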