
On this page

  • What is MPI?
  • Determine if your software supports MPI
  • MPI implementations
  • General compiling instructions
  • General usage instructions
    • Shaping your job
    • Slurm’s srun
    • OpenMPI mpirun or mpiexec
    • MPICH mpiexec
    • Intel mpiexec
  • Other variables

Use MPI for multi-node job execution

The following guide provides an overview of the Message Passing Interface (MPI) and how to use it to run jobs across multiple nodes. See also our discussion of how MPI jobs are mapped to resources in this Overview.

What is MPI?

The Message Passing Interface (MPI) defines a standard way to pass data between processes, both within a node and across nodes. Bindings exist for almost every language, but MPI is most frequently used from C/C++ and Fortran, and to a lesser extent from Python via mpi4py.

Determine if your software supports MPI

When using existing software, review its documentation to see if it supports MPI. If it doesn’t mention MPI explicitly, it probably doesn’t support it (with few exceptions).

If you are compiling software yourself, look through the output of ./configure --help or cmake -LAH to find MPI-related settings. See the Software Installation Guide for more information on how to build software from scratch.
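
For example, a quick way to scan a build system for MPI switches (a minimal sketch, assuming an autotools- or CMake-based project):

# Autotools-based project: list configure options that mention MPI
./configure --help | grep -i mpi

# CMake-based project: list cached variables that mention MPI
cmake -LAH . | grep -i mpi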

If you are running ML/AI jobs with PyTorch, you can use torchrun to do a distributed launch, but make sure your code uses torch.distributed appropriately (see PyTorch Distributed Overview for more information).
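
As a hedged sketch (the script name train.py, the GPU counts, and the rendezvous port are placeholders; adjust them to your code and allocation), a multi-node torchrun launch under Slurm might look like:

#SBATCH --nodes=2 --ntasks-per-node=1 --gpus-per-node=4 --cpus-per-task=8
# Use the first node in the allocation as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# srun starts one torchrun per node; torchrun spawns one worker per GPU
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node}:29500" \
    train.py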

MPI implementations

Since MPI only defines the interface for passing messages, there are many possible implementations. Among these, there are a few major implementations that are widely used:

  • OpenMPI - The most common implementation with an open source license (BSD-3)
  • MPICH - Another widely used implementation with a permissive license - many other MPI stacks derive from this
  • Intel MPI - Shipped with the Intel Compiler (now branded with Intel oneAPI) - also derived from MPICH
  • MVAPICH - A family of MPICH-derived implementations tuned for specific high-performance interconnects

In theory, any MPI program should build against any of these implementations. However, once built against a given implementation, the software must run with that same implementation's support libraries; for example, a program compiled against OpenMPI cannot run using Intel's MPI libraries. A mismatched program generally crashes or exits quickly with an error.
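
If you are unsure which MPI stack a binary was built against, one rough check (a sketch, assuming a dynamically linked yourprog) is to inspect its shared-library dependencies:

# Show which MPI, UCX, or libfabric libraries the binary resolves to
ldd ./yourprog | grep -Ei 'mpi|ucx|fabric'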

General compiling instructions

When compiling an MPI program, you may need to specify a different name for the compiler, such as:

Language | GNU Compiler | Intel Compiler | GNU MPI | Intel MPI
C | cc/gcc | icc | mpicc | mpiicc
C++ | c++/g++ | icpc/icx | mpic++ | mpiicpc/mpicxx
Fortran | gfortran | ifort/ifx | mpif77/mpif90/mpifort | mpiifort/mpifc

The mechanism for selecting the compiler varies by software.
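As a hedged sketch (the module names gcc and openmpi are examples; check module avail on Unity for the exact names and versions), pointing a build at the MPI compiler wrappers usually looks like one of the following:

module load gcc openmpi                          # example modules; adjust as needed

# Direct compile of a single MPI source file
mpicc -O2 -o yourprog yourprog.c

# Autotools-style build
./configure CC=mpicc CXX=mpic++ FC=mpifort

# CMake-style build
cmake -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpic++ ..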

General usage instructions

The configuration of an MPI stack varies significantly based on the MPI library's compile-time options, so there isn't a one-size-fits-all set of usage instructions. However, the following sections show how to start an MPI program in some common scenarios.

Shaping your job

When choosing the shape of your job in terms of nodes, cores, tasks, GPUs, and so on, many sbatch options look similar but differ in how resources are allocated and how the job runs, and their interactions are complex. Here are some common configurations.

Tip: No matter which options you choose, verify that your job is scaling the way you intend by checking its efficiency or monitoring it during a run. See Monitor a batch job for details.

Single node, single process, single core

This is actually the default configuration if no other resources are requested. This is most common for non-MPI applications.

#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=1
yourprog

Single node, single process, multiple cores

This configuration works for programs that use threads, including OpenMP (not to be confused with OpenMPI; OpenMP creates lightweight threads within a single process).

#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=8
# If the job supports OpenMP, set this:
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
yourprog

You can also request an entire node and scale your job to the size of the node:

#SBATCH --nodes=1 --ntasks=1 --exclusive --mem=250g
# If the job supports OpenMP but runs as a single process, set this:
export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
# See if your program has a way to specify threads:
yourprog --threads $SLURM_CPUS_ON_NODE

Despite the name, SLURM_CPUS_ON_NODE is set to the number of cores your job is allocated on the node, not the number of physical cores the node has.
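
A quick way to confirm this (a minimal sketch) is to print the variable from inside an allocation:

#SBATCH --nodes=1 --ntasks=1 --cpus-per-task=8
# Prints 8 here, not the node's full physical core count
echo "Cores allocated on this node: $SLURM_CPUS_ON_NODE"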

Note that while --mem=0 gives the job all of the memory on a node, it doesn't let you specify a minimum, so it isn't recommended. See the node list, CPU summary, and GPU summary pages for memory-per-node and memory-per-core information.

Single node, multiple processes

Programs that support MPI usually run as multiple processes. Some also support a "hybrid" mode, meaning they can run multiple processes with multiple threads per process. See your software's documentation for how to find the correct balance.

#SBATCH --nodes=1 --ntasks=10 --exclusive --mem=250g
# If the application supports OpenMP, use 1 thread only,
# unless also using --cpus-per-task, in which case use
# OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=1
mpirun yourprog

See below for guidance on srun, mpirun and mpiexec, but note that you must use one of these to launch multiple processes on the same node.

Multiple nodes, multiple processes

For best performance with well-scaling code, it is usually best to request complete nodes with --exclusive. See the node list, CPU summary, and GPU summary pages for cores-per-node information and the partitions the nodes reside in.

#SBATCH --nodes=4 --ntasks-per-node=64 --exclusive --mem-per-cpu=2g
#SBATCH --constraint=ib # Use this for "tightly coupled" jobs
export OMP_NUM_THREADS=1
mpirun yourprog

See below for guidance on srun, mpirun, and mpiexec, but note that you must use one of these to launch across multiple nodes.

Unspecified nodes, multiple processes

While it's possible to specify the number of tasks without saying how many nodes to spread them across, doing so can lead to high communication overhead, so it isn't recommended for "tightly coupled" jobs such as most simulations. Such jobs may also benefit from --constraint=ib for the low-latency network; the mpi constraint by itself doesn't imply this. The two can be combined with --constraint=mpi&ib.

#SBATCH --ntasks=256 --mem-per-cpu=2g --constraint=mpi
export OMP_NUM_THREADS=1
mpirun yourprog

Key takeaways

  • --ntasks (and its variants) controls how many processes are created. Prefer also specifying --nodes with this option.
  • --nodes controls the maximum number of nodes allocated to the job. If --ntasks isn’t specified, then --ntasks-per-node=1 is assumed.
  • --cpus-per-task (-c) maps to threads, not processes. For most MPI jobs this should be 1, but some software has "hybrid" modes that support more than one thread per task (either a small fixed number like 2-4, or one thread per core of a socket; there is no SLURM_* variable for the cores-per-socket count, so you'll have to target a particular CPU type, see Node Features). A combined hybrid sketch appears below.

More coverage of these options is available in the sbatch manpage.
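
Putting these options together, a hedged sketch of a hybrid MPI + OpenMP batch script (the node count, tasks per node, and threads per task are illustrative only):

#SBATCH --nodes=2 --ntasks-per-node=16 --cpus-per-task=4 --mem-per-cpu=2g
#SBATCH --constraint=ib          # helpful for tightly coupled multi-node jobs
# One OpenMP thread team per MPI task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun yourprog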

Slurm’s srun

Slurm provides the srun command to start an MPI program (it can also run non-MPI programs if you want extra accounting via sacct). Generally, srun doesn’t require any extra parameters or environment variables to run; however, be aware that it does inherit its information via the $SLURM_* variables that sbatch sets. If you use #SBATCH --export=NONE in particular, then you may need to do srun --export=ALL if your program relies on environment variables being set.

Option | Description
--label | Prefix each output line with the rank it came from
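
A minimal sketch, assuming yourprog is an MPI binary in the job's working directory:

# Launch one MPI rank per Slurm task, labelling each output line with its rank
srun --label ./yourprog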

OpenMPI mpirun or mpiexec

If you are using a Slurm-aware OpenMPI (run orte-info | grep "plm: slurm" to find out) and mpirun is available (some installs insist on srun), then running it directly without any parameters should suffice. The more standard mpiexec is also available.

OpenMPI has many parameters that affect how it runs. See their Fine Tuning Guide for all the details.

Option | Description
--tag-output | Prefix each line with the job and rank it came from
--timestamp-output | Prefix each line with the current timestamp
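
A minimal sketch using these options with a Slurm-aware OpenMPI:

# Prefix each output line with its job/rank and a timestamp
mpirun --tag-output --timestamp-output ./yourprog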

MPICH mpiexec

mpiexec should not require any parameters.

Option | Description
-prepend-rank | Prefix each line with the rank it came from

Intel mpiexec

mpiexec should not require any parameters.

Option | Description
-prepend-rank | Prefix each line with the rank it came from

It’s also possible to use srun with Intel MPI; in that case, you should set the following:

export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/libpmi2.so
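
With that variable set, a minimal sketch of the launch step looks like:

# Point Intel MPI at Slurm's PMI2 library, then let srun start the ranks
export I_MPI_PMI_LIBRARY=/usr/lib/x86_64-linux-gnu/libpmi2.so
srun ./yourprog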

Other variables

Some MPI implementations allow you to use environment variables to control and customize their behavior. Generally, if these MPI variables aren’t explicitly set, the libraries will auto-detect the environment. However, there are some edge cases where manually setting these environment variables may be beneficial.

Libfabric based (fi_info)

Variable | Value | Meaning
FI_PROVIDER | tcp | Use regular (TCP) communication between nodes; should always work
FI_PROVIDER | shm | Only use shared memory; only works with --nodes=1
FI_PROVIDER | verbs | Use the InfiniBand low-latency network; only works with --constraint=ib
FI_PROVIDER | shm,verbs | Separate multiple providers with a comma
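
A hedged example of pinning the provider in a batch script (only sensible on an InfiniBand allocation):

#SBATCH --constraint=ib
# Restrict libfabric to shared memory within a node and verbs between nodes
export FI_PROVIDER=shm,verbs
mpirun ./yourprog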

UCX based (ucx_info)

Variable | Value | Meaning
UCX_TLS | tcp | Use regular (TCP) communication between nodes; should always work
UCX_TLS | shm | Only use shared memory; only works with --nodes=1
UCX_TLS | cuda | Use CUDA support; only works with a GPU allocation (e.g. --gpus=<X>)
UCX_TLS | rc_x | Use the InfiniBand low-latency network; only works with --constraint=ib
UCX_TLS | shm,rc_x | Separate multiple transports with a comma
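
Similarly, a hedged sketch for a GPU-aware run over InfiniBand (the GPU count is illustrative):

#SBATCH --gpus=2 --constraint=ib
# Allow shared-memory, InfiniBand, and CUDA transports
export UCX_TLS=shm,rc_x,cuda
mpirun ./yourprog
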
Last modified: Thursday, September 25, 2025 at 4:13 PM. See the commit on GitLab.