Limit on MPI tasks launched by LAMMPS

Dear all,

I am facing a problem on a cluster using SLURM, on which I am only able to launch up to 31 tasks per mpirun command. If I try to launch >=32 MPI tasks I get the following error: sys.c:1560 UCX ERROR pthread_create() failed: Resource temporarily unavailable.

I am following the recommendations of the OpenMPI documentation and launching the jobs with mpirun instead of srun (10.7. Launching with Slurm — Open MPI 5.0.x documentation), and I use the --mca pml ucx flag to ensure UCX is being used.
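For context, the relevant part of my job script looks roughly like this (module names, the executable, and the input file are placeholders, not my actual setup):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32            # fails for >=32, works when set to <=31
#SBATCH --cpus-per-task=1
# module load ...              # site-specific environment setup (placeholder)
mpirun --mca pml ucx lmp -in in.lammps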

I checked that MPI is configured to be able to launch more than 31 tasks. I am attaching a file with the full output from the test (slurm.out (14.3 KB)). The same error persists when using srun instead of mpirun. The only difference is that srun also gives a warning before the program executes normally when using <32 tasks:

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      <redacted>
Framework: psec
Component: munge
-------------------------------------------------------------------------- 

I tried disabling btl/uct as suggested in the UCX documentation (Running UCX — OpenUCX documentation), to no avail.
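Concretely, the variant I tried was along these lines (a sketch, with the LAMMPS command line as a placeholder):

# keep the UCX PML but exclude the uct BTL component
mpirun --mca pml ucx --mca btl ^uct lmp -in in.lammps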

To understand this problem better, I created a simple sbatch script run_sbatch.sh (283 Bytes) that runs the equally simple exec_srun.sh (58 Bytes); a sketch of both scripts follows the error message below. By alternating between the srun and mpirun commands in run_sbatch.sh (and, correspondingly, between OMPI_COMM_WORLD_RANK and SLURM_PROCID in exec_srun.sh) I found that srun could handle the maximum number of tasks per node (104) without problems, whereas mpirun would give an error:

--------------------------------------------------------------------------
A request was made to bind that would require binding
processes to more cpus than are available in your allocation:

   Application:     exec_srun.sh
   #processes:      104
   Mapping policy:  BYCORE
   Binding policy:  CORE

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
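For completeness, here is a minimal sketch of what the two scripts do (the actual files are attached above, so details may differ):

#!/bin/bash
# run_sbatch.sh (sketch): request one full node, one task per logical core
#SBATCH --nodes=1
#SBATCH --ntasks=104
#SBATCH --cpus-per-task=1
mpirun ./exec_srun.sh          # alternated with: srun ./exec_srun.sh

#!/bin/bash
# exec_srun.sh (sketch): print the rank assigned to this process
echo "rank ${OMPI_COMM_WORLD_RANK}"    # with srun: ${SLURM_PROCID}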

My tests showed that with mpirun I could launch a maximum of 52 tasks, which probably means that OpenMPI does not count hyperthreaded cores? All of this is very confusing, as it seems to go against the OpenMPI documentation, which suggests using mpirun instead of srun with newer versions of OpenMPI. In my environment I have OpenMPI version 5.0.8 with UCX v1.20.0. (I had to build my own version of OpenMPI because the GNU compilers preinstalled on the cluster are v4.8.5 and do not support C++17.)

In any case, I am trying to understand whether this problem is specific to LAMMPS, since the simple test program (which requires no communication between tasks) runs without problems with srun. I am sorry if I am bothering you with an issue unrelated to LAMMPS; I decided to write here after lots of back and forth with the cluster admins that brought no solution.

Thank you in advance
Christos

It is not a LAMMPS problem.

All the errors are due to the MPI library.

You don’t need to do that. You can just set the OMPI_CC and OMPI_CXX environment variables to the path of the compilers you want to use and the OpenMPI wrappers will happily use those compilers. LAMMPS only requires the C API of MPI, and that is the same regardless of which compiler is used.
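For example (the compiler paths are placeholders):

# make the OpenMPI compiler wrappers use a different compiler
export OMPI_CC=/path/to/newer/gcc
export OMPI_CXX=/path/to/newer/g++
mpicc --version                # should now report the selected compiler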

Hyperthreading does not give you additional cores; in fact, it provides very little to no benefit for running with MPI. So if there is a restriction to physical cores (which the error message indicates), you are indeed oversubscribing your node, and OpenMPI is preventing you from doing that. If you want to do it anyway, you have to tell OpenMPI to ignore these settings, for example as sketched below.
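If you really want to oversubscribe, something along these lines should work (check the mpirun man page of your OpenMPI version for the exact syntax; the executable is a placeholder):

# allow more MPI processes than physical cores (not recommended)
mpirun --oversubscribe -np 104 lmp -in in.lammps
# or, as the error message suggests, allow overloading the core binding
mpirun --bind-to core:overload-allowed -np 104 lmp -in in.lammps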

But then again, what decent and self-respecting HPC manager leaves hyperthreading enabled on HPC compute nodes? I don’t know any, and we definitely don’t do this on any of our machines, except for the ones dedicated to interactive use, where we don’t care about people wasting time and resources.

I assume you mean during the LAMMPS installation? I have tried that, but CMake does not recognize the existence of MPI if the compilers used by it are not the same ones that were used to build OpenMPI. I tried forcing it to configure anyway by manually setting BUILD_MPI=yes, but then I get the following error during configuration:

Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
 CMake Error at /lustre/home/cpsevdos/miniforge3/envs/lammps/share/cmake-4.2/Modules/FindPackageHandleStandardArgs.cmake:290 (message):
   Could NOT find MPI (missing: MPI_CXX_FOUND CXX)
 Call Stack (most recent call first):
   /lustre/home/cpsevdos/miniforge3/envs/lammps/share/cmake-4.2/Modules/FindPackageHandleStandardArgs.cmake:654 (_FPHSA_FAILURE_MESSAGE)
   /lustre/home/cpsevdos/miniforge3/envs/lammps/share/cmake-4.2/Modules/FindMPI.cmake:2006 (find_package_handle_standard_args)
   CMakeLists.txt:391 (find_package)

 Configuring incomplete, errors occurred!

Or are you implying that I can compile LAMMPS with a specific version of OpenMPI and then run it using a different one? I thought that was completely out of the question but I will give it a shot.

Hyperthreading is enabled on all compute nodes of my cluster, which is why I can launch 104 tasks with srun on a node with 52 physical cores, even when explicitly setting --cpus-per-task=1. I have contacted the cluster admins and there is no intention to change that. I guess it is a question of politics: making it look like the cluster can handle more jobs at the same time. I am just curious why mpirun refuses to do the same, but that is another question.
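For reference, this is the kind of command that runs fine with srun despite there being only 52 physical cores (a sketch):

srun --ntasks=104 --cpus-per-task=1 ./exec_srun.sh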

In any case, thank you so much for taking the time to answer a non-LAMMPS question.

Then you are not doing it correctly.
Here is the transcript of a CMake session on my Fedora 43 desktop with OpenMPI loaded and both the GCC and Clang compilers installed. OpenMPI was configured with GCC, but I can use it with Clang just fine.

akohlmey@dudu:~/compile/lammps$ mpicxx -v
Using built-in specs.
COLLECT_GCC=/usr/bin/g++
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/15/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,m2,cobol,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugzilla.redhat.com/ --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-libstdcxx-backtrace --with-libstdcxx-zoneinfo=/usr/share/zoneinfo --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-15.2.1-build/gcc-15.2.1-20260123/obj-x86_64-redhat-linux/isl-install --enable-offload-targets=nvptx-none,amdgcn-amdhsa --enable-offload-defaulted --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux --with-build-config=bootstrap-lto --enable-link-serialization=1 --disable-libssp
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.2.1 20260123 (Red Hat 15.2.1-7) (GCC) 
akohlmey@dudu:~/compile/lammps$ env OMPI_CXX=clang++ mpicxx -v
clang version 21.1.8 (Fedora 21.1.8-4.fc43)
Target: x86_64-redhat-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Configuration file: /etc/clang/x86_64-redhat-linux-gnu-clang++.cfg
System configuration file directory: /etc/clang/
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/15
Selected GCC installation: /usr/bin/../lib/gcc/x86_64-redhat-linux/15
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found HIP installation: /usr, version 6.4.43484
akohlmey@dudu:~/compile/lammps$ rm -r build-clang/
akohlmey@dudu:~/compile/lammps$ env CC=clang CXX=clang++ OMPI_CXX=clang++ OMPI_CC=clang cmake -S cmake -B build-clang
CMake Deprecation Warning at CMakeLists.txt:22 (cmake_policy):
  The OLD behavior for policy CMP0109 will be removed from a future version
  of CMake.

  The cmake-policies(7) manual explains that the OLD behaviors of all
  policies are deprecated and that a policy should be set to OLD only under
  specific short-term circumstances.  Projects should be ported to the NEW
  behavior and not rely on setting a policy to OLD.


-- The CXX compiler identification is Clang 21.1.8
-- The C compiler identification is Clang 21.1.8
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/lib64/ccache/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/lib64/ccache/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Git: /usr/bin/git (found version "2.53.0")
-- Running check for auto-generated files from make-based build system
-- Found MPI_CXX: /usr/lib64/openmpi/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1") found components: CXX
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found OpenMP: TRUE (found version "5.1") found components: CXX
-- Found GZIP: /usr/bin/gzip
-- Generating style headers...
-- Generating package headers...
-- Generating lmpinstalledpkgs.h...
-- Found Python3: /usr/bin/python3.14 (found version "3.14.2") found components: Interpreter
-- The following tools and libraries have been found and configured:
 * Git
 * MPI
 * OpenMP
 * Python3

-- <<< Build configuration >>>
   LAMMPS Version:   2025.12.10.99 patch_10Dec2025-950-gd2a6e06f71
   Operating System: Linux Fedora 43
   CMake Version:    3.31.10
   Build type:       RelWithDebInfo
   Install path:     /home/akohlmey/.local
   Generator:        Unix Makefiles using /usr/bin/gmake
-- Enabled packages: <None>
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /usr/lib64/ccache/clang++
      Type:          Clang
      Version:       21.1.8
      C++ Standard:  17
      C++ Flags:     -O2 -g -DNDEBUG
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_GZIP
-- C compiler:       /usr/lib64/ccache/clang
      Type:          Clang
      Version:       21.1.8
      C Flags:       -O2 -g -DNDEBUG
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Static library flags:    
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     /usr/include/openmpi-x86_64
-- MPI libraries:    /usr/lib64/openmpi/lib/libmpi.so;
-- Configuring done (2.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/akohlmey/compile/lammps/build-clang

I am not, but it should indeed work as long as it is the same major OpenMPI version (and thus the same ABI) and it has been compiled as a shared library (I think this cannot be avoided anymore, but I have not needed to compile an MPI library from source for a looooong time).

If they cared about their users, they would not do it: while you can make it look as if the machine can handle more jobs, those jobs are actually quite a bit slower in most cases. Your admins have probably changed their OpenMPI runtime configuration to accept oversubscription (a sketch of what such a setting can look like is below), while your self-compiled version has the default settings. Or they have disabled the “bind-to-core” policy, which also lifts the restriction, but with a significant drop in performance because processes migrate between cores and thus invalidate their CPU caches all the time.

If you have nodes with many cores (was it 52?), I would even go so far as to use the Linux control groups feature (cgroups) to make only 50 of them available to users and restrict the remaining 2 to the OS and the root user, to reduce so-called OS jitter. If you want good parallel scaling on many nodes, OS jitter is a significant problem; this is why HPC supercomputers from Cray operate with a restricted custom Linux kernel that cannot launch processes of its own and create executables that cannot run on the frontend machines because they require a different, static C library. Or, back in the day, the IBM Blue Gene supercomputers had a physical timer device that would forcibly synchronize the Linux kernels to avoid OS jitter, i.e. individual processes being in kernel mode or running root daemons (often just waking them up to determine that there is nothing to do) while the other MPI processes have to wait for them. Once you have a sufficient number of processes, this can add substantial latency to any kind of MPI communication and thus put limits on strong parallel scaling.
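For illustration, the kind of site-wide setting I mean looks roughly like this; the exact parameter names depend on the OpenMPI version (in the 5.x series the mapping and binding options moved to PRRTE), so treat this as an assumption, not your cluster’s actual configuration:

# $PREFIX/etc/openmpi-mca-params.conf (OpenMPI 4.x style, assumed)
rmaps_base_oversubscribe = true        # accept more processes than allocated slots
hwloc_base_binding_policy = none       # do not bind processes to cores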

I am re-categorizing the discussion to Science Talk, where it would not be off-topic.

Ok, I think I resolved it. I was following your exact procedure, but MPI was still not being recognised. The OpenMPI installation on our cluster was configured with the GNU compilers v4.8.5, and I was trying to compile LAMMPS with GNU v15. As soon as I dropped to v9.3.1 (the minimum to support C++17) MPI was detected without problems.
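Concretely, the working configuration was along these lines (the compiler names/paths are placeholders for the GCC 9.3.1 toolchain I used):

env CC=gcc-9 CXX=g++-9 OMPI_CC=gcc-9 OMPI_CXX=g++-9 cmake -D BUILD_MPI=yes -S cmake -B build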

Sadly, that was not enough to solve my problems. I am still getting errors when trying to launch 25+ tasks per node (which should be perfectly possible given the number of physical cores per node). I also tested with a simple mpi_hello_world program and got similar problems, so this is now just between me and the admins :slight_smile: