LAMMPS - MPI issue with partitions for NEB

Hi!

Sorry to bother you. I’m trying to run NEB simulations with LAMMPS on a cluster through a batch script, using the following command:

mpirun -np 32 /path/to/lammps/build/lmp -partition 8x4 -in in.neb

but I keep getting the following error:

ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)

I am not sure what I am doing wrong. As far as I understand, -partition 8x4 asks for 8 partitions of 4 processors each, i.e. 32 in total, which matches -np 32. On my laptop, with a simple input file, the same partition command works. I feel like LAMMPS is passing the wrong configuration to MPI, or vice versa; I am not sure.

I saw a similar error on a forum, and the answer referred to a potential discrepancy between MPI versions, so I re-compiled LAMMPS over and over again just to make sure that wasn’t the issue.

I am just unsure what my options are. Any input would be much appreciated.

Thank you very much in advance!

That is not likely to help. If you do the same thing again, a computer will do the same thing again, and thus you will get the same error again.

Since you are running on a cluster, the problem is not likely happening during compilation, but during the run. For example, if you load your MPI library from an environment module, you have to make certain that the exact same environment module is loaded within your batch script.
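For example, a few diagnostic lines near the top of the batch script (the module name below is just a placeholder; use the one your LAMMPS build was compiled with) make it easy to verify what is actually in effect at run time:

module load mpi/openmpi        # placeholder: the module LAMMPS was compiled with
module list                    # confirm which modules are loaded inside the job
which mpirun                   # confirm which launcher will actually be used
mpirun --version               # confirm the runtime MPI version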

There is not enough information here to make a more specific assessment. It would be helpful to see the full output from your cluster execution and what the exact LAMMPS version is compared to the version you use on your laptop.

First of all, thanks a lot for your answer.

You’re right; I meant that I was re-compiling while changing one small thing at a time, to see what the effect of BUILD_MPI=yes is, whether including Python has an impact, etc. I don’t have a lot of experience with this stuff, so sometimes I just try things!

In terms of modules, I load the same exact environment module in my batch script (openmpi-4.0.5).

The run produces two outputs: an .err file and an .out file; the .err file reads:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[50048,1],13]
  Exit code:    1
--------------------------------------------------------------------------

and the .out file reads:

ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)

with the number of lines depending on the way I try to partition the cores. I would not compare my LAMMPS setup with the one on my laptop, because many things differ (including the potential I am using). I only mentioned it because my -partition command looks syntactically correct to me, since it works in other cases.

If I understand correctly, oversubscribing MPI (with --oversubscribe) could help in some cases, but I’m not sure that is the problem here, because when I try it, it still fails (see the command below). If you have any other advice, I’d be very grateful.
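For reference, the oversubscription test I tried was something like this (path shortened as above):

mpirun --oversubscribe -np 32 /path/to/lammps/build/lmp -partition 8x4 -in in.neb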

Thanks again for your help, I appreciate it very much!

The fact that you get the error message multiple times is an indication of a mismatch between the MPI library used for compilation and the one used at runtime (or at least that the mpirun command does not match).
There should also be LAMMPS log/output files (usually log.lammps* and multiple screen.* files), and those contain crucial information.

I think you’re correct, but I don’t really know where to start in solving it. Unfortunately, in this setting I don’t even see the log.lammps* and screen.* files, as the run crashes as soon as it reaches this command.

I’ll keep trying to debug this, and hopefully I’ll figure out something helpful.

If you compile the very latest patch release of LAMMPS (15 Sep 2022), there is now a -nonbuf (or -nb) command line flag, which will turn off buffering completely. This was added exactly for situations like yours, to retain output for calculations that error out early.

Thanks a lot for the suggestion. Unfortunately, it does not change much for me, and the run still crashes as soon as the -partition flag is processed, so I still get:

ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)

for a batch script like:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --time=36:00:00
#SBATCH -o %J.out
#SBATCH -e %J.err
#SBATCH -J NEB

module load mpi/openmpi-4.0.5

mpirun -np 32 /path/to/lammps/build/lmp -nonbuf -partition 8x4 -in in.neb

When you mention:

The fact that you get the error message multiple times, is an indication of a mismatch of the compilation and runtime MPI library (or at least the mpirun command does not match).

what would you suggest for solving, or even just debugging, this, considering that this is all the output I am getting?

Once again, thanks a lot for your time and for your help.

Please carefully check the documentation of your HPC environment. Since you seem to be using SLURM, using mpirun may not be correct. I would suggest you contact the user support for your cluster if there is not sufficient information.
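For example, on many SLURM clusters the intended launcher is srun rather than mpirun; a sketch (the exact invocation depends on how MPI and SLURM are configured on your cluster) would be:

srun -n 32 /path/to/lammps/build/lmp -partition 8x4 -in in.neb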

You should have similar problems for inputs without the partition flag. So a simple test to confirm would be to run the “melt” example input without the partition flag, but also with 32 MPI processes. That should at least run to completion and produce a log.lammps file.

Good suggestion.

I tried running the melt example with the same command, and it works well:

mpirun -np 32 /path/to/lammps/build/lmp -nonbuf -in in.melt
LAMMPS (15 Sep 2022)
LAMMPS (15 Sep 2022)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
LAMMPS (15 Sep 2022)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
  using 1 OpenMP thread(s) per MPI task
  using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962

[…]

Total # of neighbors = 151788
Ave neighs/atom = 37.947
Neighbor list builds = 12
Dangerous builds not checked
Total wall time: 0:00:06

But if I try the exact same command for NEB from the command line (avoiding SLURM):

mpirun -np 32 /path/to/lammps/build/lmp -nonbuf -partition 8x4 -in in.neb
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
ERROR: Processor partitions do not match number of allocated processors (src/lammps.cpp:451)
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[26733,1],31]
  Exit code:    1
--------------------------------------------------------------------------

Which is confusing to me.

No, it does not work well. There should be only one line with the LAMMPS version.
You also cut off too much of the log file; further down it should show that it is not using all 32 MPI processes for the run.

This confirms that the mpirun command you are using does not match the MPI library that you compiled the LAMMPS executable with, which is consistent with the partition error.

Not to me.
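A quick way to cross-check the mismatch (paths are placeholders) is to compare the MPI library the executable is linked against with the launcher you are invoking:

ldd /path/to/lammps/build/lmp | grep -i mpi   # MPI shared library the binary was linked against
which mpirun                                  # launcher currently first in your PATH
mpirun --version                              # MPI version that launcher belongs to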


Oh well!

Thanks a lot; this is a demonstration of my lack of experience and understanding. Sorry for taking up so much of your time!

I’ll try to find out which MPI library I used to compile the LAMMPS executable.

Information about the MPI library that LAMMPS was compiled with will be printed when you run: lmp -h.

For example with MPICH I get:

OS: Linux "Fedora Linux 36 (Thirty Six)" 6.0.5-200.fc36.x86_64 x86_64

Compiler: GNU C++ 12.2.1 20220819 (Red Hat 12.2.1-2) with OpenMP 4.5
C++ standard: C++14
MPI v3.1: MPICH Version:	3.4.3
MPICH Release date:	Thu Dec 16 11:20:57 CST 2021
MPICH ABI:	13:12:1

With the (dummy) MPI stub library in a serial compile I get:

OS: Linux "Fedora Linux 36 (Thirty Six)" 5.19.16-200.fc36.x86_64 x86_64

Compiler: Clang C++ Clang 14.0.5 (Fedora 14.0.5-1.fc36) with OpenMP 5.0
C++ standard: C++11
MPI v1.0: LAMMPS MPI STUBS for LAMMPS version 15 Sep 2022

You were right, LAMMPS was compiled using this MPI library:

OS: Linux "GridOS 18.04.6" 4.14.295-llgrid-10ms x86_64

Compiler: GNU C++ 7.5.0 with OpenMP 4.5
C++ standard: C++11
MPI v3.1: MPICH Version:	3.3.2
MPICH Release date:	Tue Nov 12 21:23:16 CST 2019
MPICH ABI:	13:8:1

whereas I load mpi/openmpi-4.0.5.

Thanks a lot for guiding me through this debugging. Since you’ve been so helpful, the very last question I have is: how do I compile LAMMPS with openmpi-4.0.5? I cannot seem to find an answer here that explains how to specify the MPI library, just the BUILD_MPI flag.

Thank you very much in advance!

If you load the OpenMPI module before you run CMake for the first time, and the OpenMPI module has been set up correctly, it should “just work™”. I regularly switch between MPI libraries and have never needed to set any extra flags; the only thing I do is use two different build folders. The CMake scripts in LAMMPS assume at least CMake 3.10 and use the standard FindMPI functionality, so everything written here applies: FindMPI — CMake 3.10.3 Documentation
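In practice the sequence is roughly the following (module name and core count are placeholders; run it from inside the LAMMPS source tree):

module load mpi/openmpi-4.0.5            # the MPI library you want to build against
mkdir build-openmpi && cd build-openmpi  # a separate build folder per MPI library
cmake -D BUILD_MPI=yes ../cmake          # FindMPI picks up whatever MPI is loaded
make -j 8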


Fantastic.

Thank you very much again, I really appreciate it!

Have a great day.