Error while running NEB on a multinode cluster

Dear users,

I am facing a peculiar problem while running an NEB calculation on a cluster with several nodes and 24 cores per node. The run does not accept arbitrary combinations of the number of replicas and the number of cores per replica. On probing the issue further, I found that, for reasons unknown to me, the NEB run works only if the number of cores per image is greater than the total number of nodes allocated to the job. For instance, if I use a single node with 24 cores and partition it for 12 images as 12 by 2, it runs fine. However, if I try to run 24 images with one core per image, i.e., 24 by 1, the job terminates with an error file showing that the processes have aborted, and there is nothing in the screen output files or the log file. Similarly, if I allot 2 nodes with 48 cores in total, I can run with a partition of 16 by 3, but not with 24 by 2 or 48 by 1. It seems that the number of cores per image must exceed the number of nodes specified in the job submission script.
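To make the pattern explicit, here is a minimal Python sketch of the rule as I have observed it. It only encodes my empirical observation (cores per image greater than the number of nodes); it is not a documented LAMMPS constraint, and the function names are hypothetical, chosen for illustration.

```python
def cores_per_image(total_cores: int, images: int) -> int:
    """Return the number of cores each NEB image gets in an images-by-N partition."""
    if images < 1 or total_cores % images != 0:
        raise ValueError("total cores must be a positive multiple of the number of images")
    return total_cores // images


def observed_to_run(nodes: int, cores_per_node: int, images: int) -> bool:
    """Empirical rule from the runs above (an observation, not LAMMPS behaviour):
    the job ran only when cores per image exceeded the number of nodes."""
    return cores_per_image(nodes * cores_per_node, images) > nodes


# (nodes, cores per node, images, did the job run?) as reported in this post
reported_cases = [
    (1, 24, 12, True),   # 12 by 2 on one node: runs
    (1, 24, 24, False),  # 24 by 1 on one node: aborts
    (2, 24, 16, True),   # 16 by 3 on two nodes: runs
    (2, 24, 24, False),  # 24 by 2 on two nodes: aborts
    (2, 24, 48, False),  # 48 by 1 on two nodes: aborts
]

if __name__ == "__main__":
    for nodes, per_node, images, ran in reported_cases:
        predicted = observed_to_run(nodes, per_node, images)
        print(nodes, per_node, images, "observed:", ran, "rule:", predicted)
```

All five reported cases agree with this rule, which is why I suspect the number of nodes matters rather than the total core count.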

This is the first time I have encountered this issue. Earlier, I was able to run with whatever partition I wanted, but we recently switched to a newer LAMMPS version compiled with Intel oneAPI. Any help in this regard will be appreciated.

Best wishes,

Amlan Dutta

When reporting errors, especially unexpected ones, please always report exactly which version of LAMMPS you are using and which compiler version. This information is usually included in the first part of the output of lmp -h.

That indeed makes no sense, unless there is a problem with your basic system setup such that it cannot handle the system size and requires fewer atoms per processor.

Can you run a simulation of the system without NEB, using just one of the replicas?

Can you make your input deck available? Or, even better, check whether any of the examples provided with LAMMPS exhibit the same issue. I don’t see much of a chance to debug this without being able to reproduce the error exactly, quickly(!), and easily(!!).
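For a quick reproduction, something along these lines could generate the launch commands for a working and a failing partition. This is only a sketch: -partition MxN and -in are LAMMPS command-line switches, but the launcher name (mpirun), the executable name (lmp), and the input file name (in.neb.hop1, assumed here to be one of the bundled NEB examples) are assumptions that you would need to adapt to your installation and batch system.

```python
from typing import List


def neb_command(images: int, cores_per_image: int,
                input_file: str = "in.neb.hop1",  # assumed example input name
                lmp_exe: str = "lmp",             # assumed executable name
                launcher: str = "mpirun") -> List[str]:
    """Build the argument list for a multi-partition NEB run (images x cores_per_image)."""
    if images < 1 or cores_per_image < 1:
        raise ValueError("images and cores_per_image must be positive")
    total_ranks = images * cores_per_image
    return [launcher, "-np", str(total_ranks), lmp_exe,
            "-partition", f"{images}x{cores_per_image}",
            "-in", input_file]


if __name__ == "__main__":
    # One partition reported to work and one reported to fail on a single 24-core node
    for images, per_image in [(12, 2), (24, 1)]:
        print(" ".join(neb_command(images, per_image)))
```

Running both printed commands with the same example input, on the same node allocation, would show whether the failure depends on your input deck or only on the partition layout.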