LAMMPS issue with different MPI libraries!

Hi everybody,

Following my inquiry back on Feb 14, 2016 (), I have encountered more weird behavior from LAMMPS, or rather from LAMMPS plus the MPI library. After trying to resolve the non-reproducibility I described in the above thread, I concluded that with LAMMPS (GPU) + openmpi-1.10.2 + cuda-6.5, all compiled with intel-14.0.2, I don't face the aforementioned issues on the cluster. However, as I started the production phase, I'm seeing the following behavior:

- For this size of problem, which as I mentioned is a rather large simulation (~300000+ atoms and a 10 nm colloidal particle; an image of the system is available in the above thread), after a few thousand time steps the simulation is terminated with:
  ERROR on proc <proc #>: Out of range atoms - cannot compute PPPM (…/pppm.cpp:1918)
  The occurrence of "out of range atoms" is suspicious and shouldn't be a numerical or physical issue, because I'm continuing from an already equilibrated system that is about 50 ns into its life (I'm doing ABF simulations on the nanoparticle), so it cannot be bad dynamics. Also, since I use "neigh_modify delay 0 every 1 check yes one 100000 page 1000000" as the re-neighboring criterion (the one and page values are set to accommodate the large colloid neighbor list), no atom should be missed when the neighbor lists are rebuilt.
- The "Out of range atoms" error doesn't always happen at a specific time step; when it happens varies with the number of processors chosen to run the simulation (though given the stochastic nature of the simulation, I presume changing the number of processes could cause that by itself?).
- For example, using 4 nodes on the cluster, each node equipped with 2 sockets of 10-core Xeon E5-2680 v2 (I have deactivated "fix colvars" in these runs to reduce the complexity of the problem):
  - with 80 MPI ranks and no OpenMP, the error happens after 350 time steps (see attached file "no-omp.out");
  - with 16 MPI ranks and 5 OpenMP threads per rank, the error happens after 900 time steps (see attached file "ppr2-socket-pe5_5omp.out").

I earnestly ask for any pointer or help that can shed light on resolving this issue. I can provide the input deck and the data file if you think they are required for further investigation. The problem is that I see all of these behaviors only with this large simulation, with its large colloidal particle (10 nm diameter), which makes the neighbor lists much fuller than in a regular simulation.

Thank you,
Kasra.
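(For reference, a minimal sketch of how such hybrid runs are typically launched; the binary name lmp_openmpi, the input file name in.colloid, and the exact mpirun mapping flags are assumptions inferred from the attached file names, not taken verbatim from the post:)

    # 80 MPI ranks, no OpenMP (4 nodes x 20 cores)
    mpirun -np 80 ./lmp_openmpi -in in.colloid

    # 16 MPI ranks with 5 OpenMP threads each
    # (2 ranks per socket, 5 cores per rank, matching the "ppr2-socket-pe5" naming)
    mpirun -np 16 --map-by ppr:2:socket:pe=5 \
        ./lmp_openmpi -sf omp -pk omp 5 -in in.colloid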

no-omp.out (10.2 KB)

ppr2-socket-pe5_5omp.out (6.75 KB)

Out-of-range atoms occur in PPPM typically because reneighboring is not done frequently enough and an atom has moved too far outside a proc's domain. If you are running a model where this can happen, then when it happens can be random. Meaning that if you run from a restart and trajectories diverge (normal round-off), it could happen in one simulation and not another. If you print out frequent thermo in 2 simulations, one where it happens and one where it doesn't, do the 2 simulations diverge before it happens?

Are you checking for reneighboring every timestep as a conservative criterion?
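(A minimal sketch of the conservative settings being asked about; the 10-step thermo interval and the particular thermo columns are assumptions for illustration:)

    # attempt a neighbor-list rebuild every step, no delay;
    # with "check yes" the rebuild occurs whenever an atom has moved far enough
    neigh_modify    delay 0 every 1 check yes

    # print thermo output frequently so two runs can be compared for divergence
    thermo          10
    thermo_style    custom step temp press pe ke etotal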

Steve

Steve,

Yes, as I mentioned in my first email, I'm using "neigh_modify delay 0 every 1 check yes one 100000 page 1000000" to make sure I'm not missing anything in the reneighboring phase. I examined the difference between the thermo output of the two simulations: 1. LAMMPS + mvapich2 with no PPPM error (attached file lammps-mvapich2.out) and 2. LAMMPS + openmpi-1.10.2 with the PPPM error (attached file lammps-openmpi-1-10-2.out). The columns are: step, water temperature, wall temperature, pressure of the system excluding the kinetic contribution of the frozen wall atoms, press, epair, emol and etotal, respectively.
I have attached the output for both cases. I don't see any significant difference between the two sets of statistics except the usual random fluctuations; or should I look at something more stringent for comparison? All the simulations are continuations from the same LAMMPS data file that I had already written out from previous simulations (I don't usually use restart files).
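(A sketch of the kind of thermo definition that would produce columns like those described; the group names water and wall and the compute names are assumptions, only the column order comes from the description above:)

    # temperatures of the water and wall groups
    compute         Twater water temp
    compute         Twall  wall  temp
    # pressure whose kinetic term comes only from the water group
    # (i.e. excludes the frozen wall atoms)
    compute         Pnowall all pressure Twater
    thermo_style    custom step c_Twater c_Twall c_Pnowall press epair emol etotal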

Best,
Kasra.

lammps-openmpi-1-10-2.out (42.5 KB)

lammps-mvapich2.out (103 KB)

Your email is long and can be somewhat confusing… You mentioned GPU in the first thread. What happened to that? It seems you are not even using a GPU-enabled version? It appears you are only using the USER-OMP package, but compiled using different MPI compilers? If that is the case, then it is most likely a compiler issue, and yet we don’t know how exactly you compiled the two MPI/OMP variants…

R

Hi Ray,

Yes, it may get confusing because there are many combinations that I've tried, each reported with some sort of issue. Here is a recap of the efforts I've already reported (); hopefully it clarifies what I've done. (Note: all of this was done on the QB2 cluster.)

1. LAMMPS + MVAPICH2-2.0: I noticed erratic behavior of LAMMPS: for the exact same input file and settings it would either get stuck, exit with an error, or run with no issues. These behaviors also depend on the number of MPI ranks chosen (testing the simulation a number of times, at some specific rank counts it never happens and at others it happens more regularly). When it errors, the message is "Failed to reallocate 1179648 bytes for array <shake or bond or …>".
2. LAMMPS + MVAPICH2-2.0 + USER-OMP: per Axel's suggestion of employing hybrid MPI+OpenMP to speed things up in the large simulation I'm running (), and I did see some performance enhancement. BUT the issues of getting stuck, exiting with an error, or running with no issues remain.
3. LAMMPS + MVAPICH2-2.0 + USER-OMP + GPU (using cuda-7.0): to test the effect of the GPU on performance I also used the GPU library of LAMMPS, but I couldn't get any better performance with it. The above-mentioned erratic behavior still holds.
4. LAMMPS + OPENMPI-1.10.2 + USER-OMP: I switched to another MPI library, wondering whether the combination of LAMMPS and the MPI library was causing the issue. Using openmpi-1.10.2 I do not get the above-mentioned erratic behavior, but then the error "ERROR on proc <proc #>: Out of range atoms - cannot compute PPPM (…/pppm.cpp:1918)" is what terminates the simulation. (Note: with MVAPICH2-2.0, if I was lucky the simulation would continue with no problem.)
5. LAMMPS + OPENMPI-1.10.2 + USER-OMP + GPU (using cuda-7.0): the same as 4, but when I also tried GPU runs it would exit with " ".
6. LAMMPS + OPENMPI-1.10.2 + USER-OMP + GPU (using cuda-6.5): using cuda-6.5 I was able to resolve the problem with the GPU run in step 5. But since I didn't notice any performance enhancement from the GPU in my case, I'm not using the GPU in my runs; I use MPI+OpenMP for all my simulations.

I hope this is clearer now; please let me know if more explanation is required. All MPI libraries are compiled with intel-14.0.2, and mpic++ uses this compiler too. I have attached my LAMMPS makefiles: I was using Makefile.mvapich2, and I also tried the LAMMPS sample Makefile for OpenMPI, which I customized for the cluster and named Makefile.openmpi.

Best,
Kasra.
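(A minimal sketch of how to confirm which compiler and MPI library a given build actually uses; the binary name lmp_openmpi is an assumption:)

    # show the underlying compiler behind the MPI wrapper
    mpicxx -show        # MVAPICH2 / MPICH wrappers
    mpicxx --showme     # Open MPI wrapper

    # check which MPI shared library the LAMMPS binary is linked against
    ldd ./lmp_openmpi | grep -i mpi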

Makefile.mvapich2 (3.14 KB)

Makefile.openmpi (3.58 KB)

I have no idea. You now mention this error:

when giving error it was (Failed to reallocate 1179648 bytes for array <shake or bond or …>)

Obviously that's a running-out-of-memory error. There is no solution other than to run with more processors.

The only way anyone can help with any of this is for you to provide an input script and data file for a small problem that runs quickly and consistently fails on a small number of procs. I suggest it be for case (1) (best) or (2).
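(A minimal sketch of one way to carve a small, quickly-failing test case out of the existing data file; the region radius, group names, and file names are assumptions:)

    # keep only the colloid plus a shell of solvent around it
    region          keep sphere 0.0 0.0 0.0 80.0 units box
    group           keep region keep
    group           discard subtract all keep
    delete_atoms    group discard mol yes    # mol yes: remove whole molecules, not fragments
    write_data      small_test.data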

Steve

Hi Steve,

Yes, I had reported that error back in the thread (http://lammps.sandia.gov/threads/msg58843.html). It is an error that surfaced erratically when I was using MVAPICH2-2.0. I never see that error when I use the OpenMPI library; instead, in that case I get "ERROR on proc <proc #>: Out of range atoms - cannot compute PPPM (../pppm.cpp:1918)".
I have run other simulations with smaller systems, but I have never seen this peculiar behavior in the smaller cases. I think the size of the problem is what is really triggering it. I'd be more than happy to provide my input file and the data file for this problem if someone can help me track down the issue. The most complicated part is that it is not 100% reproducible; it is random and changes behavior with different combinations of MPI libraries.

Best,
Kasra.

All of the following makes this a bad problem: it is not 100% reproducible, it depends on the MPI library, and it involves a rather large structure.

As Axel and I have been trying to point out, this is likely a problem with your MPI compiler and is machine dependent. Unless you can generate a small input deck that fails 100% of the time (not with the known "out of range atoms" error, which just indicates bad dynamics) and can be convincing that this is indeed a problem with the LAMMPS source code, it is highly unlikely anybody will look into this.

My suggestion is: try different compute clusters (machines) and stick with an MPI compiler that does not fail.

Ray

I would go one step farther and construct a small-ish test/debug input that can be run with the MPI STUBS library, and try to have as few complications as possible, i.e. run with the colloid pair style only first, without GPU, OpenMP, and other "gimmicks", and then gradually try them out one by one; check with valgrind's memcheck or a suitably instrumented executable for memory and communication issues. With very large colloid particles, I could imagine that several of the usual heuristic consistency checks will not trigger as easily in case of problems, as the difference between "normal-normal" and "big-big" interactions grows.

Chasing down MPI library issues seems to me more like trying to fix the symptom rather than the cause.
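(A minimal sketch of that workflow with the legacy in-src LAMMPS build system; the package selection and the input file name in.small are assumptions:)

    # build a serial binary linked against the bundled MPI STUBS library
    cd src
    make yes-colloid      # enable only the packages the stripped-down test needs
    # if needed, build the STUBS library first: cd STUBS && make && cd ..
    make serial

    # run the stripped-down test under valgrind's memcheck
    valgrind --tool=memcheck --leak-check=full ./lmp_serial -in in.small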

axel.

I appreciate the expert and valuable suggestions from y'all; I'll try what you have suggested and report back with the outcomes.

Best,
Kasra.