openmpi LAMMPS behaviour

Anna_Lappala · March 28, 2011, 5:56pm

Dear all,

I am running the OpenMPI build of LAMMPS (20 Feb 2010 version) on 32 processors. My simulation runs perfectly on 8 processors with the parallel Ubuntu version (18 Dec 2010) for at least 38 million timesteps, but on 32 processors it blows apart after as few as ~5 to 10 million steps due to lost atoms/angles. Could somebody perhaps help me understand why this is happening? Please let me know if more information is needed.

Thank you in advance.
With best wishes,
Anna

Anna_Lappala · March 28, 2011, 6:05pm

I apologise, but I forgot to add: if I run, say, 11 million timesteps and the simulation crashes, and I then go back and restart from a restart file written at the 10 millionth timestep, it runs on for the next 20 million timesteps...

sjplimp · March 29, 2011, 2:21pm

It's unlikely that a million steps after the restart,
you are reproducing the original un-restarted calculation.
Even if you added a write to a restart file in the first run and let
it keep running, that would change the trajectory,
since the code forces a reneighboring on the steps a restart
file is written.

Undoubtedly something is going badly wrong in your run
near step 11M. I suggest you write out lots of thermo
info and snapshots near that timestep and look at the output
to see what is bad.

Steve
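
For readers who want to follow the advice above, here is a minimal sketch of scanning thermo output near the failing step for the first abrupt change. It assumes the log contains a header line starting with `Step` followed by whitespace-separated numeric rows. The column names, the `SAMPLE_LOG` text, and the helper names `read_thermo` and `first_jump` are all made up for illustration; none of them come from the original run or from LAMMPS itself.

```python
# Synthetic stand-in for thermo output; NOT taken from the run discussed here.
SAMPLE_LOG = """\
Step Temp TotEng Press
10999000 1.01 -3.20 0.51
10999100 0.99 -3.21 0.49
10999200 1.00 -3.19 0.50
10999300 1.02 -3.20 0.52
10999400 5.70 12.40 9.80
Loop time of 12.3 on 32 procs
"""


def read_thermo(lines):
    """Collect numeric rows that follow a header line starting with 'Step'."""
    header = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Step":
            header = fields
            continue
        # Skip anything that does not match the header's column count.
        if header is None or len(fields) != len(header):
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # non-numeric line, e.g. a warning message
        rows.append(dict(zip(header, values)))
    return rows


def first_jump(rows, column, factor=10.0):
    """Return the first step where `column` changes by more than `factor`
    times the average change seen in the earlier rows, or None."""
    deltas = []
    for prev, cur in zip(rows, rows[1:]):
        delta = abs(cur[column] - prev[column])
        if deltas:
            typical = sum(deltas) / len(deltas)
            if typical > 0 and delta > factor * typical:
                return int(cur["Step"])
        deltas.append(delta)
    return None


rows = read_thermo(SAMPLE_LOG.splitlines())
print(first_jump(rows, "TotEng"))
```

With a real run, the lines of the log file would replace `SAMPLE_LOG.splitlines()`, and the column passed to `first_jump` should be one that actually appears in the thermo header. The flagged step is only a starting point for inspecting snapshots around it.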

Anna_Lappala · March 29, 2011, 2:31pm

Dear Dr Plimpton,

Thank you very much for your suggestion, I will do as you suggest! Another thing I noticed just now: when I run my simulation on 64 processors, it terminates at timestep 3.5 million; however, exactly the same code run on 32 processors with exactly the same LAMMPS version and other OpenMPI settings is now at step 4.3 million and happily running...

Thank you very much again,
Anna

sjplimp · March 29, 2011, 2:41pm

Again, there should be no expectation that you can run a simulation
for millions of steps on different numbers of processors and
get the identical trajectory. If you look at your thermo output, I assume
you will see this is not the case. So whatever is going wrong is
some rare event that most likely reflects a flaw in your model (e.g. atoms
can get too close together, then fly apart). You need to figure
out what the issue is and fix it - not just hope that randomly running
again will make it go away.

Steve
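
The point about different processor counts can be illustrated with a toy sketch. It is not LAMMPS code and it makes two assumptions: that splitting work across processors changes the order in which floating-point numbers are summed, and that a chaotic system amplifies the resulting tiny differences. The logistic map stands in for the dynamics, and the `1e-12` perturbation stands in for accumulated round-off; both are illustrative choices, as is the helper name `first_divergence`.

```python
# Floating-point addition is not associative: the same three numbers
# summed in a different order give slightly different results.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def first_divergence(x_a, x_b, steps=200, tol=0.1):
    """Iterate the chaotic logistic map x -> 4x(1-x) from two starting
    values and return the first step where they differ by more than tol,
    or None if they stay within tol for all `steps` iterations."""
    for step in range(1, steps + 1):
        x_a = 4.0 * x_a * (1.0 - x_a)
        x_b = 4.0 * x_b * (1.0 - x_b)
        if abs(x_a - x_b) > tol:
            return step
    return None


# Two "runs" whose starting states differ only by a round-off-sized amount.
diverged_at = first_divergence(0.3, 0.3 + 1e-12)
print(diverged_at)
```

The two toy trajectories separate completely after a few dozen iterations, which is why matching thermo output between an 8-, 32-, and 64-processor run over millions of steps should not be expected. A crash that appears on one processor count and not another therefore says little about the processor count itself.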