restarted run diverges, but no lammps warning

_Chetan_Mahajan · April 19, 2012, 2:46am

Hi All

I am simulating a solvated polymer system and measuring the diffusivity of the solvent through the polymer. I have original runs for two systems, and also runs restarted from a midpoint of the original trajectories.

The LAMMPS manual says that a restarted run is made to match the original run, but that the output of the two may diverge in certain cases. My diffusivity changes by 23 or 45 % in the restarted runs compared to the original runs. Surprisingly, I do not get any LAMMPS warnings. None of the differences between an original and a restarted run mentioned on this page apply to my runs: http://lammps.sandia.gov/doc/read_restart.html

I only get a supercomputer warning (for the restarted runs, not the original ones) that the compiler environment is PGI, whereas the LAMMPS executable was built with Intel.

Interestingly, I made one more run of one of these systems from scratch (not using any restart file), and its diffusivity differs by just 6 % from the original runs.

I wonder what could be the issue. Any help greatly appreciated.

Thanks

Chetan

akohlmey · April 19, 2012, 2:54am

> Hi All
>
> I am simulating solvated polymer system and measuring the diffusivity of
> solvent through polymer. I have original runs for two systems and also runs
> restarted from a midpoint of the trajectory of original runs.
>
> Lammps manual says that restarted run is made to match with original run but
> output of the two may diverge in certain cases. I have diffusivity changing
> by 23 or 45 % in restarted runs, compared to original runs.
> Surprisingly, I do not have any lammps warnings. I do not have any
> differences between original and a restarted run as mentioned on
> page: http://lammps.sandia.gov/doc/read_restart.html

there are many possible reasons, one of which is that
your measurement of (self-)diffusion is not properly converged.
for the system you quote, that is quite possible.

> I just have a supercomputer warning (for the restart runs, not the
> original) that compiler environment is pgi, whereas lammps executable was
> made with intel.

???

> Interesting, I had made one more run of one of these systems from scratch
> (not using any restart file) and its diffusivity changes by just 6 %
> compared to original runs.

that just confirms the suspicion about lack of convergence.
have you made any tests on that? i.e. start from multiple,
sufficiently different but equivalent initial configurations?

axel.

_Chetan_Mahajan · April 19, 2012, 3:53am

Axel

Yes, I am aware that convergence is always an issue in such systems. It is more pronounced for all-atom simulations, which can be run for only up to a few ns, especially when many runs are required. The literature tries to understand these systems within those limitations. The correlation coefficients of my linear diffusion fits are 0.9976, 0.973, 0.9934, etc., so they are reliable for engineering purposes, where only relative diffusion behavior matters. We are already making multiple runs; the rerun of the original from scratch (without a restart) that I mentioned at the end of my earlier email is one of them.

Warning that I am getting from a supercomputer is:

TACC: Done.
TACC: Starting up job 2498456
TACC: Setting up parallel environment for MVAPICH ssh-based mpirun.
TACC: Setup complete. Running job script.
******************************************************
WARNING: Your Compiler Environment is : pgi-7.2
Your executable was built with: intel-10.1
******************************************************

I will submit a job with change in compiler environment if possible.

akohlmey · April 19, 2012, 6:00am

> Axel
>
> Yes, I am aware in such systems, convergence is always an issue. This becomes
> more appreciable for all-atom simulations, which can be executed only up to
> few ns, especially when many multiple runs are required. Literature tries to
> understand the system within these limitations. Now, the correlation
> coefficient for my linear diffusion fit is 0.9976, 0.973, 0.9934 etc. So
> they are reliable for engineering purposes where only relative diffusion

check out slide 11 on this presentation for a contradicting example:
http://klein-group.icms.temple.edu/akohlmey/files/talk-trieste2004-water.pdf

> behavior matters. We are already in the process of making multiple runs, and
> my rerunning original without restart mentioned in the end of earlier email
> is one of that.
>
> Warning that I am getting from a supercomputer is:
>
> TACC: Done.
> TACC: Starting up job 2498456
> TACC: Setting up parallel environment for MVAPICH ssh-based mpirun.
> TACC: Setup complete. Running job script.
> ******************************************************
> WARNING: Your Compiler Environment is : pgi-7.2
> Your executable was built with: intel-10.1
> ******************************************************
>
> I will submit a job with change in compiler environment if possible.

that would be pointless.
how should that have any
impact on the simulation?

axel.

sjplimp · April 19, 2012, 1:20pm

If the restarted vs continued run agree in their
thermo output at the beginning of the restart,
and slowly diverge, then I don't think this is a
LAMMPS issue. After a few 1000 steps you
typically can't expect any match, due to
the nature of MD and round-off issues.
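The sensitivity described here can be illustrated with a toy model. The sketch below is not MD and not LAMMPS code; it uses the chaotic logistic map as a stand-in, assuming only that MD trajectories amplify tiny differences in a broadly similar, exponential way. The starting value 0.3 and the perturbation of 1e-12 are arbitrary choices made for illustration.

```python
def logistic_trajectory(x0, steps):
    """Iterate the chaotic logistic map x -> 4 x (1 - x) and return all iterates."""
    xs = []
    x = x0
    for _ in range(steps):
        x = 4.0 * x * (1.0 - x)
        xs.append(x)
    return xs


STEPS = 100
# Two "runs" whose starting points differ by a round-off-sized amount.
reference = logistic_trajectory(0.3, STEPS)
perturbed = logistic_trajectory(0.3 + 1.0e-12, STEPS)
differences = [abs(a - b) for a, b in zip(reference, perturbed)]

# The gap grows roughly exponentially until it is as large as the values themselves.
for step in (1, 10, 20, 30, 40, 50):
    print(f"step {step:3d}: |difference| = {differences[step - 1]:.3e}")
```

In this toy system the two sequences agree closely at first and become unrelated within a few dozen iterations, which is the same qualitative pattern as thermo output that matches at the start of a restart and then slowly drifts apart.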

Then, as Axel said, the statistical variation
can be great for something like a diffusion
coeff.

Steve

_Chetan_Mahajan · April 19, 2012, 4:40pm

Hi Steve and Axel

@Steve:

This is what my initial (for the restart) thermodynamic output shows:

First, for the original run at timestep 5000000:

Step CPU Press Temp PotEng KinEng TotEng Enthalpy E_vdwl E_hbond n_hbond E_coul E_pair E_bond E_angle E_dihed E_impro E_mol E_long Volume
5000000 0 530.77605 338.66649 -21386.939 13285.026 -8101.9122 -6970.1244 7432.3133 -28.546358 75
25905.026 -37130.17 6608.0543 4723.902 3575.6798 835.59555 15743.232 -70467.51 146210.24

Now, for the run restarted at timestep 5000000:

Step CPU Press Temp PotEng KinEng TotEng Enthalpy E_vdwl E_hbond n_hbond E_coul E_pair E_bond E_angle E_dihed E_impro E_mol E_long Volume
5000000 0 522.60194 338 -21386.939 13258.882 -8128.057 -7013.6991 7432.3133 -28.546358 75
25905.026 -37130.17 6608.0543 4723.902 3575.6798 835.59555 15743.232 -70467.51 146210.24

As you can see, some of the values, such as pressure (column 3), kinetic energy (column 6), total energy (column 7), and enthalpy (column 8), are different, but I am not sure whether the differences are significant.

Also, could you explain what you mean by the nature of MD causing original and restarted runs to diverge? Other than round-off effects, I suppose it includes convergence issues, but what else might it include?

@Axel: Since that compiler warning appeared ONLY in the restarted runs (not in the original), I wondered whether it was causing trouble. Thanks for the slide. It is interesting to know.

Thank you all,
regards
Chetan

akohlmey · April 19, 2012, 5:51pm

> Hi Steve and Axel
>
> @Steve:
>
> This is what my initial (for the restart) thermodynamic output shows:
>
> First for original run, at timestep 5000000
>
> Step CPU Press Temp PotEng KinEng TotEng Enthalpy E_vdwl E_hbond n_hbond
> E_coul E_pair E_bond E_angle E_dihed E_impro E_mol E_long Volume
> 5000000 0 530.77605 338.66649 -21386.939 13285.026
> -8101.9122 -6970.1244 7432.3133 -28.546358 75
> 25905.026 -37130.17 6608.0543 4723.902 3575.6798 835.59555
> 15743.232 -70467.51 146210.24
>
> Now, for the run restarted at timestep 5000000:
>
> Step CPU Press Temp PotEng KinEng TotEng Enthalpy E_vdwl E_hbond n_hbond
> E_coul E_pair E_bond E_angle E_dihed E_impro E_mol E_long Volume
> 5000000 0 522.60194 338 -21386.939 13258.882
> -8128.057 -7013.6991 7432.3133 -28.546358 75
> 25905.026 -37130.17 6608.0543 4723.902 3575.6798 835.59555
> 15743.232 -70467.51 146210.24
>
> See, some of the values such as pressure (column 3), Kinetic energy (column
> 6), Total Energy (column 7), Enthalpy (column 8) are different, but I am not
> sure if they are significantly different.

yes. they are. you must be resetting your temperature in some way,
which is the same as giving your system a kick. please check your
input. these data should be identical.
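A quick way to run this check is to compare the two thermo lines column by column. The following is a minimal sketch, assuming the header and the two data lines have been copied by hand from the log files (the values are the ones posted above); the function name `differing_columns` is made up for this example.

```python
def differing_columns(header, line_a, line_b):
    """Return (name, value_a, value_b) for every thermo column that differs."""
    names = header.split()
    values_a = line_a.split()
    values_b = line_b.split()
    if not (len(names) == len(values_a) == len(values_b)):
        raise ValueError("header and data lines must have the same number of columns")
    return [(name, float(a), float(b))
            for name, a, b in zip(names, values_a, values_b)
            if float(a) != float(b)]


header = ("Step CPU Press Temp PotEng KinEng TotEng Enthalpy E_vdwl E_hbond "
          "n_hbond E_coul E_pair E_bond E_angle E_dihed E_impro E_mol E_long Volume")
original = ("5000000 0 530.77605 338.66649 -21386.939 13285.026 -8101.9122 "
            "-6970.1244 7432.3133 -28.546358 75 25905.026 -37130.17 6608.0543 "
            "4723.902 3575.6798 835.59555 15743.232 -70467.51 146210.24")
restarted = ("5000000 0 522.60194 338 -21386.939 13258.882 -8128.057 "
             "-7013.6991 7432.3133 -28.546358 75 25905.026 -37130.17 6608.0543 "
             "4723.902 3575.6798 835.59555 15743.232 -70467.51 146210.24")

mismatches = differing_columns(header, original, restarted)
for name, a, b in mismatches:
    print(f"{name:10s} original={a:<14g} restarted={b:<14g} diff={b - a:+g}")
```

For the posted data, every potential energy term and the volume agree, while the temperature and the quantities that depend on the kinetic energy do not. That pattern is consistent with the positions being restored correctly and the velocities being changed.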

> Also, could you explain more what do you mean by nature of MD causing
> original and restarted runs to diverge? other than roundoff effects, I
> suppose it includes convergence issues, but what else it may include?

it is not really roundoff, it is a truncation that happens when
you, e.g., add or subtract two numbers of different magnitude.

floating point addition is not associative, so the result of a sum
depends on the order of its terms. a restart causes a neighbor list rebuild
and that will change the order in which properties like forces are
summed up, which will make the trajectories diverge exponentially.
this is fundamental MD knowledge; check it out in, e.g., the allen
and tildesley book.

unless you do fixed point math this will always happen. but this
is a small change and will manifest over time.
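The order dependence of floating point sums is easy to reproduce. The sketch below is plain Python, not LAMMPS code; the list of "force contributions" is random, made-up data, and the shuffle merely stands in for the reordering that a neighbor list rebuild would cause.

```python
import math
import random

# Regrouping the same three numbers changes the last bit of the result.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print("regrouping gives identical sums:", grouped_left == grouped_right)

# Adding numbers of very different magnitude truncates the small one.
big_first = (1.0e16 + 1.0) - 1.0e16
small_last = (1.0e16 - 1.0e16) + 1.0
print("big first :", big_first)
print("small last:", small_last)

# Hypothetical force contributions spanning several orders of magnitude.
rng = random.Random(2012)
contributions = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-3, 3)
                 for _ in range(10_000)]


def running_sum(values):
    """Accumulate left to right, the way a simple force loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


sum_original = running_sum(contributions)

reordered = contributions[:]
rng.shuffle(reordered)          # same terms, different summation order
sum_reordered = running_sum(reordered)

print("original order :", repr(sum_original))
print("reordered      :", repr(sum_reordered))
print("difference     :", sum_original - sum_reordered)
print("math.fsum value:", repr(math.fsum(contributions)))
```

The differences are tiny compared with the sums themselves, which is why the thermo output agrees at first; the chaotic dynamics then amplify them over time.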

the convergence issue is independent. as the slide i pointed
you to shows, you may think you have a converged result (the slide
shows the first derivative(!) of the MSD, not the MSD itself; the
MSD curves look straight), but you don't cover all modes in your
system. a lot of mistakes in MD analysis, and thus differences in
the published literature, happen because people don't know their
time scales.
...and it is very difficult (i.e. would take a huge effort) to prove otherwise.
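The point about the derivative can be made concrete with a small numerical experiment. The sketch below uses a synthetic mean squared displacement (MSD) curve in arbitrary units, not data from this thread: a long-time diffusive term plus a slow transient, with all parameters (`D_TRUE`, the amplitude 60, the decay time 20) invented for illustration. It assumes the 3D Einstein relation, D = slope / 6.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares line; returns (slope, intercept, correlation r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / math.sqrt(sxx * syy)


# Synthetic MSD: 6*D*t plus a transient that has not yet decayed (made-up numbers).
D_TRUE = 1.0
times = [0.1 * i for i in range(1, 501)]
msd = [6.0 * D_TRUE * t + 60.0 * (1.0 - math.exp(-t / 20.0)) for t in times]

# A single fit over the whole curve looks excellent by its correlation coefficient.
slope_all, _, r_all = linear_fit(times, msd)
print(f"full-range fit: D = {slope_all / 6.0:.3f}, r = {r_all:.4f}")

# Fitting separate windows exposes the drift in the local slope.
N_WINDOWS = 5
size = len(times) // N_WINDOWS
window_d = []
for w in range(N_WINDOWS):
    lo, hi = w * size, (w + 1) * size
    slope, _, _ = linear_fit(times[lo:hi], msd[lo:hi])
    window_d.append(slope / 6.0)
    print(f"window {w + 1}: t = {times[lo]:5.1f} .. {times[hi - 1]:5.1f}, D = {slope / 6.0:.3f}")
```

In this constructed example the correlation coefficient of the full fit is very close to 1, yet the diffusion coefficient estimated from the first window is noticeably larger than the one from the last window. A high correlation coefficient alone therefore does not show that the slope has converged.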

axel.

_Chetan_Mahajan · April 19, 2012, 7:15pm

Thanks, Axel. Yes, as my labmate and you pointed out, I unfortunately had a velocity command (set to the same temperature as the original run) after the restart. After removing it, everything seems fine.

regards
Chetan
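A simple guard against this mistake is to scan the restart input for a `velocity` command that follows `read_restart`. The sketch below is a hypothetical helper, not part of LAMMPS; the input fragment, the restart file name, and the random seed are invented for illustration, and the scan is deliberately naive (it does not follow `include` files or variables).

```python
def velocity_after_restart(script_text):
    """Return (line_number, command) for each velocity command after read_restart."""
    seen_restart = False
    hits = []
    for number, raw in enumerate(script_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        command = line.split()[0]
        if command == "read_restart":
            seen_restart = True
        elif command == "velocity" and seen_restart:
            hits.append((number, line))
    return hits


# Hypothetical restart input fragment (file name and seed are made up).
example_input = """\
read_restart restart.5000000
velocity all create 338.0 4928459   # re-draws all velocities after the restart
run 1000000
"""

hits = velocity_after_restart(example_input)
for number, line in hits:
    print(f"line {number}: '{line}' overrides the velocities stored in the restart file")
```

Any line flagged this way deserves a second look before the restarted run is compared with the original one.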