making lammps (in parallel) on ranger

_Chetan_Mahajan · October 1, 2011, 3:35am

Hi

We are trying to compile LAMMPS (in parallel) on the Ranger supercomputing system again, after Ranger underwent some changes. Attached is a Makefile that TACC consulting (mainly Yaakoub) generated for us. It compiles fine, and I have even used the resulting executable, lmp_tacc, for my simulations. For system 1, the water diffusivity obtained with the new lmp_tacc is exactly the same as that calculated with lmp_ranger (the old executable built on Ranger before the recent changes). For system 2, however, the water diffusivity is 23% lower than the earlier value obtained with lmp_ranger. When I conveyed this to TACC, I was told that he is using some vanilla flags for compilation, but it is not clear whether that is what is causing the trouble. I was advised to post this on the LAMMPS mailing list, so please let me know your opinion.

Thanks
Chetan

Makefile.tacc (2.8 KB)

akohlmey · October 1, 2011, 3:46am

chetan,

you are missing a number of essential pieces of information:
- what version of LAMMPS exactly has been compiled
- what compiler/MPI is being used
- what are the test cases that you are using?

axel.

_Chetan_Mahajan · October 1, 2011, 4:05am

Hi Axel

Thanks. The new lmp_tacc has been generated using latest lammps through svn. The old lmp_ranger was generated for lammps until mid-May 2011.

We need to load following modules in order to make lammps on new Ranger, using Makefile.tacc that i attached earlier:

intel,mvapich,mkl,fftw2

System 1 for which old and new lammps executable are producing exactly same results is Nafion-solvated by water. System 2 for which the results differ by 23 % is SPEEK (Sulfonated poly-ether-ether-ketone) solvated by water.

Thanks
Chetan

akohlmey · October 2, 2011, 1:39am

Hi Axel

Chetan,

Thanks. The new lmp_tacc has been generated using latest lammps

That is a very imprecise description. As a researcher you should get used to providing accurate descriptions. When you run a LAMMPS executable, it prints out the exact version. Sometimes the exact day, or rather the patch date, can make all the difference.

through svn. The old lmp_ranger was generated for lammps until mid-May 2011.

We need to load following modules in order to make lammps on new Ranger, using Makefile.tacc that i attached earlier:

intel,mvapich,mkl,fftw2

This should make no difference at all. Either the compiler and MPI work or they don’t.

System 1 for which old and new lammps executable are producing exactly same results is Nafion-solvated by water. System 2 for which the results differ by 23 % is SPEEK (Sulfonated poly-ether-ether-ketone) solvated by water.

Again, this is a pretty useless description. Without seeing the exact inputs, particularly the potentials and settings they use, and without knowing which measure you use to determine the deviation (i.e. which property is 23% off), it is impossible to gauge whether the deviation is acceptable, or due to a fixed bug, or due to a yet unknown bug, or some other mistake.

Axel

_Chetan_Mahajan · October 4, 2011, 2:37am

Ok! Axel, i just tried to answer the information you wanted as it seemed from your earlier email. Deviation is just calculated between two water diffusivity values from old and new executable (% calculated with respect to maximum value). I understand it may be ambiguous and so I have simplified things with some new data as attached in excel file herewith.

Let me rephrase the question again: Question is if lammps is compiling perfect on new Ranger with New Makefile to give executable lmp_tacc, should we assume that it is as accurate as the lammps executable (lmp_ranger) generated (WITH SAME LAMMPS VERSION) on old Ranger with old Makefile. TACC consultants told me that i should check the accuracy of lmp_tacc.

Coming back to the Excel file: I have considered both systems, 1) SPEEK-water and 2) Nafion-water. For each system, I have tabulated the diffusivity of water (diffw) and the diffusivity of hydronium (diffh) for 3 types of runs: 1) lmp_tacc, 2) lmp_ranger, and 3) lmp_ranger-RERUN.
Run 3) was done to see how much the diffusivity varies when the same run is carried out again.

Below these diffusivity values, I have calculated Deltalmp (the difference in diffusivities when lmp_tacc is used instead of lmp_ranger) and Deltanormal (the difference in diffusivities when the same run is carried out again). The aim is to compare Deltalmp with Deltanormal, determine whether LAMMPS gives more than normal deviations in output when lmp_tacc is used instead of lmp_ranger, and so tell the TACC consultants whether lmp_tacc is accurate enough.

The ratios listed in blue do suggest that lmp_tacc should be accurate enough. In two of the cases they are less than 1. In one case it is 1.65, slightly more than 1, whereas in one case it is 5, but as I have commented there, the R2 values of the linear fit to MSD vs. time used to determine the diffusivity are terribly low.
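
For readers without the spreadsheet, the comparison described above can be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the numbers are made up and are NOT taken from lmperror.xls, and the ratio is assumed to be Deltalmp divided by Deltanormal.

```python
def percent_deviation(a, b):
    """Difference between two values as a percentage of the larger one."""
    largest = max(abs(a), abs(b))
    if largest == 0.0:
        return 0.0
    return 100.0 * abs(a - b) / largest


def deviation_ratio(d_tacc, d_ranger, d_ranger_rerun):
    """Ratio of executable-to-executable spread to run-to-run spread.

    Assumed definition: Deltalmp / Deltanormal.
    """
    delta_lmp = abs(d_tacc - d_ranger)
    delta_normal = abs(d_ranger - d_ranger_rerun)
    if delta_normal == 0.0:
        raise ValueError("rerun reproduced the value exactly; ratio is undefined")
    return delta_lmp / delta_normal


# Made-up diffusivities in arbitrary units, for illustration only.
d_tacc, d_ranger, d_rerun = 0.77, 1.00, 0.90
print("percent deviation:", percent_deviation(d_tacc, d_ranger))
print("Deltalmp / Deltanormal:", deviation_ratio(d_tacc, d_ranger, d_rerun))
```

A ratio near or below 1 only says that the two executables differ by no more than a single rerun does. With one rerun per system, the run-to-run spread itself is poorly estimated.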

SO i guess all is fine! Comments, if any, are welcome!
cheers
Chetan

lmperror.xls (15 KB)

akohlmey · October 4, 2011, 3:14am

chetan,

you are still not getting it. most of what you did is a waste of time
and just shipping an excel file with some numbers is no proof at all.
for as long as nobody else can verify what you did _independently_,
you simply don't know and i cannot tell you as well.

first of all, you didn't answer my question about the _exact_
LAMMPS version. you just wrote "the latest", but that is a
very inaccurate description since "the latest" can change
daily. instead you should give the date string that lammps
prints. i am asking this for a specific reason, since there
recently was a bug present for a few days that could have
affected calculations like yours.

second, you should start by testing for a property that is less
noisy than diffusivity. rather than using your own inputs, you
should try to run some of the inputs that ship with lammps.
and you should get (mostly) identical values for the entire
output, except for timing. the examples are run with one
and 4 processors, so you can compare and see how much
deviation between runs is possible. different inputs can
deviate differently.
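
A rough sketch of how such a comparison of thermo output could be automated is given below. It assumes the one-line thermo layout (a header line starting with "Step", numeric rows, and a block that ends at a line starting with "Loop time"); logs written with a different thermo style would need a different parser. The function names and the two sample logs are made up for illustration.

```python
def parse_thermo(log_text):
    """Return (column_names, rows) for the first one-line-style thermo block."""
    columns, rows = None, []
    for line in log_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if columns is None:
            # Everything before the header line is ignored.
            if fields[0] == "Step":
                columns = fields
            continue
        if line.startswith("Loop time"):
            break  # end of the thermo block
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # skip warnings or other text mixed into the output
        if len(values) == len(columns):
            rows.append(values)
    return columns or [], rows


def max_relative_differences(log_a, log_b):
    """Largest relative difference per thermo column over the common rows."""
    cols_a, rows_a = parse_thermo(log_a)
    cols_b, rows_b = parse_thermo(log_b)
    if cols_a != cols_b:
        raise ValueError("thermo columns differ between the two logs")
    worst = {name: 0.0 for name in cols_a}
    # zip stops at the shorter log, so only rows present in both are compared.
    for row_a, row_b in zip(rows_a, rows_b):
        for name, a, b in zip(cols_a, row_a, row_b):
            scale = max(abs(a), abs(b))
            if scale > 0.0:
                worst[name] = max(worst[name], abs(a - b) / scale)
    return worst


# Made-up sample logs, not real LAMMPS output.
REFERENCE = """\
Step Temp E_pair TotEng Press
0 300.0 -100.0 -50.0 1.00
10 301.0 -101.0 -50.0 1.10
Loop time of 1.5 on 4 procs
"""
CANDIDATE = """\
Step Temp E_pair TotEng Press
0 300.0 -100.0 -50.0 1.00
10 301.5 -101.0 -50.0 1.10
Loop time of 1.7 on 4 procs
"""

for name, diff in max_relative_differences(REFERENCE, CANDIDATE).items():
    print(name, diff)
```

Timing lines are deliberately left out of the comparison, since they are expected to differ between machines.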

there are basically two issues that can happen:

- your compiler miscompiles some part of the code,
  which can lead to differences in the "thermo" properties.
  those usually happen with high compiler optimization,
  so in case of differences you should compile with optimization
  turned off and/or using a different compiler (e.g. gcc).
  if any or the combination of the two steps yields better
  agreement with the reference outputs, then you may
  have found a reason. whether these differences are
  significant is a different story. sometimes they are,
  sometimes not. this is where you should post the
  results to the list for people to look at.

- if you still have differences, then it is likely that there
  is a bug in the specific version of lammps that
  you have compiled. this can happen. sometimes a little
  change can have unexpected side effects. in this case
  you should first update to the very latest version and
  recompile. then repeat the check from above and if
  it persists contact the mailing list with the specific
  input and output and deviations that you see.

Ok! Axel, i just tried to answer the information you wanted as it seemed
from your earlier email. Deviation is just calculated between two water
diffusivity values from old and new executable (% calculated with respect to
maximum value). I understand it may be ambiguous and so I have simplified
things with some new data as attached in excel file herewith.

Let me rephrase the question again: Question is if lammps is compiling
perfect on new Ranger with New Makefile to give executable lmp_tacc, should
we assume that it is as accurate as the lammps executable (lmp_ranger)
generated (WITH SAME LAMMPS VERSION) on old Ranger with old Makefile. TACC
consultants told me that i should check the accuracy of lmp_tacc.

let's forget about the TACC guys. they are nice folks but most
likely have no practical experience in MD.

Coming back to excel file: I have considered both systems 1) SPEEK-water 2)
Nafion-water. FOr each system, i have tabulated diffusivity of water (diffw)
and diffusivity of hydronium (diffh) for 3 types of runs 1) with lmp_tacc 2)
lmp_ranger 3)lmp_ranger-RERUN.
3) was done to see how much diffusivity varies with the same run carried
again.
Below these diffusivity values, i have calculated Deltalmp (Difference in
diffusivities when lmp_tacc is used instead of lmp_ranger) and Deltanormal
(difference in diffusivities when same run carried out again). Aim is to
compare Deltalmp with Deltanormal and determine if lammps is giving more
than normal deviations in output due to using lmp_tacc instead of lmp_ranger
and therefore respond to TACC consultants on whether lmp_tacc is accurate
enough.
The ratios listed in blue color do say that lmp_tacc should be accurate
enough. Two of the cases they are less than 1. One case it is 1.65, slightly
more than 1, whereas in one case it is 5, but as i have commented there, the
R2-values of linear fit to MSD Vs time to determine diffusivity are terribly
low.

this is all very confusing and i fail to see how you deduce that
your results are "good". all you have is an incomplete "internal"
standard that is not accounting for a number of potential issues
and a "quality parameter" that is very noisy. if you don't believe me
on the latter, have a look at this:
http://klein-group.icms.temple.edu/akohlmey/files/talk-trieste2004-water.pdf

SO i guess all is fine! Comments, if any, are welcome!

you may have convinced yourself.
but to put it differently: if this was part
of a paper, i would have to reject it.

i am not saying that your executable
is broken, but there is nothing that you
have produced that is convincing.

cheers,
axel.

_Chetan_Mahajan · October 5, 2011, 6:22pm

Axel,

I did not answer the version question specifically since as i said, for the runs i was doing, lammps was exactly similar for both the executables. Anyways, the version for lmp_tacc is Lammps 16 Sept 2011.

Testing with the LAMMPS sample examples was a good suggestion. Thanks. I did it with the peptide example, and lmp_tacc (the new LAMMPS executable on the new Ranger) yields almost the same results as those in log.peptide.28Mar11.linux.4. I have attached both log files herewith (log.lammps is from lmp_tacc).

I have passed your comments about optimization on to the TACC consultants. I still feel the diffusivity data I attached earlier does convey some meaningful information. However, your point about diffusivity being a noisy parameter is noted.

I will get back to you later in detail if required.

Thanks
Chetan

log.lammps (5.65 KB)

log.peptide.28Mar11.linux.4 (5.63 KB)

sjplimp · October 6, 2011, 1:39pm

You never said whether the difference between the 2 runs (same
version of code, running on different Ranger versions) was radically
different, or just epsilon different, and slowly diverged over time.
If the latter, then that could be normal round-off behavior.

Steve

_Chetan_Mahajan · October 6, 2011, 2:58pm

Hi Steve

That’s precisely what I tried to do in the Excel file I sent earlier. I compared the diffusivity difference between two normal runs (due to numerical round-off etc.) with the diffusivity difference between two runs with the two different Ranger executables. If you look at the ratio, it is less than or comparable to 1 in 3 of the 4 cases, which means everything is fine. The one case where it is significantly higher than 1 does not have good R2 values for the linear fit to MSD vs. time, so it is not reliable in the first place. However, I do note Axel’s point that diffusivity can be noisy and thus keep us from the correct conclusion. So I did a LAMMPS test run (peptide) with both executables, lmp_ranger and lmp_tacc. They both yielded almost the same answers as the LAMMPS benchmark output.
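
As an illustration of the fit mentioned above, here is a minimal sketch of a least-squares line through MSD vs. time together with its R2 value. It assumes the 3D Einstein relation, MSD = 6 D t, so the diffusivity is taken as slope/6. The function name and the data are hypothetical, not the data from the runs discussed here.

```python
def fit_msd(times, msd):
    """Least-squares line through (time, MSD); returns (slope, intercept, r2)."""
    n = len(times)
    if n < 2 or n != len(msd):
        raise ValueError("need at least two (time, MSD) pairs of equal length")
    mean_t = sum(times) / n
    mean_m = sum(msd) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0.0:
        raise ValueError("all time values are identical")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(times, msd))
    slope = sxy / sxx
    intercept = mean_m - slope * mean_t
    ss_res = sum((m - (slope * t + intercept)) ** 2 for t, m in zip(times, msd))
    ss_tot = sum((m - mean_m) ** 2 for m in msd)
    # R2 is undefined for a perfectly flat MSD, so report NaN in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0.0 else float("nan")
    return slope, intercept, r2


# Synthetic, perfectly linear MSD with D = 0.5 (arbitrary units).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
msd = [6.0 * 0.5 * t for t in times]

slope, intercept, r2 = fit_msd(times, msd)
print("D =", slope / 6.0, "R2 =", r2)
```

A low R2 means the MSD is not yet in the linear regime or is poorly sampled, so a diffusivity taken from that slope carries little weight in any comparison.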

Also, one thing I should have been more specific about is that lmp_ranger was indeed built from an older version of LAMMPS (I think Nov 2010). However, and this is important, there were NO patches relevant to my runs between these two versions. So practically, both versions were exactly the same for the runs I was doing. This has been validated by the benchmark run above. I did not mention this in my earlier email since 1) I typed it in a real hurry and 2) I did not want to add to the confusion, since my main goal was to gather opinions about differences that may arise solely from running executables compiled with different Makefiles on slightly different supercomputing systems.

I am attaching all the files including the excel file herewith again.

Thanks a lot!
Chetan

lmperror.xls (14.5 KB)

log.lammps (5.65 KB)

log.lammps-lmp_ranger (5.62 KB)

log.peptide.28Mar11.linux.4 (5.63 KB)

sjplimp · October 7, 2011, 1:10pm

I think I'm asking a slightly different Q. When the
2 runs are different, I want to know if they started
out identical for the first few 100 or 1000 timesteps
and the thermo output slowly diverged. If so, then
that is typical of running the same version of LAMMPS
on different hardware (or even compiled for the same
machine but with different system software). In that
case a different final quantity (e.g. averaged P or Diff coeff)
is likely just different statistical averages for 2 runs, as if
you started them with different random velocities.

If the 2 runs have very different thermo output even on the
1st step, then that would be something to look into.

Steve
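
The check described in the last post could be sketched as below: given the same thermo quantity from two runs, report the first timestep at which they disagree beyond a tolerance. The function name, the tolerance, and the numbers are assumptions for illustration only, not values from the attached logs.

```python
def first_divergence(steps, values_a, values_b, rel_tol=1e-6):
    """Return the first step where the two series differ by more than rel_tol.

    Returns None if the series agree within rel_tol on every common step.
    """
    for step, a, b in zip(steps, values_a, values_b):
        scale = max(abs(a), abs(b))
        if scale > 0.0 and abs(a - b) / scale > rel_tol:
            return step
    return None


# Made-up total energies sampled every 100 steps.
steps = [0, 100, 200, 300]
run_old = [-100.0, -100.5, -101.0, -101.2]
run_slow_drift = [-100.0, -100.5, -101.3, -99.8]  # agrees at first, then drifts
run_bad_start = [-95.0, -96.0, -97.0, -98.0]      # differs on the very first step

print(first_divergence(steps, run_old, run_slow_drift))
print(first_divergence(steps, run_old, run_bad_start))
```

Because step 0 is a valid answer, compare the result with `is None` rather than testing its truthiness. A first divergence well into the run is consistent with ordinary round-off growth, while a divergence at step 0 is the case worth investigating.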