changes in special.cpp

Dear all,

We recently changed from LAMMPS-16Mar18 to LAMMPS-7Aug19, but found a huge difference in E_vdwl and E_coul between the two versions.

16Mar18:

Time Volume PotEng Temp Cella Cellb Cellc CellAlpha CellBeta CellGamma E_bond E_angle E_dihed E_impro E_vdwl E_coul E_long
0 67398.851 4949.1186 100 40.1612 37.368 46.816677 90 106.4069 90 16417.349 323.13317 27.906241 0.001846363 -6236.5448 3306.2783 -8889.0052

7Aug19:

Time Volume PotEng Temp Cella Cellb Cellc CellAlpha CellBeta CellGamma E_bond E_angle E_dihed E_impro E_vdwl E_coul E_long
0 67398.851 38802.71 100 40.1612 37.368 46.816677 90 106.4069 90 16417.349 323.13317 27.906241 0.001846363 23298.102 7625.2225 -8889.0052

Another difference in the output was the maximum number of special neighbors:

16Mar18:

18 = max # of special neighbors

7Aug19:

17 = max # of special neighbors

I noticed that "special.cpp" has changed substantially between the two versions. Recompiling LAMMPS-7Aug19 with special.cpp and special.h from LAMMPS-16Mar18 results in a PotEng of 4949.1186 again.

What is the cause of this difference? For this case, coronene crystals are no longer stable with the new version.

The input file is:

units real

atom_style full

bond_style harmonic
angle_style harmonic
kspace_style ewald 1e-6
#dihedral_style harmonic
dihedral_style harmonic
improper_style cvff
pair_style buck/coul/long 10.0
pair_modify shift yes
special_bonds lj 0.0 0.0 0.0 coul 0.0 0.0 0.0

box tilt large

read_data supercell.data

replicate 4 8 3
velocity all create 100 12

thermo_style custom time vol pe temp cella cellb cellc cellalpha cellbeta cellgamma ebond eangle edihed eimp evdwl ecoul elong
thermo 1000

dump 1 all custom 1000 coronenegamma.lammpstrj element xu yu zu q
dump_modify 1 element C H sort id

timestep 0.5

fix 1 all npt temp 100 300 40 tri 1 1 400
run 1

With kind regards,

Herma Cuppen

Herma,

information about the changes in special.cpp can be gathered from looking through the commit comments in the commit history: https://github.com/lammps/lammps/commits/master/src/special.cpp

the major difference is the switch from a ring communication to a rendezvous algorithm to improve parallel performance when running on many MPI ranks. the code has passed many tests, so seeing a difference like the one you are reporting is a bit of a surprise. the only unusual command in your input is “box tilt large”. is this absolutely required? could you try a system that doesn’t require this command?
does the failure happen consistently regardless of the number of MPI ranks used?

in general, the best way to make certain that some issue like this one is being looked into properly is to file a report with suitable files to (quickly!) reproduce the issue at the LAMMPS issue tracker on github. https://github.com/lammps/lammps/issues
mails to the mailing list have a tendency to get overlooked if nobody has the time to react to them immediately.

thanks,
axel.

I’ll second Axel in that if you post (mail list or GHub) a small-as-possible
example that quickly gives a different answer between the 2 versions,
we will look at it quickly. If this is in fact a bug with the new version
of special.cpp, we would want to know.

Steve

Dear Steve and Axel,

I reduced the input file a bit.
The result is independent of the number of cores. I typically use Intel compilers in combination with OpenMPI, but I get the same results if I just use GNU compilers. The Ubuntu version also does not make a difference.

I tried different “special_bonds” settings. This of course changes the outcomes, but the two versions keep giving different results.

The data file can be found here:

Herma
___________
units real
atom_style full
bond_style harmonic
angle_style harmonic
kspace_style ewald 1e-6
dihedral_style harmonic
improper_style cvff
pair_style buck/coul/long 10.0
pair_modify shift yes
special_bonds lj 0.0 0.0 0.0 coul 0.0 0.0 0.0
read_data supercell.data
replicate 2 4 1
velocity all create 100 12
thermo_style custom time vol pe evdwl ecoul elong
thermo 1000
timestep 0.5
fix 1 all nvt temp 300 300 40
run 0

So I tried 3 versions on these new inputs you sent:
16Mar18, 7Aug19, and the current GHub master.

All give identical answers for Evdwl and Ecoul.
And the same max # of special neighbors (18).

This is on 1 core of my Broadwell desktop with GNU compiler
and OpenMPI.

What am I supposed to be seeing?

Steve

I made a mistake in my previous email: using a GNU compiler does make a difference. This morning I made several new builds using different compilers, and it appears to be an Intel problem. I use icc in combination with OpenMPI, since Intel MPI (version 19.0.1.144) gives a memory leak. I have now also compiled with a newer version of the Intel compiler (make intel_cpu_intelmpi), but I have not tested it for the memory leak yet, since those test runs take an hour or so. This build still gives wrong results. If I make the simulation cell a little larger, the "max # of special neighbors" starts to deviate as well.

Herma

16Mar18 - intel openmpi (Version 19.0.1.144)
3 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
18 = max # of 1-4 neighbors
18 = max # of special neighbors
Time Volume PotEng Temp E_vdwl E_coul E_long
0 5616.5709 412.43142 100 -519.71207 275.52319 -740.74556

7Aug19 - mpi
3 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
18 = max # of 1-4 neighbors
18 = max # of special neighbors
Time Volume PotEng Temp E_vdwl E_coul E_long
0 5616.5709 412.43142 100 -519.71207 275.52319 -740.74556

7Aug19 - intel openmpi (Version 19.0.1.144)
3 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
18 = max # of 1-4 neighbors
18 = max # of special neighbors
Time Volume PotEng Temp E_vdwl E_coul E_long
0 5616.5709 2368.5426 100 1202.6081 509.31419 -740.74556

7Aug19 - intel_cpu_intelmpi (Version 19.0.5.281)
3 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
18 = max # of 1-4 neighbors
18 = max # of special neighbors
Time Volume PotEng Temp E_vdwl E_coul E_long
0 5616.5709 2368.5426 100 1202.6081 509.31419 -740.74556

replicate 4 8 3 instead of replicate 2 4 1 (7Aug19 - intel_cpu_intelmpi)
3 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
18 = max # of 1-4 neighbors
17 = max # of special neighbors
Time Volume PotEng Temp E_vdwl E_coul E_long
0 67398.851 38802.71 100 23298.102 7625.2225 -8889.0052

replicate 4 8 3 instead of replicate 2 4 1 (7Aug19 - mpi)
3 = max # of 1-2 neighbors
6 = max # of 1-3 neighbors
18 = max # of 1-4 neighbors
18 = max # of special neighbors
Time Volume PotEng Temp E_vdwl E_coul E_long
0 67398.851 4949.1186 100 -6236.5448 3306.2783 -8889.0052

This is puzzling b/c special.cpp does not do anything numerical.
It is simply figuring out the topology of molecules. So I wouldn’t
think the compiler matters. Also if you “make the simulation cell
larger” by simply replicating more, that doesn’t make individual
molecules any larger, just more of them. So it’s not clear
how replication can change the max special neighbor count.

Axel, any ideas?

Steve

Some versions of the Intel compilers have known issues where they miscompile correct code, especially with high optimization settings. Outside of the USER-INTEL package, which benefits from the superior vectorization capabilities of the Intel compilers, there is little reason to use them over GCC or Clang. The performance difference compared to LAMMPS executables compiled with suitably recent GCC or Clang compilers is often negligible and dwarfed by the performance gains from properly tuning runtime settings in the LAMMPS input file (neighbor lists, load balancing, MPI vs. multi-threading, Coulomb cutoff vs. kspace, processor mapping, etc.).

Mind you, this is not limited to Intel compilers, but in my personal experience, they have been more prone to this in recent years. My guess is that this is due to more aggressive attempts at vectorization, which seems to collide with some of the programming style employed in LAMMPS.

This is puzzling b/c special.cpp does not do anything numerical.
It is simply figuring out the topology of molecules. So I wouldn’t

but if the compiler misses data dependencies when trying to vectorize loops, it can produce incorrect neighbor lists, which in turn will result in incorrect energies/forces.

think the compiler matters. Also if you “make the simulation cell
larger” by simply replicating more, that doesn’t make individual
molecules any larger, just more of them. So it’s not clear
how replication can change the max special neighbor count.

again, this would be due to miscompiled code. with more atoms/molecules, the neighbor lists would be different, loops would process atoms in a different order and in different subdomains, and thus it is quite possible to see different results, exclusion counts, etc.
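
to make this concrete, here is a minimal sketch (made-up code, not from LAMMPS) of how a violated no-alias promise can silently change loop results:

#include <cstdio>

// the __restrict qualifiers promise the compiler that dst and src never
// overlap, so it may preload src or vectorize the loop.
void shift(int *__restrict dst, const int *__restrict src, int n) {
  for (int i = 0; i < n; ++i) dst[i] = src[i] + 1;
}

int main() {
  int a[6] = {1, 0, 0, 0, 0, 0};
  // this call breaks the promise: dst and src overlap by one element.
  // a strictly scalar loop prints 1 2 3 4 5 6; a compiler that trusts the
  // no-alias promise and preloads src can produce e.g. 1 2 1 1 1 1.
  shift(a + 1, a, 5);
  for (int i = 0; i < 6; ++i) printf("%d ", a[i]);
  printf("\n");
  return 0;
}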

axel.

Whilst it’s not necessarily the compiler, since undefined behavior takes many sneaky forms that manifest differently between compilers, we can probably shed light on how aggressive the Intel compiler is being by seeing the makefile being used. Normally, auto-vectorization in the Intel compiler is very conservative at the default -O2 and -O3 settings as far as I’m aware (in my experience the auto-vectorization report for LAMMPS fails for most primary loops, such as the force computation, due to pointer accesses and function calls), so it would probably have to be a more aggressive setting.

Adrian Diaz

I just use the standard makefile (Makefile.intel_cpu_openmpi), which uses -O2 with some additional flags.
When I did some timings, I noticed some speed benefit when using the Intel compilers in combination with USER-INTEL. That was initially the reason for me to use them. We also use quantum chemistry codes in our group, where the Intel compilers are more beneficial.
Since we got the new 2019 compilers, I have had many problems. I will stick to GNU compilers in the future.

Perhaps the Intel makefiles could come with some warning?

Herma

Changes in special.cpp for the Aug 2019 release, combined with changes in the 2019 Intel compilers, resulted in an issue for code segments like:

onethree[i][nspecial[i][1]++] = outbuf[m].partnerID;

where temporary objects from dereferencing with the inline ++ operation result in hidden aliases that break code compiled with the “-fno-alias” option.
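
For illustration only, here is a minimal, self-contained sketch with stand-in variables (nspecial13, onethree_i, and partnerID are made up for the example; this is not the actual change that went into LAMMPS) showing the post-increment pulled out of the index expression:

#include <cstdio>

int main() {
  int nspecial13 = 0;        // stands in for nspecial[i][1]
  int onethree_i[8] = {0};   // stands in for the row onethree[i]
  int partnerID = 42;        // stands in for outbuf[m].partnerID

  // one-line form:  onethree_i[nspecial13++] = partnerID;
  // two-line form with the increment made explicit:
  onethree_i[nspecial13] = partnerID;
  nspecial13++;

  printf("%d %d\n", nspecial13, onethree_i[0]);   // prints: 1 42
  return 0;
}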

Options are:

  • Use Intel 2018 compilers
  • Use Intel 2019 compilers and remove “-fno-alias” from the Makefile flags, with some performance impact on the USER-INTEL package.
  • Use the 2019 compilers with a more recent version of LAMMPS (since Sep 19) that includes this fix: https://github.com/lammps/lammps/pull/1683

As Axel noted, if you are using the USER-INTEL package, the Intel compilers really are a benefit due to a lot of compiler-generated vector code. This should perform significantly better on supported simulations.

Sorry that you (also) had to fight with this issue.

Best, - Mike

Thanks Mike. Should we just convert these single lines in special.cpp

onethree[i][nspecial[i][1]++] = outbuf[m].partnerID;

to 2 lines each? There are only a handful of them.
That way users don’t have to worry about compiler version
or Makefile settings?

Steve

Thank you, Mike.

With option 2 I now indeed get the correct result.

I guess option 3 also requires some additional changes to special.cpp as Steve suggests?

Herma

Thanks Herma and Steve.

I guess option 3 also requires some additional changes to special.cpp as Steve suggests?

LAMMPS versions later than 19 Sep 2019 shouldn’t have any issues. Code changes that allow the same performance without the “-fno-alias” flag for USER-INTEL are included in those versions, and the Makefiles have the flag removed.

Should we just convert these single lines in special.cpp

I’m not really an expert in C++ legalese, but this does seem like more of a compiler issue to me. Since there are current workarounds, we probably don’t need to change the current version of LAMMPS. I’ll dig a little more to see whether it makes sense to change the code in LAMMPS. Our compilers are in a transition now that makes things a little tricky. Thanks for offering to help here.

Best, - Mike