LAMMPS result validation

Hi Team

I am trying to validate the floating-point differences when comparing benchmark results built with various compilers.

Currently I have two questions.

  1. For the INTEL package, GCC is not working properly. What is the accepted choice of compiler for benchmarking the in.rhodo.scaled and in.lj datasets with the INTEL package?
  2. For in.rhodo.scaled and in.lj I am running with the variables x, y, z set to 8, 8, 8 and measuring the Total Eng value. In this case:
    a. What is the ideal value of the total energy?
    b. What is the tolerance for floating-point differences if the benchmarks are performed with other compilers?

Thanks
Sharad

What version of LAMMPS are you using for this?

What do you mean by “not working properly”? And what version of GCC? “Not working” is a very unscientific description. Please provide details and an explanation of how you arrived at your conclusion instead of making only a dismissive statement. Otherwise it is impossible to properly follow up on your findings and make corrections, if needed.

The INTEL package is intended to be compilable, and thus capable of producing an executable that can run the inputs and produce repeatable results, with any compiler (e.g. GCC, Clang, Intel, Nvidia SDK, etc.). This is regularly tested through the unit test library. However, currently only the Intel compiler fully supports the various vectorization directives included in the INTEL package sources and thus will produce faster running executables. Other compilers will ignore some or most of those.
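If you want to check your own build against those tests, a rough sketch with a CMake build looks like this (the ../cmake source path and the package selection are just examples, adjust to your setup):

    cmake -D PKG_INTEL=on -D ENABLE_TESTING=on ../cmake
    cmake --build . --parallel
    ctest --output-on-failure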

Since floating-point math is not associative (i.e. the exact result of a sum depends on the order in which the numbers are added), there cannot be an absolute “correct” energy. That value will depend crucially on multiple factors: the number of parallel processes, the division of parallelization between threads and MPI ranks, the choice of compiler, and the choice of compiler optimization settings.
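As a minimal standalone illustration of this non-associativity (plain C++, nothing LAMMPS-specific):

    #include <cstdio>

    int main() {
      // Summing the same three numbers in two different orders gives
      // two different results in finite-precision arithmetic.
      double a = 1.0e16, b = -1.0e16, c = 1.0;
      double left  = (a + b) + c;   // (0.0) + 1.0 -> 1.0
      double right = a + (b + c);   // b + c rounds back to -1.0e16, so the sum is 0.0
      printf("left  = %.17g\nright = %.17g\n", left, right);
      return 0;
    }

A parallel sum over per-atom energies behaves the same way: changing the number of MPI ranks or the vectorization changes the summation order and therefore the last digits of the result.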

MD simulations are chaotic systems, i.e. they are subject to Lyapunov instabilities, also known as the “butterfly effect”. That means that even the tiniest difference in the force computation will eventually lead to diverging trajectories. How quickly this divergence happens depends crucially on the features chosen inside the simulation. E.g. pair styles that use spline tables internally are more sensitive, since even the tiniest difference for a point that is very close to a spline tabulation point can lead to using one or the other of two neighboring spline parameters and thus increase the divergence. Similarly, using fix npt accelerates the divergence because of the rescaling of all positions in every step when the volume is adjusted: this adjustment is based on the pressure, which is quite sensitive to floating-point math issues because it is a sum of values with opposite signs and rather large differences in magnitude, so the impact of the non-associative behavior of floating-point math is enhanced.

However, despite these divergences, the different trajectories are usually equally valid since they are still sampling the same phase space. In fact, there are simulation techniques like PRD (parallel replica dynamics) that are based on this fact and deliberately introduce differences in the initial settings (like particle velocities) to “randomize” the trajectories and thus do improved statistical sampling of rare events by running multiple decorrelated trajectories concurrently instead of a single long trajectory until the rare event happens.
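To get a feeling for how quickly such a tiny perturbation can grow, here is a toy example using a simple chaotic map (the logistic map) rather than an MD simulation; the two trajectories start with a difference of only 1.0e-15:

    #include <cstdio>
    #include <cmath>

    int main() {
      // Two trajectories of the chaotic logistic map x -> 4*x*(1-x),
      // started with initial values that differ by 1.0e-15.
      double x1 = 0.3;
      double x2 = 0.3 + 1.0e-15;
      for (int step = 0; step <= 60; ++step) {
        if (step % 10 == 0)
          printf("step %2d: |x1 - x2| = %.3e\n", step, std::fabs(x1 - x2));
        x1 = 4.0 * x1 * (1.0 - x1);
        x2 = 4.0 * x2 * (1.0 - x2);
      }
      return 0;
    }

The difference grows roughly exponentially and reaches order one after a few dozen steps; last-digit force differences in MD behave qualitatively the same way, just over more timesteps.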

Hi Akohlmey
I am using the 8 Feb 2023 version of LAMMPS.

What do you mean by “not working properly”? And what version of GCC? “Not working” is a very unscientific description. Please provide details and an explanation of how you arrived at your conclusion instead of making only a dismissive statement. Otherwise it is impossible to properly follow up on your findings and make corrections, if needed.

I am using GCC 13.1.0 to build LAMMPS 8 Feb 2023. Below are my run parameters:

  1. Run command: "mpirun -mca btl vader,self -np ${ncpu} --map-by core --bind-to core $LMP -var x 8 -var y 8 -var z 8 -in in.rhodo.scaled -sf intel -pk intel 0"
  2. Issue faced: each process hits a segmentation fault:
    =================================
    ==== backtrace (tid:2622070) ====
    0 0x0000000000012ce0 __funlockfile() :0
    1 0x00000000010e7c45 LAMMPS_NS::FixIntel::FixIntel() ???:0
    2 0x000000000067f781 style_creator<LAMMPS_NS::Fix, LAMMPS_NS::FixIntel>() modify.cpp:0
    3 0x000000000068ab2b LAMMPS_NS::Modify::add_fix() ???:0
    4 0x000000000068d771 LAMMPS_NS::Modify::add_fix() ???:0
    5 0x0000000000464ba5 LAMMPS_NS::Input::package() ???:0
    6 0x000000000046ab56 LAMMPS_NS::Input::execute_command() ???:0
    7 0x000000000046b792 LAMMPS_NS::Input::one() ???:0
    8 0x00000000006de714 LAMMPS_NS::LAMMPS::post_create() ???:0
    9 0x000000000070e554 LAMMPS_NS::LAMMPS::LAMMPS() ???:0
    10 0x000000000045c10f main() ???:0
    11 0x000000000003aca3 __libc_start_main() ???:0
    12 0x000000000045da9e _start() ???:0
    =================================

Clearly there is some issue with GCC, and we have added a patch for the AOCC compilers to work with the INTEL package. So the package is not well tested with all available mainstream compilers. My intention in asking was to find the default testing compilers for the INTEL package that have full test coverage. I get the point that we should mainly work with the Intel compiler to leverage the full vectorization capability and intrinsics provided by the package.

Let me give some more details on the same.
Given that the processors, the x, y, z variables, the thread count, and the MPI ranks are the same across compilers, with the same MPI pinning: what would be the ideal way of selecting a key performance indicator? Suppose I select the Total Eng value. What would be the accepted variation in the floating-point result, given that all parameters except the compiler are the same?

I have already explained that exact reproduction of the numbers is not possible and that they diverge exponentially.

So what you say you want to do and want to know makes no sense to me.

When you get a crash, always check with the latest release. There may be a bug in your version that has already been fixed.

On a very practical note – did you compile with debug info? (-g)
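For a CMake build that would be, for example, something like this (the ../cmake path is just the usual out-of-source layout, adjust as needed):

    cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo ../cmake
    cmake --build . --parallel

RelWithDebInfo keeps optimization enabled while adding -g, so the backtrace gets usable file and line information.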

Depending on your machine configuration, you may just be running out of RAM. Have you tried with a smaller replication? This will happen specifically with the INTEL package, because it will (more than) duplicate the storage requirements in order to have the data available in specially aligned and ordered ways, so it can vectorize better. It will also maintain caches of the data in single and double precision, since you can switch between full double precision, full single precision, and mixed precision at run time (the default is mixed precision).
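For example (a sketch, using a smaller replication; adjust the binary name and process count to your setup), the precision can be selected at run time via the package options:

    mpirun -np 4 ./lmp -in in.rhodo.scaled -v x 4 -v y 4 -v z 4 -sf intel -pk intel 0 mode double

Running the same input with mode single, mode mixed, and mode double is also a quick way to see how much the reported energies move purely from the precision choice.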

On my laptop with 16 GB of RAM I can run a 4x4x4 system with 4 MPI processes; at 6x6x6, for example, I run out of memory.

FYI, I just ran with LAMMPS version 2 August 2023 Update 1 on our big memory server with:
mpirun -np 12 ../build/lmp -in in.rhodo.scaled -v x 8 -v y 8 -v z 8 -sf intel
and there was no crash. This is the configuration shown by lmp -h:

OS: Linux "CentOS Linux 7 (Core)" 3.10.0-1160.99.1.el7.x86_64 x86_64

Compiler: GNU C++ 11.2.0 with OpenMP 4.5
C++ standard: C++11
MPI v3.1: Open MPI v4.0.4, package: Open MPI install@compute Distribution, ident: 4.0.4, repo rev: v4.0.4, Jun 10, 2020

Accelerator configuration:

OPENMP package API: OpenMP
OPENMP package precision: double
INTEL package API: OpenMP
INTEL package precision: single mixed double

Yes, I have done that… the debug options are also enabled.

Hi Akohlmey
I am running this on an HPC cluster with 512 GB of RAM.

The total number of cores in a single node is 256; that is why I chose the 8, 8, 8 configuration.

Thanks for the confirmation. As I said earlier, the latest GCC release (13.1.0) is failing with the INTEL package for the rhodo scaled dataset. I tried un-aligning the stores and loads in the Intel intrinsics .cpp file, but that did not result in success.

Thanks for this valuable input. Based on my machine configuration and number of cores, 8x8x8 was working fine with the Intel and AOCC compilers.

In your report you were using an older LAMMPS version, not the latest release.
I can run the INTEL package on my laptop with Fedora 38, which uses GCC 13.2.1, just with a smaller system due to lack of RAM.

Thanks Akohlmey
Let me try with the latest LAMMPS release and update.