Debugging 3rd party potential

Dave_Schall1 · February 6, 2024, 9:56pm

Hello,

I am using a 3rd party AIREBO style potential for MoS2 (by Stuart and Spearot). The original authors no longer have funding for MoS2 research so it’s up to me to get it working for my own research. It was created by modifying an older version of AIREBO. I recently fixed it so it would compile in the Aug-2-2023 version of LAMMPS. I had to change lots of memory paging and related things. I used a current version of AIREBO as a guide. And it runs, as expected, some of the time. The problem is it is very temperamental.

I get crashes when trying to run on different numbers of processors (maybe a job runs on 24, but not 48 cores for example), and sometimes, if it crashes, I can simply restart with the exact same conditions and the simulation and will run. By crashes I mean NAN energies, etc.

This seems like some kind of memory related error, but I am really not sure how to check it. Works okay most of the time on a small number of processors, but I want to scale up my systems a bit and need more cores so I need to address this issue.

My question is how does one go about checking for memory related errors? Are there any procedures or templates for including a new potential in LAMMPS? I feel like the functional form is correct, but something with neighbor sharing, MPI, communications etc is not working right. And I need some hints where to look.

Thanks in advance, Dave

akohlmey · February 6, 2024, 10:19pm

Are you by any chance talking about the REBOMOS pair style?
This is provided in an updated version also here: GitHub - lammps/lammps-plugins: Collection of LAMMPS plugins

The first thing that I usually do is to run with valgrind’s memory checker tool.

Please see: 3.2. Submitting new features for inclusion in LAMMPS — LAMMPS documentation

If my hunch from above is correct, you should first try the source code from the lammps-plugins repo and see if it has the same issues. The reason why we put code into the lammps-plugins repo is that they have been ported to recent LAMMPS versions, but are otherwise untested (i.e. they compile and the porting was done by LAMMPS developers who know which changes need to be applied, e.g. typically what is listed here: 4.10. Notes for updating code written for older LAMMPS versions — LAMMPS documentation) and typically there are no confirmed results and no documentation compatible. This is what is required to be included in the LAMMPS distribution (some researchers have a habit to publishing source code with a paper and assume the source code and the paper are sufficient documentation for others). When compiled as a plugin, those styles can be added to a binary LAMMPS distribution without having to compile LAMMPS itself from source, but the source code for the pair style itself is also compatible with adding it to the LAMMPS distribution. To become a plugin, only a plugin loaded needs to be added.

If this doesn’t help, I would suggest you download an old LAMMPS version that is compatible with the unmodified pair style code and see if the issues persist. Then you would have confirmed that the errors are part of the implementation and not the porting. How to continue from there depends on one of multiple things:

you are willing and able to teach yourself the necessary debugging skills and understanding of the implementation and correct all issues
you can convince somebody with those skills to collaborate with you on improving the code (typically that requires offering something of value in exchange for the help)

In any case, to include the finished code, you need to follow the requirements (and ideally also the suggestions) given in the LAMMPS manual, get an official approval of the original authors and then submit a pull request here: Pull requests · lammps/lammps · GitHub

akohlmey · February 19, 2024, 1:00am

@Dave_Schall1 do you have any updates (positive or negative) on this?
Do you need further assistance?

Dave_Schall1 · February 19, 2024, 1:56am

Hi Axel,

Thanks for checking back in.

The short answer is yes, they are the same potential, but the error persists even when I use the plugin version. Some combos of procs run, others fail. I was working on generating a test case that demonstrates the failure but got sidetracked with a paper and a proposal.

I’ll follow up soon.

Dave

Dave_Schall1 · February 19, 2024, 8:01pm

As promised, here is an example set of starting coordinates and a simple input script that breaks pretty consistently for me. To “break” the script I simply added a very small random displacement to all atoms (see line 8). I am running the Aug-23 version of lammps with the rebomos plugin. (Note I can get the potential to break many ways, but this seemed to be a really simple way to demonstrate.)

Coord file: bulk_better_S.full - Google Drive
Script file: in.rebomos-bulk_read - Google Drive

Running with 32 cores, commenting line 8 out runs fine. (output cropped for brevity)

  Step          Temp          Press          PotEng         KinEng         Volume    
         0   0              321671.39     -480280.38      0              1312300.7    
         1   0              231090.59     -495232.33      0              1312300.7    
         2   0              164369.99     -504363.54      0              1312300.7

Running with 32 cores, uncommenting line 8 causes the run to return NAN:

   Step          Temp          Press          PotEng         KinEng         Volume    
         0   0             -nan           -nan            0              1312300.7

Running with 96 cores dies no matter what. (line 8 commented or not).

akohlmey · February 19, 2024, 9:26pm

Can you reproduce the NaN with fewer processors?
And for a smaller system?

akohlmey · February 19, 2024, 10:08pm

Update: I have been able to reproduce the NaN and found the origin. There is a piece of code where it takes the square root of a number that may become negative.

akohlmey · February 20, 2024, 3:28am

@Dave_Schall1 I managed to spot a very serious bug that would have prevented using the pair style correctly in parallel at all: not all parameters in the potential file were communicated from rank 0 to all other ranks. Accessing the undefined/uninitialized data on the remote processes would result in bad values which then triggered the square root of negative numbers which in turn is responsible for the NaNs.

I also spotted some missing contributions to the (global and per-atom) virial, simplified some code, and silenced a ton of warnings due to unused variables.

You can download the updated files from https://github.com/lammps/lammps-plugins/tree/develop/USER-REBOMOS

When running in serial, the forces after a run 0 should be the same with the old version and the new. If not, I will have to take another look.

Update: in keeping with the subject of the original question, I should add some comments about how I tracked down the issues.

First I compiled LAMMPS with a static library setting (the default), with debug info included, and with the define -D LAMMPS_TRAP_FPE=1 set during compilation. That enables trapping floating-point exceptions and creation of coredumps.
I ran and it created an exception in the bondorder() method and coredump when the NaN would otherwise appear. I loaded the coredump in the debugger and saw that it was triggered on a line with a square root and that the values would add to a negative number.
There were no obvious reasons to assume incorrect code from inspecting the source in that function
Now I reran the calculation with valgrind’s memcheck tool. This confirmed the location of the issue and reported access to uninitialized data.
Next I added a separate print statement for each of the variables used in the expression to identify which had uninitialized data. This way I would find that it was the Etmp variable, which was properly initialized, so it had to be one of the expressions that it was initialized from.
This led me further through the code to ultimately find that it were the spline coefficients in the gSpline() function.
Finally, I looked at the read_file() function and noticed that MPI_Bcast() was applied to some arrays with polynomial parameters only with a length of 1 instead of 2 as was the size in the class definition. Changing this to 2 removed created valid values for the square root and the NaN and the report of accessing uninitialized data were gone.

In the process and while comparing the pair_rebomos.cpp file with pair_airebo.cpp (from which it was adapted a loooong time ago) I noticed the inconsistent handling of virial tallying (I had done a refactor and review of this for manybody potentials a few years back and fixed a few inconsistencies there, but since the code was not part of the LAMMPS distribution, it had not had that refactor, but pair_airebo.cpp had), also there were new pairwise force terms added (albeit extremely small for the test structure) and for those the corresponding virial tallying calls needed to be added.

Hope this summary helps people that look to do similar things in the future.

Dave_Schall1 · February 20, 2024, 2:25pm

Thanks Axel,

I ran some quick tests on several different systems where I would sometimes have issues. They all seem to run as expected. I will run some longer runs when I get a chance. Occasionally I’ll would have a job run fine for 100’s or 1000’s of steps, then randomly start spitting out NANs. It’s probably the same issue. I could never pin it down on any of the common reasons why a job would crash (bad initial config, wrong timestep size, etc.)

Based on what you found, I am surprised these ran at all. I mostly run on workstations (2 processors / 96 cores). I guess since the cores share memory, by some dumb luck the same memory address was being used, even though not specifically allocated for use, by all cores some of the time, but not all of the time.

Thank you for your time and effort. I very much appreciate your help in solving this issue.

Dave

akohlmey · February 20, 2024, 2:44pm

No, shared memory has nothing to do with this.
This is due to the malloc() implementation in the GNU libc which is used on Linux.
This reserves blocks of “free” memory by using mmap() on /dev/zero with a copy-on-write flag.
This enables reserving very large chunks of address space at almost no cost as all the pages are initially mapped to the same page. This also makes efficient use of physical memory and avoids occupying swap space for applications that allocate large chunks of address space during initialization but then use only parts of it. This happens often in desktop and especially graphical applications.

For scientific applications, this is often not the case, since memory allocation is usually only done on demand and after computing how much space is needed, since one has to often carefully budget the available RAM (or at least had to in the old days of HPC, today machines have rather plentiful RAM by comparison).

There are two downsides to this malloc() strategy:

You can quickly reserve a lot of RAM without penalty, but then you run out of RAM later when the various pages are modified and converted from shared memory pages to individual memory pages. This then can lead to heavy swapping and eventually a crash when there is no more swap less, but the machine will suffer a lot before it dies. This is why Linux distribution run an “early OOM daemon” that will try to kill processes that suddenly grow their resident memory usage before the machine grinds to a halt.
All initial memory allocations have all bytes set to 0. So your spline evaluation will just return 0. If that is for a property that is close to 0 anyway the result may not show that much. Also, this only applied to certain atom types and certain conditions. The implicit initialization to 0 often hides that there was no proper initialization. This will not always show, but can go unnoticed for as long as the address space is not freed and then later reuse (at which time it will have random data). In LAMMPS this happens often when you issue the “clear” command and then start a second simulation. That will now have allocated memory with random data in it. This is why valgrind is such a valuable tool here. We try to take this into account in the unittest branch where we often retest things after a “clear” command to make certain we get failures (segfault, bad/inconsistent results) if something was not properly initialized. Sadly, the unit test facility only covers about 1/3rd of the code base, so there is much work left to do to “harden” LAMMPS.

Dave_Schall1 · February 20, 2024, 10:33pm

Thank you for the clarification. The energies before fixing the code and after fixing the code do vary slightly and drift away from each other over time but they don’t seem to affect the overall result much. I just reran a phonon density of states calculation that requires about 350 ps of simulation time to complete (so long enough to drift pretty significantly). The final phonon spectra are almost identical. The difference is just in the noise but the important stuff (peak position, magnitude, width) all look the same. This is reassuring.

akohlmey · February 21, 2024, 1:00am

Thanks for the feedback. I am not surprised the differences are small, otherwise the NaN would happen more easily.

If you would now write a page for the manual describing this potential, probably starting from the AIREBO pair style docs as template, this all could be submitted to the LAMMPS repository and become a regular supported pair style (assuming the original authors are ok with it). I can then put this all together and prepare the pull request (there are a few more “administrative” changes required, but they are trivial to do for me). I would do this all by myself, but I don’t know enough about the science.

Dave_Schall1 · February 21, 2024, 1:38pm

No problem. I’ll reach out to the authors. They were aware there was an issue, so I am sure they will be happy to know it was resolved.