Segmentation fault while using KOKKOS

akohlmey · July 8, 2020, 11:08am

Dear Stan and Axel,

Axel:
Thank you very much for your helpful reply. I didn’t answer your mail because I am following your advice and trying to debug step by step, as you clearly stated in your mail. I have not yet solved the issue, though: even after commenting out ALL fixes and doing a run 0, as soon as I uncomment the fix qeq/reax (needed for my script since I am using ReaxFF) I get the SegFault. But I am digging step by step to find where the issue could be.

you can save a lot of time, if you do this in order and specifically get a meaningful stack trace first.

from looking at your input script, that would confirm quickly, that your input is at fault and that the issue has nothing to do with KOKKOS:

you are applying fix qeq/reax to all atoms (and atom types) but ReaxFF only to the first atom type. that cannot be right and is the most likely cause for problems, since in the fix qeq/reax command line you are asking the ReaxFF pair style to provide the parameters required for charge equilibration. how can it have parameters for atom types it doesn’t know anything about? apart from that, this is a bad model, anyway.

… and why use ReaxFF for this kind of system where Tersoff may do just as well with a fraction of the complications and less computational effort?

axel.

Gregoire_Defoort · July 8, 2020, 11:32am

Dear Axel,

Indeed I was (very) mistaken about the fix qeq/reax command. I made the modification by assigning it only to the silicon atom type:

group Si_atoms type 1 1
fix reaxc1 Si_atoms qeq/reax 10 0.0 10.0 1.0e-6 reax/c
From the log I see that all the (5248) Si atoms are correctly assigned to the group, so the fix should apply only to these atoms. But when running the script the segmentation fault still appears.

I am very eager to understand why you consider the model to be bad. Is it because ReaxFF is not suitable for such simulations? Or is it because I have wrong inputs? I am fully aware that my knowledge here is incomplete and I can’t thank you enough for your comments.

In the framework of my project, we decided to use the ReaxFF potential in order to observe the chemical interactions when performing very low energy sputtering (from 50 to 500 eV); that’s why I am not using Tersoff (even though I performed some simulations with it to compare).

Grégoire

akohlmey · July 8, 2020, 11:56am

Dear Axel,

Indeed I was (very) mistaken about the fix qeq/reax command. I made the modification by assigning it only to the silicon atom type:

group Si_atoms type 1 1
fix reaxc1 Si_atoms qeq/reax 10 0.0 10.0 1.0e-6 reax/c
From the log I see that all the (5248) Si atoms are correctly assigned to the group, so the fix should apply only to these atoms. But when running the script the segmentation fault still appears.

it probably is still because of having more atom types in the system than what reaxff knows about and thus when querying reaxff for parameters, i suspect it segfaults when trying to access elements in internal data structures that are not present. i would try with an explicit parameter file for fix qeq/reax: look up the necessary parameters from the reaxff potential file and supply dummy parameters for the rest. that is why i am nagging so much about getting a proper stack trace. that will tell you where exactly the segfault happens and thus eliminate a lot of guessing and wasting time on chasing issues that are irrelevant.
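As a rough illustration of the explicit-file suggestion above, the following Python sketch generates the text of such a parameter file. It assumes the `itype chi eta gamma` line format described in the fix qeq/reax documentation; the function name `format_qeq_params` and all numeric values are hypothetical placeholders, not real ReaxFF parameters.

```python
# Sketch only: build the text of a QEq parameter file for fix qeq/reax.
# Assumed line format (per the fix qeq/reax documentation): itype chi eta gamma
# All numbers below are placeholders, NOT real ReaxFF parameters.

def format_qeq_params(params, ntypes):
    """Return one 'itype chi eta gamma' line for every atom type 1..ntypes."""
    lines = []
    for itype in range(1, ntypes + 1):
        if itype not in params:
            # Every atom type needs an entry, even if it is only a dummy one.
            raise ValueError(f"no QEq parameters given for atom type {itype}")
        chi, eta, gamma = params[itype]
        lines.append(f"{itype} {chi:.4f} {eta:.4f} {gamma:.4f}")
    return "\n".join(lines) + "\n"


# Hypothetical input: type 1 would be looked up in the ReaxFF potential file,
# type 2 is a dummy entry for a type that ReaxFF does not describe.
example_params = {1: (1.0, 2.0, 0.5), 2: (0.0, 1.0, 1.0)}
qeq_text = format_qeq_params(example_params, ntypes=2)
print(qeq_text, end="")
```

The resulting text would be saved to a file (for instance a hypothetical `param.qeq`) whose name then replaces the `reax/c` keyword at the end of the fix qeq/reax line. Whether dummy values are physically acceptable for the non-ReaxFF types is a separate question that the sketch does not address.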

that said it is possible (even though with limited scientific value) to run a reax/c pair style input without fix qeq/reax, if you use the “checkqeq no” option to pair style reax/c. check out the manual!!

I am very eager to understand why you consider the model to be bad. Is it because ReaxFF is not suitable for such simulations? Or is it because I have wrong inputs? I am fully aware that my knowledge here is incomplete and I can’t thank you enough for your comments.

ReaxFF is designed as a holistic model, i.e. all atoms should be described by it and you need a specific parameterization for the specific kind of system. ReaxFF parameters are not very portable. Whenever people use ReaxFF in hybrid models, one has to check very carefully, how much the results are affected by going against a fundamental design principle of the model. Using hybrid styles in LAMMPS is always a compromise, but for ReaxFF I would consider the problems stemming from it the largest.

hybrid models can work ok for mixing multiple pairwise additive potentials (although they often have different strategies for balancing the coulomb and non-coulomb non-bonded interactions, so the cross-interactions can be a big problem as mixing does not automatically result in balanced parameters), or when having separate entities like a workpiece and a tool modifying it, or when there are two parts of a system that don’t interact with each other.

when using embedded atom potentials, things become more difficult due to the incorrect treatment of the embedding term (each substyle doesn’t “see” the other and thus is missing the mutual embedding contributions) and anything that uses charge equilibration is particularly difficult as that leads to all kinds of inconsistencies and complications when the charge equilibration has to be applied to a subset of atoms only. You always have to distinguish between

In the framework of my project, we decided to use the ReaxFF potential in order to observe the chemical interactions when performing very low energy sputtering (from 50 to 500 eV); that’s why I am not using Tersoff (even though I performed some simulations with it to compare).

what kind of “chemical interactions” of what? from the outset that sounds more like you should be using some quantum code, perhaps some semi-empirical tight-binding approach?

axel.

Gregoire_Defoort · July 9, 2020, 8:36am

Dear Axel,

First of all, thank you for your comment. Having a better understanding of the nature of the ReaxFF potential is very helpful for my simulations.

I have decided to give it a try and switch to a full ReaxFF system by adding ReaxFF parameters for argon that I found in a publication.

So I created this “dummy file” from the ReaxFF file I was using previously, which was developed for conditions similar to my simulations, and carefully appended the argon part. Since argon doesn’t bond with the other elements in the file, I could simply append the initial parameters without having to dig into the bond terms and so on (I checked the original ReaxFF file I took the parameters from and it is the same: Ar has no bond, off-diagonal, angle or torsion terms with the other elements). I am still trying to use KOKKOS to run these simulations, and I recompiled in debug mode with the latest LAMMPS patch (30Jun2020).

The input lines of my script have barely changed: I just uncommented the 2 potential lines with the stand-alone ReaxFF potential and commented out the hybrid part. I also restored the fix qeq/reax with all atoms, since Ar is now included in the ReaxFF file, which should give good performance (for that part at least):

pair_style reax/c NULL
pair_coeff * * ffield.reax.SiOCH_Ar_test Si Ar X X X X X X
(line 95 of the script)
fix reaxc1 all qeq/reax 10 0.0 10.0 1.0e-6 reax/c
(line 153)

I do not have a SegFault anymore but I obtain this kind of error:

what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:143
Traceback functionality not available

[gpu004:400526] *** Process received signal ***
[gpu004:400526] Signal: Aborted (6)
[gpu004:400526] Signal code: (-6)
[gpu004:400526] [ 0] /usr/lib64/libpthread.so.0(+0xf5d0)[0x2aaab350b5d0]
[gpu004:400526] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x2aaab3fe8207]
[gpu004:400526] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2aaab3fe98f8]
[gpu004:400526] [ 3] /trinity/shared/apps/tr17.10/x86_64/gcc-7.2.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x125)[0x2aaab37a9225]
[gpu004:400526] [ 4] /trinity/shared/apps/tr17.10/x86_64/gcc-7.2.0/lib64/libstdc++.so.6(+0x8eff6)[0x2aaab37a6ff6]
[gpu004:400526] [ 5] /trinity/shared/apps/tr17.10/x86_64/gcc-7.2.0/lib64/libstdc++.so.6(+0x8f041)[0x2aaab37a7041]
[gpu004:400526] [ 6] /trinity/shared/apps/tr17.10/x86_64/gcc-7.2.0/lib64/libstdc++.so.6(+0x8f284)[0x2aaab37a7284]
[gpu004:400526] [ 7] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29b4b46]
[gpu004:400526] [ 8] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29bca31]
[gpu004:400526] [ 9] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x54f681]
[gpu004:400526] [10] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29bc916]
[gpu004:400526] [11] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29be6b3]
[gpu004:400526] [12] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29b0b5f]
[gpu004:400526] [13] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29b30ef]
[gpu004:400526] [14] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x751fef]
[gpu004:400526] [15] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x749702]
[gpu004:400526] [16] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x9a2049]
[gpu004:400526] [17] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x99d718]
[gpu004:400526] [18] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x8268cb]
[gpu004:400526] [19] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0xa04c17]
[gpu004:400526] [20] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x534dcc]
[gpu004:400526] [21] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x41c6db]
[gpu004:400526] [22] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x4105a6]
[gpu004:400526] [23] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x40cfa8]
[gpu004:400526] [24] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x40b50f]
[gpu004:400526] [25] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab3fd43d5]
[gpu004:400526] [26] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x40b3d9]

I also wanted to comment on the fact that the script I previously sent runs fine on CPU (for instance with the icc_openmpi compiled version) even with the faulty fix qeq/reax command. It is only when I try to switch to GPU that the errors appear. I (now) understand much better why the script wasn’t a good model, but why does it work on CPU and not on GPU? Is it because the GPU build is more sensitive to bad parameters or has a different way of setting up the neighbor lists?

To answer your question about chemical interactions, we plan to observe oxidation and water-silicon interactions in the future, and we thought that would be better described with ReaxFF than Tersoff / Stillinger-Weber / ZBL. That’s why I am trying to use this potential. But the tight-binding approach is a clever suggestion.

Thank you again for the comments and all the help,

Best regards,
Grégoire

akohlmey · July 9, 2020, 1:38pm

I do not have a SegFault anymore but I obtain this kind of error:

what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:143

this is essentially equivalent to a segmentation fault on the GPU device.

Traceback functionality not available

[…]

[gpu004:400526] [ 7] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29b4b46]
[gpu004:400526] [ 8] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x29bca31]
[gpu004:400526] [ 9] /home/int/sam/defoort/lammps/lammps-patch_30Jun2020/build_07072020/lmp_mpi[0x54f681]

stack traces from executables without function/filename/line info are useless. I think I mentioned that already twice. If you are figuring out why something crashes, this provides vital information. you are just making your life and ours needlessly more difficult.

I also wanted to comment on the fact that the script I previously sent runs fine on CPU (for instance with the icc_openmpi compiled version) even with the faulty fix qeq/reax command. It is only when I try to switch to GPU that the errors appear. I (now) understand much better why the script wasn’t a good model, but why does it work on CPU and not on GPU? Is it because the GPU build is more sensitive to bad parameters or has a different way of setting up the neighbor lists?
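One possible way to turn raw frames like the ones quoted above into function/file/line information is to pass the addresses to the binutils tool `addr2line`. The Python sketch below only extracts the addresses and assembles the command; the helper name `addr2line_command` and the shortened `/opt/build` path are made up for illustration. It assumes the executable was built with debug info (`-g`) and is not position-independent, otherwise the raw addresses would need further translation.

```python
import re
import shlex

# Matches Open MPI style backtrace frames such as
#   [gpu004:400526] [ 7] /path/to/lmp_mpi[0x29b4b46]
FRAME_RE = re.compile(r"\[\s*\d+\]\s+(\S+?)\[(0x[0-9a-fA-F]+)\]\s*$")


def addr2line_command(trace_lines, executable_name):
    """Collect the frame addresses that belong to one executable and build
    an addr2line command line for them (None if no such frame is found)."""
    binary = None
    addresses = []
    for line in trace_lines:
        match = FRAME_RE.search(line)
        if match and match.group(1).endswith("/" + executable_name):
            binary = match.group(1)
            addresses.append(match.group(2))
    if binary is None:
        return None
    # -f prints function names, -C demangles C++ symbols, -e names the binary
    return "addr2line -f -C -e " + shlex.quote(binary) + " " + " ".join(addresses)


# Frames in the format quoted above; the path is a shortened, hypothetical one.
sample_trace = [
    "[gpu004:400526] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x2aaab3fe98f8]",
    "[gpu004:400526] [ 7] /opt/build/lmp_mpi[0x29b4b46]",
    "[gpu004:400526] [ 8] /opt/build/lmp_mpi[0x29bca31]",
]
print(addr2line_command(sample_trace, "lmp_mpi"))
```

Running the printed command on the machine that holds the binary would list one function name and source location per address, provided the debug information is really present. Frames from shared libraries are skipped here because their addresses are load addresses rather than offsets into the file.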

let me guess, you have never tried a hand at GPU programming, right?

the first course of action here is to compile a plain KOKKOS binary, i.e. without GPU (and OpenMP) support and figure out whether that fails as well.
If yes, you need to run with a debugger and/or valgrind’s memory checker to identify the location and cause of the incorrect memory access and which part of LAMMPS triggers it. If it is only the GPU version, then there may be a problem where host and device data is not updated as needed, or some unexpected and unchecked data manipulation is done.

To answer your question about chemical interactions, we plan to observe oxidation and water-silicon interactions in the future, and we thought that would be better described with ReaxFF than Tersoff / Stillinger-Weber / ZBL. That’s why I am trying to use this potential. But the tight-binding approach is a clever suggestion.

If there isn’t a ReaxFF parameterization specifically for this purpose, then quantum chemistry is your only option.

Axel.