BOP potential cannot proceed with multiple cores

_Wang_Jiaqi · December 15, 2019, 4:05am

Dear LAMMPS users,

I am trying to calculate the interface energy of the Al and Al2Cu system at 550 K, using the bond-order potential (BOP) for Al-Cu (https://www.sciencedirect.com/science/article/abs/pii/S092583881631009X?via%3Dihub).
The input file is shown below. The general procedure is that I first relax the system for 100000 steps and then run another 100000 steps for data production. The system contains 144 atoms.
I ran this on our group server as well as on Stampede2 and Comet (XSEDE resources). The problem is the same on all of these machines: when I run with just 1 core the simulation proceeds, but when I increase to 12 cores the simulation stops due to machine errors. The LAMMPS version I am using is lammps-07Aug19. I have also attached the data file, the potential file, the error message, and the input file. This error seems to be related to the memory of the machine itself rather than to an input error.

I would really appreciate it if anyone could provide a hint regarding this error. Thanks a lot for your time and patience! I understand that everyone is busy.

#--------------------------- Initialization--------------------------------------------------------

units metal

dimension 3

boundary p p p

atom_style atomic

newton on

variable mytemp index 550

variable mypress index 0

variable thermostep index 10

variable timestep index 0.001

variable runstep equal 100000

variable relaxstep equal 100000

variable medium equal ${runstep}/${thermostep}

variable material string Al+Al2Cu

#----------------------------Create geometry from a structure file---------------------------------

read_data Al+Al2Cu.data

neighbor 0.3 bin

#----------------------------Apply potential-------------------------------------------------------

pair_style bop

pair_coeff * * AlCu.bop.table Al Cu Al

comm_modify cutoff 15

#----------------------------Thermal setting-------------------------------------------------------

timestep ${timestep}

thermo ${thermostep}

thermo_style custom cpu time step temp pe ke etotal

#----------------------------Energy minimization and system relaxation-----------------------------------------------------

fix deletemomentum all momentum 1 linear 1 1 1 angular rescale

group Alalloy type 1

group Cu type 2

group Al2Cu union Alalloy Cu

dump minimization_Al2Cudump all custom 1 mini_Al+Al2Cu_temp${mytemp}.xyz id type x y z

dump_modify minimization_Al2Cudump sort id

velocity all create ${mytemp} 256 dist gaussian

min_style cg

minimize 1e-10 1e-10 5000 50000

reset_timestep 0

undump minimization_Al2Cudump

#---------------------------Run relaxation------------------------------------------------

fix npt all npt temp ${mytemp} ${mytemp} 0.1 iso ${mypress} ${mypress} 1

run ${relaxstep}

#--------------------------------data production-----------------------------------------

variable atemp equal "temp"

variable aetotal equal "etotal"

fix Aveoutput all ave/time ${thermostep} ${medium} ${runstep} v_atemp v_aetotal file Avethermaldata_${material}_${mytemp}.results

run ${runstep}

write_data ${material}_${mytemp}.data
#---------------------------------------finished---------------------------------------------------------------

Al+Al2Cu.data (4.84 KB)

Al2Cu+Al_C_ADP.indis (2.28 KB)

AlCu.bop.table (448 KB)

error.file (442 KB)

akohlmey · December 16, 2019, 1:48am

Thanks for reporting this. There are actually two issues with this input deck. If you ran a more recent version of LAMMPS, it would not even be able to read the potential file: the file is incomplete, and since we added general checking for incomplete reads to the reading of potential files, this input stops even earlier.
When I disable this check (so that reading the potential just keeps reusing the last line of the potential file), I can reproduce the error message you are getting. This points to memory corruption due to out-of-bounds memory accesses (according to valgrind, those are apparently off by one index) in several places in the PairBOP::sigmaBo() function (lines 1470, 1483, 2807, 2820).
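
To illustrate what a check for incomplete reads does in general, here is a minimal Python sketch. It is not the LAMMPS implementation: the function name `read_values` and the flat, whitespace-separated number layout are assumptions made purely for illustration, and the real BOP table format is not reproduced here.

```python
def read_values(lines, expected):
    """Read exactly `expected` floats from an iterable of text lines.

    Hypothetical helper: raises EOFError on a short read instead of
    silently reusing stale data, which is the failure mode a truncated
    table file would otherwise hide.
    """
    values = []
    for line in lines:
        values.extend(float(token) for token in line.split())
        if len(values) >= expected:
            return values[:expected]
    raise EOFError(
        f"unexpected end of file: expected {expected} values, found {len(values)}"
    )


if __name__ == "__main__":
    complete = ["1.0 2.0", "3.0 4.0"]
    truncated = ["1.0 2.0", "3.0"]
    print(read_values(complete, 4))
    try:
        read_values(truncated, 4)
    except EOFError as err:
        print("short read detected:", err)
```

Failing loudly at read time is the point of such a check: a reader that keeps going after the data runs out leaves the table only partly filled, and the problem surfaces much later, far from its cause.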

I am copying the authors of the pair style and the potential file in the hope that they will investigate and can provide a correction for both the potential file and the source code.

Axel.

_Ward_Donald · December 16, 2019, 2:32am

Sorry, are you trying to run with 12 atoms/core, or am I not understanding? The potential cannot handle that few atoms with the way it stores neighbors. This simulation should only run on 1 core with this potential.
Don

_Wang_Jiaqi · December 16, 2019, 2:47am

Thanks all for the kind response!

Axel: Yes, you are right that there are actually two problems with this simulation, and I reproduced the other error, namely that the latest version cannot read the potential file (the error is: unexpected ending of the potential file). The LAMMPS version I used to reproduce the error is https://github.com/lammps/lammps.

akohlmey · December 16, 2019, 3:54am

Don,

Sorry, are you trying to run with 12 atoms/core or am I not understanding. The potential can not handle that few atoms with the way it stores neighbors.

Where are the exact limits for that? And can you elaborate a bit more on why this is not possible? There may be options in LAMMPS to change internal settings that you may not be aware of. At the very least there should be a check for when the minimum requirements are not met, and the code should either abort or print a suitable warning.

On top of that we have the issue that the AlCu.bop.table potential causes a short read and thus fails with recent LAMMPS versions that check for such conditions. This potential file should either be corrected or removed from the LAMMPS distribution.

Axel

_Zhou_Xiaowang · December 16, 2019, 4:17am

Perhaps Don means the minimum size, not the minimum number of atoms. BOP goes out to many neighbor layers away, so you want to avoid a neighbor being a periodic image of another neighbor.
Sounds like the two issues can and should be fixed. What I do not understand, though, is that we have used the BOP extensively without the problems. Is the case referred to here special?
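
The size guideline above can be turned into a quick geometric test. The sketch below is an illustration, not part of LAMMPS or the BOP code: it assumes an orthogonal periodic box, and the function name and the numbers are hypothetical. Along one axis, an atom can have both a neighbor and that neighbor's periodic image within the cutoff only if the box length is at most twice the cutoff.

```python
def axes_with_duplicate_images(box_lengths, cutoff):
    """Return indices of axes where the box length is <= 2 * cutoff.

    Assumes an orthogonal box that is periodic along every axis.
    On the returned axes, an atom and one of its periodic images can
    both fall inside the same neighbor sphere of radius `cutoff`.
    """
    return [
        axis
        for axis, length in enumerate(box_lengths)
        if length <= 2.0 * cutoff
    ]


if __name__ == "__main__":
    box = (8.0, 8.0, 40.0)  # hypothetical box lengths in Angstrom
    cutoff = 6.0            # hypothetical interaction range in Angstrom
    print(axes_with_duplicate_images(box, cutoff))
```

The cutoff to use here is the full range over which the potential looks for neighbors, which for a many-layer potential can be considerably larger than the nearest-neighbor distance.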

Xiaowang

_Ward_Donald · December 16, 2019, 1:01pm

Xiaowang is correct, and I may have jumped to my answer a bit. However, if you get down to 12 atoms per processor with an FCC structure, which has 12 nearest neighbors, I am not sure the code would give accurate answers. If you partition the space that finely, the cutoffs defined cannot be large enough to generate the appropriate neighbor lists from neighboring cells. I am not sure how the code will actually handle that situation; I haven’t really thought about it. Xiaowang is right, the code should run but would generate inaccurate results. We should be able to put in a check so that an error is generated if the cell is partitioned to a dimension smaller than the necessary cutoff distance. I am not sure why the larger systems would fail. This hasn’t been something that has popped up before.
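
The kind of check proposed here could look roughly like the following Python sketch. It is an assumption-laden illustration rather than anything in the LAMMPS source: it assumes an orthogonal box split into a uniform processor grid, the function names are hypothetical, and the box lengths are made up because the data file is not shown in this thread. The cutoff of 15 is taken from the `comm_modify cutoff 15` line in the input above.

```python
def subdomain_lengths(box_lengths, proc_grid):
    """Edge lengths of one processor's subdomain for a uniform grid."""
    return tuple(length / nprocs for length, nprocs in zip(box_lengths, proc_grid))


def axes_below_cutoff(box_lengths, proc_grid, cutoff):
    """Indices of axes where the subdomain edge is shorter than `cutoff`."""
    sides = subdomain_lengths(box_lengths, proc_grid)
    return [axis for axis, side in enumerate(sides) if side < cutoff]


if __name__ == "__main__":
    box = (12.0, 12.0, 24.0)  # hypothetical box lengths in Angstrom
    grid = (2, 2, 3)          # one possible layout of 12 MPI tasks
    cutoff = 15.0             # from "comm_modify cutoff 15" in the input
    print("subdomain:", subdomain_lengths(box, grid))
    print("axes shorter than cutoff:", axes_below_cutoff(box, grid, cutoff))
```

Whether a subdomain smaller than the cutoff should be a hard error or only a warning is a design decision for the pair style; the sketch only shows how the condition itself can be detected.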

As for the second issue, I haven’t used LAMMPS in some time. I will look to see if this is an easy fix.

Don

_Wang_Jiaqi · December 16, 2019, 2:08pm

It seems the problem is centered on the neighbor lists of the atoms. In my previous simulations I used the default neighbor and neigh_modify settings. I rechecked the documentation of neigh_modify (https://lammps.sandia.gov/doc/neigh_modify.html) and found an interesting note that LAMMPS can crash due to the neighbor settings; since BOP generates multiple layers of neighbors for each atom, this could be the cause. I am now testing the “page” (number of pairs stored in a single neighbor page) and “one” (maximum number of neighbors of one atom) settings of the neigh_modify command to see how they affect my simulations. I will update soon!

Thanks all and have a great day!
Jiaqi

_Zhou_Xiaowang · December 16, 2019, 3:15pm

Just to make sure: BOP differs from other pair styles in that one also has to specify neigh_comm. You did that, right?

_Zhou_Xiaowang · December 16, 2019, 3:17pm

We never have to do that though.
Xiaowang

sjplimp · January 3, 2020, 8:55pm

I’m late to this discussion but two comments.

a) The neighbor settings for one and page should be irrelevant to this problem;
the defaults are surely adequate.

b) For any pair style in LAMMPS, you should get identical answers (modulo numerical round-off)
on a given problem, no matter how many procs (MPI tasks) you use.
So it should not matter for BOP (or any other pair style) if you have only 12 atoms/core.
Of course it may not be efficient to run that way, but the answers should be correct.
BOP defines the cutoff it needs in order to have enough neighbors for its computation.
LAMMPS ensures there are ghost atoms available out to that distance even if
there are few (or even zero) atoms on a particular processor.
This is also true even if the overall system is tiny, e.g. 12 atoms total.
You should still get identical answers no matter how many procs you use.
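
The ghost-atom argument in (b) can be illustrated with a toy model. The Python sketch below is not LAMMPS code and does not use BOP: it assumes a 1D periodic chain, a made-up pair potential, a cutoff smaller than half the box, and hypothetical function names. Each "processor" owns the atoms in its slice and sees ghost copies of all other atoms (including periodic images) within the cutoff of that slice, so the summed energy matches the single-processor result up to round-off, even when some slices own no atoms.

```python
import math


def pair_energy(r):
    """Toy pair potential (hypothetical, not BOP)."""
    return math.exp(-r)


def serial_energy(positions, box, cutoff):
    """Single-processor reference using the minimum-image convention."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = abs(positions[i] - positions[j])
            d = min(d, box - d)  # nearest periodic image in 1D
            if d < cutoff:
                total += pair_energy(d)
    return total


def decomposed_energy(positions, box, cutoff, nprocs):
    """Sum of per-processor energies using owned atoms plus ghosts."""
    width = box / nprocs
    owners = [min(int(x / width), nprocs - 1) for x in positions]
    total = 0.0
    for p in range(nprocs):
        lo, hi = p * width, (p + 1) * width
        owned = [x for x, o in zip(positions, owners) if o == p]
        # Ghosts: copies of atoms owned elsewhere, or periodic images,
        # that lie within `cutoff` of this processor's slice.
        ghosts = []
        for x, o in zip(positions, owners):
            for shift in (-box, 0.0, box):
                is_own_atom = shift == 0.0 and o == p
                y = x + shift
                if not is_own_atom and lo - cutoff <= y < hi + cutoff:
                    ghosts.append(y)
        for i, xi in enumerate(owned):
            for xj in owned[i + 1:]:  # owned-owned pairs: counted once
                if abs(xi - xj) < cutoff:
                    total += pair_energy(abs(xi - xj))
            for xg in ghosts:  # owned-ghost pairs: seen from both sides
                if abs(xi - xg) < cutoff:
                    total += 0.5 * pair_energy(abs(xi - xg))
    return total


if __name__ == "__main__":
    positions = [0.3, 1.1, 2.7, 4.2, 5.9, 7.4, 8.8, 10.5, 11.6]
    box, cutoff = 12.0, 3.3
    reference = serial_energy(positions, box, cutoff)
    for nprocs in (1, 2, 4, 12):
        energy = decomposed_energy(positions, box, cutoff, nprocs)
        print(nprocs, math.isclose(energy, reference, rel_tol=1e-12))
```

With 12 slices of width 1.0, several slices own no atoms and the cutoff spans more than three slices, yet the total is unchanged. That is the behavior described above: the decomposition affects efficiency, not the answer.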

If it is not working that way, there is likely a bug someplace, either in pair BOP

Steve

_Wang_Jiaqi · January 3, 2020, 9:39pm

Hi Steve,

Thanks a lot for your comments, which make things clearer.

Just a small update from my side: with the updated version of the BOP potential file, my simulations have been running well!

Thanks again for all your effort and have a good weekend!

Best,
Jiaqi

_Zhou_Xiaowang · January 3, 2020, 10:02pm

Thanks, Jiaqi, for the update. We will send a new potential file for a future LAMMPS release. Not sure why the existing file is truncated.
Xiaowang