Parallelization problem with reax/reaxc pair style

Hi LAMMPS users,

I recently ran into problems using LAMMPS with the reax and reax/c pair styles for parallel simulations. Any help is greatly appreciated!

I am trying to simulate CNT growth on a Ni nanocatalyst. Below is the input file. (The current version is for reax/c; if you substitute the corresponding lines with their commented counterparts, it becomes the input for the reax pair style.)

# CNT growth on Ni nanocatalyst

units real
atom_style charge
boundary p p p
pair_style reax/c NULL safezone 16.0 mincap 1000
#pair_style reax

read_data ni_ann.lammps
lattice fcc 1.0
region sim block -30. 30. -30. 30. -30. 30. units box
region dep1 block -30. 30. -30. 30. 20. 30. units box
region dep2 block -30. 30. -30. 30. -30. -20. units box
region dep3 block 20. 30. -30. 30. -30. 30. units box
region dep4 block -30. -20. -30. 30. -30. 30. units box
region dep5 block -30. 30. -30. -20. -30. 30. units box
region dep6 block -30. 30. 20. 30. -30. 30. units box
region deps union 6 dep1 dep2 dep3 dep4 dep5 dep6

group puf id > 126

pair_coeff * * ffield.reax.NiCH Ni C
#pair_coeff * * ffield.reax 4 1

neighbor 2 bin
neigh_modify every 10 check no

fix 1 all qeq/reax 1 0.0 10.0 1.0e-6 reax/c

# (the above fix is commented out when using the reax pair style)

fix ins puf deposit 200 2 5 12345 region deps
delete_atoms overlap 1.0 puf puf
fix trmout all nve
fix tom all langevin 1500. 1500.0 100 248158
thermo_modify lost ignore flush yes
compute_modify thermo_temp dynamic yes

timestep 0.2
thermo 10000
dump 2 all xyz 10000 imp.xyz
dump 3 all custom 10000 imp.dat id type q x y z vx vy vz

run 1000000
undump 2
undump 3

Problem with the reax/c pair style:

Every time I run the input on one node (16 cores) of a cluster, it works well. But when I try to run on multiple nodes (even just 2), I get one of the following two errors within about 10 seconds. The problem is not machine-specific: I tried it on computing clusters at several sites, and the behavior is the same.
1) Segmentation fault (signal 11)
or
2) not enough space for bonds! total=… allocated=… (the former is several orders of magnitude larger than the latter)

I even tried to run the reax examples from the tarball; they all failed when using multiple nodes, again with a segmentation fault.
I also tried increasing the safezone and mincap parameters, but that did not help.
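For example, one of the variants I tried looked roughly like this (the exact numbers are only meant to illustrate going beyond what is already in the script above):

# same keywords as above, just with larger (illustrative) preallocation values
pair_style reax/c NULL safezone 32.0 mincap 2000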

Problem with the reax pair style:
I encountered the error “Too many bonds on atom. Increase MBONDDEF” even on one node. So I changed MBONDDEF from 20 to 40 and recompiled LAMMPS. That worked, but two new problems appeared:

  1. The benchmark speed on multiple nodes is the same as, or even worse than, on one node.
  2. The simulation trajectory makes no physical sense (no carbon hexagons or pentagons form; at least with reax/c on one node I can find carbon rings on the Ni after some time).

On my school cluster I used the 10Aug15 version of LAMMPS, compiled with mpiicpc (the reax library compiled with mpiifort). I don’t know the LAMMPS environment on the other clusters, but the problems are the same everywhere.
My data file is also attached. The ReaxFF parameter file was obtained from Adri van Duin, and I am not sure it is appropriate to post it here directly.

Thank you so much!

Best regards,
Longtao Han

PhD student

Dept. of Materials Science and Engineering
Institute for Advanced Computational Science
Stony Brook University

ni_ann.lammps (8.88 KB)

These errors usually indicate that you have a bad structure, e.g., overlapping atoms, atoms exceptionally close to each other, and/or bad boundary conditions. That it only shows up on multiple nodes may be due to the number of MPI processes (local and ghost atoms change with the number of MPI processes). Visualize your system, and perhaps run a minimization first.
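For example, something along these lines before the run command (the tolerances and iteration counts are only illustrative):

# relax bad contacts before starting dynamics
min_style cg
minimize 1.0e-6 1.0e-8 1000 10000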

Ray

Thank you so much for your advice, Ray.

I took your advice and ran a minimization first, and I ran into a similar memory allocation problem:

step26-bondchk failed: i=0 end(i)=0 str(i+1)=-1324298520

ERROR: failed to allocate 162793593504 bytes for array list:three_bodies
application called MPI_Abort(MPI_COMM_WORLD, -14) - process 18

Then I searched the mailing list and found a case similar to mine:

http://ehc.ac/p/lammps/mailman/message/27112041/

where the poster could also run his script on 1 node but not on multiple nodes. However, I think the fix mentioned there should already be included in the current version of LAMMPS.

Regarding the structure, I always visualize the simulation after each run, and I measured the Ni-Ni bond lengths in the nanocatalyst (the densest part of the system); the bond lengths are normal. Then, to rule out a bad structure on my side, I did more trials with the examples in the LAMMPS tarball (as I had done before).
I increased the run length to 100000 steps for all 7 examples and performed MPI runs. The runs of 6 examples (all except AuO) were terminated by a segmentation fault (signal 11). I visualized all 7 simulations and noticed that only AuO starts from a crystalline structure, and during its simulation the atoms only oscillate slightly around their positions. In the other examples the atoms move over large distances (but without overlapping or getting too close to each other).

So could this possibly be a problem with the MPI environment or the compilation of LAMMPS? Or are there precautions for simulating chemical reactions that I am not aware of?

Thank you for any advice!

Best,
Longtao

Thank you so much for your advice, Ray.

I took your advice and ran a minimization first, and I ran into a similar memory allocation problem:

step26-bondchk failed: i=0 end(i)=0 str(i+1)=-1324298520

That is a very weird number (large and negative), and it certainly indicates something wrong in your setup, structure, or code. How many MPI processes did you use? What version of LAMMPS? What is the maximum number of MPI processes with which you can run this without an error? What is your system size?

ERROR: failed to allocate 162793593504 bytes for array list:three_bodies
application called MPI_Abort(MPI_COMM_WORLD, -14) - process 18

That is one very large allocation. Again, your structure, setup, or compiled executable is problematic.

Then I searched the mailing list and found a case similar to mine:

http://ehc.ac/p/lammps/mailman/message/27112041/

where the poster could also run his script on 1 node but not on multiple nodes. However, I think the fix mentioned there should already be included in the current version of LAMMPS.

Regarding the structure, I always visualize the simulation after each run, and I measured the Ni-Ni bond lengths in the nanocatalyst (the densest part of the system); the bond lengths are normal. Then, to rule out a bad structure on my side, I did more trials with the examples in the LAMMPS tarball (as I had done before).
I increased the run length to 100000 steps for all 7 examples and performed MPI runs. The runs of 6 examples (all except AuO) were terminated by a segmentation fault (signal 11). I visualized all 7 simulations and noticed that only AuO starts from a crystalline structure, and during its simulation the atoms only oscillate slightly around their positions. In the other examples the atoms move over large distances (but without overlapping or getting too close to each other).

How many MPI processes? Note that in these examples molecules and atoms by no means can’t have large displacements. Therefore this test of yours is not helpful.

So could this possibly be a problem with the MPI environment or the compilation of LAMMPS? Or are there precautions for simulating chemical reactions that I am not aware of?

Hard to say. Please think about my questions and also update your LAMMPS version.

Ray

in addition to ray’s suggestions, please also look into your neighborlist and thermo settings.
using:

neigh_modify every 10 check no

is a somewhat aggressive setting. how about:

neigh_modify every 1 check yes

also, “thermo_modify lost ignore” is extremely worrisome. when you have to ignore lost atoms to complete a simulation in which you do not really expect atoms to leave the simulation cell, that is an indicator of bad simulation parameters. please keep in mind that with a different number of processors in use, you have a different geometry of your subdomains and thus the possibility of triggering a problem that would otherwise, by chance, not get triggered.

the fact that you turn off any indication of such problems (you cannot see dangerous builds or lost atoms) lets the simulation continue in a bad state until things go awfully wrong.
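in other words, something like this (“lost error” is the default behavior anyway; it is spelled out here only to make the point explicit):

neigh_modify every 1 check yes
thermo_modify lost error flush yes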

axel.

Hi Ray and Axel,

Thank you so much for your suggestions.

Regarding Ray’s questions:

I paste my input script below and attach the data file (it’s a system of 126 atoms, with 200 atoms deposited during the run).
I also tested the script on Titan at ORNL (Cray XK7, LAMMPS version 15May15). Still a segmentation fault: Application 9513958 exit codes: 139 (sorry, I don’t know how to get a more detailed error log on Titan).
The maximum number of MPI processes I can run without any error is 24; moving to 32 processes, it fails.

About the examples, I don’t quite understand “by no means can’t have large displacements”. Did you mean that the molecules and atoms in those examples will definitely have large displacements?
Anyway, I did the same tests of the examples on Titan. With 8 MPI processes, the AB, CHO, FeOH3, RDX, VOH, and ZnOH2 examples failed as before; AuO still runs without problems. With 4 MPI processes I only tried CHO; it failed again, just after a longer run than with 8 processes.

Regarding Axel’s questions:

I modified the script to “neigh_modify every 1 check yes”, but still got the segmentation fault.
I put “thermo_modify lost ignore” there because I use “fix deposit” to introduce atoms into the surroundings of the original atoms, and I want to avoid newly deposited atoms getting too close to previously deposited ones. That is why I also used “delete_atoms overlap” to remove overlapping deposited atoms. In fact, the simulation seems to stop before the first atom is even deposited when using many MPI processes. With 24 or fewer MPI processes, it can run for a day without problems.
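A sketch of an alternative I have not tried yet would be to enforce the minimum distance already at insertion time, using the near keyword of fix deposit (the 1.0 Angstrom value here just mirrors the delete_atoms cutoff):

fix ins puf deposit 200 2 5 12345 region deps near 1.0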

Thank you so much!

Best,
Longtao

ni2c_ann.lammps (9.88 KB)

Hi Ray and Axel,

Thank you so much for your suggestions.

Regarding Ray's questions:

I paste my input script below and attach the data file (it's a system of *126 atoms with 200 atoms deposited* during the run).
I also tested the script on Titan at ORNL (Cray XK7, LAMMPS version 15May15). Still a segmentation fault: *Application 9513958 exit codes: 139* (sorry, I don't know how to get a more detailed error log on Titan).

am i getting this right, that your *entire* system has only 126 atoms and then you try to deposit 200 atoms?
to how many processors are you planning to scale this? is it actually getting significantly faster when increasing the number of processors?

Hi Axel,

Yes, your understanding is right for this script.
I am not very familiar with the parallel performance of ReaxFF, so I don’t actually know how many processors are needed. Does this force field only work well with a small number of MPI processes?
This is a first-step simulation. Later I actually wish to simulate a system with thousands of depositing atoms and an inert gas in the environment. The simulation time will also be longer than 100 ns (5e8 steps).

Best,
Longtao

Hi Axel,

Yes, your understanding is right for this script.
I am not very familiar with the parallel performance of ReaxFF, so I don't actually know how many processors are needed. Does this force field only work well with a small number of MPI processes?

how well a code parallelises, and where the limit is, depends on several factors:
- the parallelization strategy
- the cost of communication
- the load (im)balance
- the amount of computation per parallel unit
with a small number of atoms, chances are high that you have no gain from using more processors. also, some codes have problems when the work distribution fluctuates a lot.
when using a "cheap" potential like lennard-jones, the limit of scaling is often reached with a few hundred atoms per MPI rank, assuming perfect load distribution (i.e. a homogeneous dense system). with more costly potentials, this number can drop.
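as a very rough illustration with the numbers from this thread: the whole system is 126 + 200 = 326 atoms, so "a few hundred atoms per rank" is already about the size of the entire system; even if an expensive potential justified going down to a few tens of atoms per rank, that would still point to something on the order of ten ranks, not multiple nodes.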

on the other hand, your system description indicates a very inhomogeneous particle distribution, which will lead to load imbalance; using more processors may then only add communication overhead without improving the parallel efficiency, and thus make your simulation slower.

the only way to know for certain is to make strong-scaling benchmarks.
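a sketch of how to do that: take your input, cut the run down to a short fixed length, and run the identical file with several different MPI rank counts (e.g. 8, 16, 32 via mpirun or whatever launcher your cluster uses), then compare the "Loop time" summary at the end of each log. the benchmark portion of the input can be as simple as (numbers are only examples):

thermo 100
run 10000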

in addition, the deposition and the expected assembly will put a strain on
the memory management strategy of the code in the USER-REAXC package.

This is a first-step simulation. Later I actually wish to simulate a system with thousands of depositing atoms and an inert gas in the environment. The simulation time will also be longer than 100 ns (5e8 steps).

running for 100 ns seems like a very optimistic goal considering that you have not made any benchmarks.

axel.

Hi Axel,

Yes, your understanding is right for this script.
I am not very familiar with the parallel performance of ReaxFF, so I don't actually know how many processors are needed. Does this force field only work well with a small number of MPI processes?

how well a code parallelises, and where the limit is, depends on several factors:
- the parallelization strategy
- the cost of communication
- the load (im)balance
- the amount of computation per parallel unit

with a small number of atoms, chances are high that you have no gain from using more processors. also, some codes have problems when the work distribution fluctuates a lot.
when using a "cheap" potential like lennard-jones, the limit of scaling is often reached with a few hundred atoms per MPI rank, assuming perfect load distribution (i.e. a homogeneous dense system). with more costly potentials, this number can drop.

on the other hand, your system description indicates a very inhomogeneous particle distribution, which will lead to load imbalance; using more processors may then only add communication overhead without improving the parallel efficiency, and thus make your simulation slower.

Thank you so much for your detailed explanation; it teaches me a lot!
Does this mean that if my script works with more MPI processes, there is a higher chance I can speed up the simulation by putting in more atoms and thus making the system more homogeneous?

the only way to know for certain is to make strong-scaling benchmarks.

in addition, the deposition and the expected assembly will put a strain on
the memory management strategy of the code in the USER-REAXC package.

This is a first-step simulation. Later I actually wish to simulate a system with thousands of depositing atoms and an inert gas in the environment. The simulation time will also be longer than 100 ns (5e8 steps).

running for 100 ns seems like a very optimistic goal considering that you have not made any benchmarks.

With the current system size the script can run about 6e6 steps in one day on 16 MPI processes (1 node), so I am expecting it to run in parallel.
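(Though I do see your point: at that rate, 5e8 steps / 6e6 steps per day is still roughly 80+ days on a single node, so the parallel speedup really matters.)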

axel.

Best,

Longtao

what you are asking about is something that you should have been instructed about before running parallel calculations on a parallel machine. doing proper benchmarking and performance evaluation is part of the standard workflow of any responsible user, so that you don’t waste resources and your own time. the LAMMPS-specific parts are outlined in the manual, but it will be pretty useless to study that before you have a decent grasp of strong vs. weak scaling and amdahl’s law, and have made some suitable benchmarks to evaluate how a specific potential performs. there is quite a bit of material on this available online, and there is next to nothing on this topic that is as instructive as discussing it with experienced senior colleagues and advisers.

axel.

Sure. Thank you for teaching me a lot; I do understand the big gap I have in parallel computing. It really was not appropriate to raise such a simple question here.

Thank you again!

Best,
Longtao