Reax/c performance

Dear all, I've profiled two examples available in the distribution for reactive force field calculations, namely the AB and Au_O cases under the examples/reax directory.
I've slightly modified the two input scripts: in the file in.AB I changed run 3000 to run 30000, and in the file in.AuO I added the command replicate 2 2 2.
The AB (ammonia borane) system has 104 atoms, while the AuO example has 7680 atoms.
Here are the profiling data for the two systems; times are expressed in seconds and are the sum of the time spent by each processor.

*AB*
#cores  write_reax_lists  get_distance  Compute_Forces  Total
1       451               200           681             1184
2       563               244           984             1718
4       738               316           1484            2809

*AuO*
#cores  write_reax_lists  get_distance  Compute_Forces  Total
1       425               184           237             718
2       478               208           282             825
4       548               238           348             976

What does the routine write_reax_lists do? It seems terribly expensive, and as a percentage it seems to take more of the time as the system size increases.
Do you expect this kind of behavior?
The method get_distance evaluates the distance between two particles, and it takes almost half of the time of the write_reax_lists method.
As the system becomes bigger, write_reax_lists becomes more expensive than Compute_Forces, and this system is only 7K atoms.
If I wanted to run a system with a few hundred thousand atoms, this method would become a serious bottleneck.

I can put some effort into optimizing the code if you think that is worthwhile.

Many thanks

Nicola

I don't know the innards of the Reax/C code. Maybe Ray
can comment. You can also contact the author Metin directly
if you like. Both are CC'd. You're welcome to optimize if you want, but the
Purdue group did a lot of nice optimization work already.

Running a 100-atom system in parallel is pretty hopeless
for parallel scaling.

On the 7000-atom system you are showing good speed-up,
something like 75% efficient. We've benchmarked Reax/C
out to 32M atoms on 1024 cores and it scales nicely.
See our recent MRS Bulletin paper, on the citations
page of the WWW site.

Steve

Hi Nicola,

This is a very good analysis, thanks. Reax/C builds and uses its own
neighbor lists, which is essentially what write_reax_lists does.

I believe the reason that the AuO example spends a larger fraction of
cpu time inside the write_reax_lists function is the difference in
structures or force fields. If you examine the original examples you will
find that AuO has 384 neighbors per atom, while AB has only 27. This could
be due to a denser AuO structure or to longer cutoffs in ffield.reax.AuO.
Therefore I would not worry about this.
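As a rough back-of-the-envelope check (assuming those counts refer to all
atoms within a ~10 A neighbor cutoff), 384 neighbors per atom corresponds to
a number density of about 384 / ((4/3)*pi*10^3) ~ 0.09 atoms/A^3, whereas 27
neighbors corresponds to only ~0.006 atoms/A^3, i.e. more than an order of
magnitude difference in packing.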

Best,
Ray

Dear Nicola,

Thanks for sharing your findings. The USER-REAXC package is essentially the
LAMMPS integration of the PuReMD code. We presented a detailed
scalability analysis of the PuReMD code in the following paper:

http://dl.acm.org/citation.cfm?id=2170089
"Parallel reactive molecular dynamics: Numerical methods and algorithmic
techniques", Parallel Computing, Volume 38 Issue 4-5, April, 2012, Pages
245-259

As far as I can tell, the main factor that hinders the scalability of
PuReMD (and therefore USER-REAXC) is the QEq part where we solve a large
linear system using the CG algorithm at each timestep.
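Just to make that cost concrete, below is a minimal, self-contained sketch of
a plain conjugate-gradient solve (only an illustration with a toy dense
matrix, not the USER-REAXC QEq implementation): each iteration needs one
matrix-vector product plus a few dot products, and in an MPI code every dot
product becomes a global reduction, which is what ultimately limits
scalability.

  #include <cmath>
  #include <cstdio>
  #include <vector>

  typedef std::vector<double> Vec;

  /* toy dense mat-vec; the real QEq coefficient matrix is sparse and distributed */
  static Vec matvec( const std::vector<Vec> &A, const Vec &x ) {
    Vec y( x.size(), 0.0 );
    for( size_t i = 0; i < A.size(); ++i )
      for( size_t j = 0; j < x.size(); ++j )
        y[i] += A[i][j] * x[j];
    return y;
  }

  /* in parallel this would be a local sum followed by an MPI_Allreduce */
  static double dot( const Vec &a, const Vec &b ) {
    double s = 0.0;
    for( size_t i = 0; i < a.size(); ++i ) s += a[i] * b[i];
    return s;
  }

  /* solve A*x = b with conjugate gradients, starting from x = 0 */
  static Vec cg( const std::vector<Vec> &A, const Vec &b, double tol, int maxit ) {
    Vec x( b.size(), 0.0 ), r = b, p = r;
    double rr = dot( r, r );
    for( int it = 0; it < maxit && std::sqrt( rr ) > tol; ++it ) {
      Vec Ap = matvec( A, p );
      double alpha = rr / dot( p, Ap );
      for( size_t i = 0; i < x.size(); ++i ) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
      double rr_new = dot( r, r );
      for( size_t i = 0; i < p.size(); ++i ) p[i] = r[i] + (rr_new / rr) * p[i];
      rr = rr_new;
    }
    return x;
  }

  int main() {
    std::vector<Vec> A = { {4.0, 1.0, 0.0}, {1.0, 3.0, 1.0}, {0.0, 1.0, 2.0} };
    Vec b = { 1.0, 2.0, 3.0 };
    Vec x = cg( A, b, 1e-10, 100 );
    std::printf( "x = %g %g %g\n", x[0], x[1], x[2] );
    return 0;
  }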

As Ray pointed out, write_reax_lists simply builds the USER-REAXC
neighbor list from that given by LAMMPS. The inefficiency of
write_reax_lists is something that I was well aware of, but I never got
a chance to look at it in depth.

I suspect that the source of the inefficiency is the method that we use
to find the neighbors of ghost atoms. At the time USER-REAXC was
completed, LAMMPS did not have an option for computing the neighbors of
ghost atoms - which is a must for ReaxFF to compute bonds. Therefore we
were using a work-around.

I believe Steve added this functionality later on. Since graduating from
Purdue I have barely been able to find time to work on the USER-REAXC code, so
I did not have a chance to go over that part again. If you are able to
invest some time into this, I would be more than happy to provide you with
some guidance.

Thanks,
Metin

I think Steve has a valid point here. Using the scalability results from
a 104-atom system to guess how USER-REAXC would perform on thousands of
atoms can be very misleading.

My rule of thumb for obtaining good parallel efficiency with USER-REAXC is
to assign subdomains with dimensions of at least 10A by 10A by 10A.
Here I assume that the non-bonded interaction cutoff is 10A, which is
the case for most ReaxFF force fields.
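For example, a 60 A x 60 A x 60 A box split over 6 x 6 x 6 = 216 cores gives
exactly 10 A subdomains; pushing the same box onto 8 x 8 x 8 = 512 cores
shrinks the subdomains to 7.5 A, smaller than the cutoff, and the ghost-atom
and communication overhead is then likely to start dominating.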

Also as Ray pointed out, the performance of USER-REAXC will strongly
depend on how densely your atoms are packed. AuO seems to be rather
densely packed.

Thanks,
Metin

Dear all, many thanks for all your comments. My intention was not to provide a detailed analysis of ReaxFF scalability.
I work for IVEC, and I have to support Julian Gale's group in MD simulations of biominerals.
A reactive force field is a good candidate for that kind of simulation, and I have some time to spend on optimising the code.
In their simulations they have found that reax/c is about 20x slower than the same system (in terms of number of atoms) run with
pair_style hybrid/overlay coul/long 9.0 lj/cut 9.0. Do you have any similar analysis?
So I tried to analyze the performance of reax/c and found that the routine write_reax_lists is incredibly expensive. That's
why I wrote the email below. The systems analyzed are quite small, but the data below show that the routine write_reax_lists
becomes more time consuming (as a percentage) as the sample size increases.
I've also tried an example of around 30K atoms, and in that case the routine takes about 75% of the total time.
Maybe the situation is different with 1M atoms on 1K cores. Maybe we can try that as well...
However, it is probably worth working on this routine. If you can give me some support and/or guidance, I am more than happy
to spend some effort.

All the best

N.

Hi Nicola,

I think it is very reasonable that reax/c is ~20x slower than
hybrid/overlay coul/long lj/cut, since the former includes additional
3-body and 4-body terms, non-bonded long-range interactions (lj and coul
with 10.0 A cutoffs), a complicated bond-order term, and charge equilibration.
The benchmark webpage Steve mentioned previously has very detailed
test results: http://lammps.sandia.gov/bench.html

write_reax_lists does two things: 1) it builds the reax/c neighbor list from
the LAMMPS neighbor list (half, newton off), and 2) it calculates and stores
the distances between neighbors, which are then used by all the force
calculations. This way, when calculating forces, bond distances can be
accessed directly without additional calculations. This is one of the
reasons that write_reax_lists consumes such a large fraction of the cpu
time and calculate_force seems so efficient.
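To make the "compute the distance once, reuse it everywhere" idea concrete,
here is a small self-contained toy (not the actual USER-REAXC code; the
FarNbr structure and the geometry are made up for illustration) that caches
every distance while building the list and then only reads the cached values
in the force-style pass:

  #include <cmath>
  #include <cstdio>
  #include <vector>

  struct FarNbr {          /* one stored neighbor entry */
    int j;                 /* neighbor index */
    double d;              /* precomputed distance */
    double dvec[3];        /* precomputed distance vector */
  };

  int main() {
    const int natoms = 3;
    double x[natoms][3] = { {0, 0, 0}, {1, 0, 0}, {0, 2, 0} };
    const double cutoff = 3.0;

    /* "write_reax_lists"-style pass: build the list and cache every distance */
    std::vector<std::vector<FarNbr> > nbrs( natoms );
    for( int i = 0; i < natoms; ++i )
      for( int j = 0; j < natoms; ++j ) {
        if( j == i ) continue;
        double dv[3] = { x[j][0]-x[i][0], x[j][1]-x[i][1], x[j][2]-x[i][2] };
        double d = std::sqrt( dv[0]*dv[0] + dv[1]*dv[1] + dv[2]*dv[2] );
        if( d <= cutoff ) nbrs[i].push_back( {j, d, {dv[0], dv[1], dv[2]}} );
      }

    /* "Compute_Forces"-style pass: every term just reads the cached distance,
       so no sqrt() is repeated here */
    double e = 0.0;
    for( int i = 0; i < natoms; ++i )
      for( size_t pj = 0; pj < nbrs[i].size(); ++pj )
        e += 1.0 / nbrs[i][pj].d;   /* stand-in for a real pair term */
    std::printf( "toy energy = %g\n", e );
    return 0;
  }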

write_reax_lists scales as N_local * n * n, where N_local and n are the
number of local atoms per cpu and the number of neighbors per atom,
respectively. However, write_reax_lists can be modified to use the
recently implemented "ghost atom neighbor list", which scales as
N_total * n, where N_total is the sum of the numbers of local and ghost
atoms per cpu. Which one is superior actually depends on various
conditions such as the number of cores, the force field, the material
structure and the system size. The modification could require some work
and effort.
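As a rough illustration with the single-core AuO numbers from earlier in
this thread (N_local = 7680 and n ~ 384), N_local * n * n is on the order of
10^9 operations per list build, whereas N_total * n would be on the order of
10^7 even if the ghost atoms make N_total a few times larger than N_local.
So there is potentially a lot to gain in that regime, although, as said
above, the balance shifts with the number of cores and the structure.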

Best,
Ray

Dear Ray,

thanks very much for your comments. To stay focused on the Au_O example I did some deeper profiling and subdivided the
method write_reax_lists into three parts:

*First*
    /* for atom i: compute and cache the distance to every LAMMPS neighbor j,
       and add j to the far-neighbor list if it lies within the non-bonded cutoff */
    for( itr_j = 0; itr_j < numneigh[i]; ++itr_j ){
       j = jlist[itr_j];
       j &= NEIGHMASK;
       get_distance( x[j], x[i], &d_sqr, &dvec );
       dist[j] = sqrt( d_sqr );

       if( dist[j] <= control->nonb_cut ){
         set_far_nbr( &far_list[num_nbrs], j, dist[j], dvec );
         ++num_nbrs;
       }
     }

*Second*
     /* for each unmarked ghost neighbor j of i that is close enough, build
        j's own far-neighbor list from the other ghost neighbors g of i that
        lie within the bond cutoff */
     for( itr_j = 0; itr_j < numneigh[i]; ++itr_j ){
       j = jlist[itr_j];
       j &= NEIGHMASK;

       if( j >= nlocal && !marked[j] &&
           dist[j] <= (control->vlist_cut - control->bond_cut) ){
         marked[j] = 1;
         Set_Start_Index( j, num_nbrs, far_nbrs );

         for( itr_g = 0; itr_g < numneigh[i]; ++itr_g ){
           g = jlist[itr_g];
           g &= NEIGHMASK;

           if( g >= nlocal && !marked[g] ){
             get_distance( x[g], x[j], &g_d_sqr, &g_dvec );

             if( g_d_sqr <= SQR(control->bond_cut) ){
               g_d = sqrt( g_d_sqr );

               set_far_nbr( &far_list[num_nbrs], g, g_d, g_dvec );
               ++num_nbrs;
             }
           }
         }
         Set_End_Index( j, num_nbrs, far_nbrs );
       }
     }

*Third*
   /* any atoms still unmarked (mostly ghosts): brute-force scan against every
      later unmarked atom for bond-cutoff neighbors - quadratic in the total
      (local + ghost) atom count system->N */
   for( i = 0; i < system->N; ++i )
     if( !marked[i] ) {
       marked[i] = 1;
       Set_Start_Index( i, num_nbrs, far_nbrs );

       for( j = i+1; j < system->N; ++j )
         if( !marked[j] ) {
           get_distance( x[j], x[i], &d_sqr, &dvec );
           if( d_sqr <= SQR(control->bond_cut) ) {
             set_far_nbr( &far_list[num_nbrs], j, sqrt(d_sqr), dvec );
             ++num_nbrs;
           }
         }

       Set_End_Index( i, num_nbrs, far_nbrs );
     }

The total time of execution is 720 seconds, while the first part takes 45 seconds, the second 11 and the third 357.

So the third part takes about *50%* of the whole calculation.
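If I read the third loop correctly, it checks every pair of not-yet-marked
atoms (essentially the ghost atoms), so its cost grows roughly with the
square of the ghost-atom count rather than with the number of neighbors per
atom, which seems consistent with it dominating the profile.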

N.

Hi Nicola,

I have modified reax/c to be able to use the ghost neighbor list, and for
the AuO example I obtained a 6-19% overall speedup, depending on system size
and number of cores. For some other test cases I generally obtain an
overall 10-15% improvement in efficiency.

Please test this and let me know if you do get similar speedup, thanks.

Best,
Ray Shan

pair_reax_c.cpp (19.7 KB)

fix_qeq_reax.cpp (22.1 KB)

neighbor.cpp (61.4 KB)

neigh_full.cpp (19 KB)

neighbor.h (15.1 KB)

Dear Shan, I've tried it on a couple of systems and the results are quite good. The comparison was done on a single core.
Below, "ghost" stands for the ghost neighbor list method, while "old" stands for the old write_reax_lists method.

*CaCO3* ( 1561 atoms )
Method  Total_Time  Write_reax_lists  Compute_Forces
ghost   126         62                39
old     226         173               34

*Au_O* ( 7680 atoms )
Method  Total_Time  Write_reax_lists  Compute_Forces
ghost   413         103               254
old     715         424               234

In both cases half of the time of Write_reax_lists is spent in the method set_far_nbr.

Many thanks for your improvements.

Nicola

I've just added Ray's performance improvements to pair reax/c
as a 25Jun patch.

Thanks,
Steve