[lammps-users] looking for LAMMPS test inputs for multi-core optimizations

hi everybody,

with the help of two undergraduate interns, i have been
working on implementing OpenMP parallelization on top of
the existing MPI parallelization in LAMMPS. we are now
looking for representative test inputs (not too small,
not too large). of particular interest are inputs for
potentials like AIREBO, yukawa, morse, and buckingham,
but others are highly welcome, too. you can find the
full list and access to the sources at:

http://sites.google.com/site/akohlmey/software/lammps-icms

please use at your own risk: we are still in the process
of testing, but you are welcome to run tests yourself and
give us feedback on your results.
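
to give an idea of the general approach, here is a stripped-down
illustration (this is not code from the actual patches, and all
names in it are made up): each MPI rank keeps its usual spatial
sub-domain, and the pair force loop over the atoms owned by that
rank is additionally split across OpenMP threads.

// minimal hybrid MPI+OpenMP sketch (illustration only, not LAMMPS code):
// MPI handles the domain decomposition as before, OpenMP threads split
// the per-rank pair loop.  compile with e.g.  mpicxx -fopenmp example.cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

struct Atom { double x, y, z; };

// lennard-jones forces/energy for the atoms owned by this rank, using a
// full neighbor list.  each thread writes only the force of "its own"
// i atoms, so no per-thread force reduction is needed in this simple
// variant (unlike a half/newton list, where j atoms are updated too).
double pair_lj(const std::vector<Atom>& pos, std::vector<Atom>& frc,
               const std::vector<std::vector<int>>& neigh,
               double eps, double sig, double rcut)
{
  const double rc2 = rcut * rcut;
  double epot = 0.0;
  #pragma omp parallel for reduction(+:epot) schedule(dynamic)
  for (int i = 0; i < (int) pos.size(); ++i) {
    double fx = 0.0, fy = 0.0, fz = 0.0;
    for (int j : neigh[i]) {
      const double dx = pos[i].x - pos[j].x;
      const double dy = pos[i].y - pos[j].y;
      const double dz = pos[i].z - pos[j].z;
      const double r2 = dx*dx + dy*dy + dz*dz;
      if (r2 >= rc2) continue;
      const double s2 = sig*sig / r2;
      const double s6 = s2*s2*s2;
      const double fpair = 24.0*eps*(2.0*s6*s6 - s6)/r2;
      fx += fpair*dx; fy += fpair*dy; fz += fpair*dz;
      epot += 0.5 * 4.0*eps*(s6*s6 - s6);  // 0.5: every pair is visited twice
    }
    frc[i].x = fx; frc[i].y = fy; frc[i].z = fz;
  }
  return epot;
}

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // two atoms "owned" by this rank, each listing the other as neighbor
  std::vector<Atom> pos = {{0.0, 0.0, 0.0}, {1.2, 0.0, 0.0}};
  std::vector<Atom> frc(pos.size());
  std::vector<std::vector<int>> neigh = {{1}, {0}};

  double elocal = pair_lj(pos, frc, neigh, 1.0, 1.0, 2.5);
  double etotal = 0.0;
  MPI_Allreduce(&elocal, &etotal, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  if (rank == 0) printf("potential energy: %g\n", etotal);
  MPI_Finalize();
  return 0;
}

in the real code the threading of course lives inside the pair
styles themselves and also needs to take care of per-thread force
accumulation when newton's third law is exploited.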

the first results so far are quite promising.

in fact, by converting the inner loops into templated functions
(a minimal sketch of this trick is appended at the end of this
message), we were able to speed up even regular all-MPI runs.
the benefits can be significant (the best so far was 3x),
particularly when running across many nodes with multi-core
processors. here is one example for a 12288 atom
stillinger-weber system, run on 3 nodes with 2x intel quad-core
E5430/2.66GHz processors (harpertown) and a (fast) DDR
infiniband interconnect:

Loop time of 10.5348 on 12 procs / 1 threads for 1000 steps with 12288 atoms
Loop time of 7.26549 on 24 procs / 1 threads for 1000 steps with 12288 atoms
Loop time of 6.60967 on 6 procs / 4 threads for 1000 steps with 12288 atoms
Loop time of 6.01909 on 12 procs / 2 threads for 1000 steps with 12288 atoms

going from 12 to 24 MPI tasks, the parallel efficiency drops to
45%, whereas the efficiency of the OpenMP parallelization (note
that only the pair style is threaded) is still at 75%. as a
consequence, there is over 15% speedup from using 12 MPI tasks
with two threads each instead of 24 MPI tasks with no threading.
this relative improvement should grow further when using more
nodes or a slower interconnect.
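
to spell out where those percentages come from, the loop times
above translate into:

10.5348 / 7.26549 = 1.45  -> doubling the MPI tasks buys only 45% extra speed
10.5348 / 6.01909 = 1.75  -> doubling via OpenMP threads buys 75% extra speed
 7.26549 / 6.01909 = 1.21  -> 12x2 hybrid is about 20% faster than 24x1 all-MPI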

for example, running the in.rhodo benchmark scaled to 864,000
atoms on a cray xt5 machine using 64 nodes, the corresponding
speedup was about 33%.
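
finally, for those curious about the templating trick mentioned
above, here is a minimal, self-contained illustration (again not
the actual LAMMPS source; the flag name and the lennard-jones
math are just placeholders). flags that stay constant during a
run, e.g. whether energies need to be tallied, become template
parameters, so the corresponding 'if' tests disappear from the
innermost loop and the compiler can optimize each instantiation
separately:

// illustration only: making a run-time flag a compile-time constant
#include <cstdio>

template <int EFLAG>   // 1 = also accumulate energy, 0 = forces only
double eval(int n, const double* r2, double* f, double eps, double sig)
{
  double e = 0.0;
  for (int i = 0; i < n; ++i) {
    const double s2 = sig*sig / r2[i];
    const double s6 = s2*s2*s2;
    f[i] = 24.0*eps*(2.0*s6*s6 - s6)/r2[i];
    if (EFLAG) e += 4.0*eps*(s6*s6 - s6);   // dead code when EFLAG == 0
  }
  return e;
}

// the flag is tested once per call to pick a fully specialized
// instantiation, instead of once per pair inside the hot loop
double compute(int eflag, int n, const double* r2, double* f)
{
  return eflag ? eval<1>(n, r2, f, 1.0, 1.0)
               : eval<0>(n, r2, f, 1.0, 1.0);
}

int main()
{
  double r2[3] = {1.0, 1.5, 2.0}, f[3];
  printf("energy: %g\n", compute(1, 3, r2, f));
  return 0;
}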

comments, suggestions, help are highly welcome.

thanks,
    axel.