Building/testing LAMMPS with Kokkos for Xeon Phi?

Hi,

I’m interested in trying out LAMMPS (29Aug2014) with the Kokkos functionality, targeting Xeon Phi. Does anyone have experience with appropriate Makefile settings to build LAMMPS in this case? For example, the instructions at http://lammps.sandia.gov/doc/Section_accelerate.html#acc_8 state:

Intel Xeon Phi:

cd lammps/src

make yes-kokkos

make g++ OMP=yes MIC=yes

which, if I understand correctly, would use the Makefile included in the source distribution at src/MAKE/Makefile.g++?

I’ve been trying to build LAMMPS with Kokkos for use on Xeon Phi using variations on src/MAKE/Makefile.intel (with Intel Compiler 14.0.3, Intel MPI 5.0.1.035, Intel MPSS 3.3). The build succeeds, but when I try to run, for example, the colloid example included with LAMMPS:

mpirun -hosts rx350-1-mic0 -n 4 -env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH …/src/lmp_intel_intelmpi -k on t 1 -sf kk < in.colloid > colloid_test.log 2>&1

(and I’ve tried different values for -n and -k on t, e.g. 2/1, 2/2, etc.), I always get the following crash at runtime, immediately after the LAMMPS processes start on the Xeon Phi:

$ cat colloid_test.log

LAMMPS (29 Aug 2014)

KOKKOS mode is enabled (…/lammps.cpp:469)

using 1 OpenMP thread(s) per MPI task

*** glibc detected *** …/src/lmp_intel_intelmpi: realloc(): invalid old size: 0x0000000001b0fc00 ***

*** glibc detected *** …/src/lmp_intel_intelmpi: realloc(): invalid old size: 0x0000000001f4c980 ***

*** glibc detected *** …/src/lmp_intel_intelmpi: realloc(): invalid old size: 0x0000000002fff980 ***

*** glibc detected *** …/src/lmp_intel_intelmpi: realloc(): invalid old size: 0x00000000021ab980 ***

Some of the compiler options in the standard src/MAKE/Makefile.intel are:

CC = mpiicpc -openmp -DLAMMPS_MEMALIGN=64 -no-offload

CCFLAGS = -O3 -xHost -fno-alias -ansi-alias -restrict -override-limits

SHFLAGS = -fPIC

DEPFLAGS = -M

LINK = mpiicpc -openmp

LINKFLAGS = -O3 -xHost

-xHost obviously is not compatible with -mmic, but what about the other options here? Or is the problem possibly elsewhere?
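
For reference, the kind of variation I have been trying looks roughly like this (just my guess: I swapped -xHost for -mmic and dropped -no-offload for a native build, and I am not sure the remaining options are all appropriate for the coprocessor):

CC = mpiicpc -mmic -openmp -DLAMMPS_MEMALIGN=64
CCFLAGS = -O3 -fno-alias -ansi-alias -restrict -override-limits
SHFLAGS = -fPIC
DEPFLAGS = -M
LINK = mpiicpc -mmic -openmp
LINKFLAGS = -O3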

Best Regards,

Paul.

Hi

Makefile.intel is intended for the Intel package (USER-INTEL), which uses offload mode. Kokkos runs in native mode (which is what you did, judging from your command line). But assuming you built using
"make intel MIC=yes OMP=yes" and used the MIC variant of Intel MPI, you should have gotten a correct executable. As Sikandar suggested, try running the LJ example and see if that works. But again, as Sikandar said, the Kokkos package is in its very early stages; we are still hunting bugs and trying to eliminate those before expanding much on capabilities. I'll try running the colloid example and see whether I can reproduce your issue. But even if it works eventually, don't expect good performance on KNC. It's not what I would consider a good platform for MD, though we have much higher hopes for the next generation, where certain issues we have right now will be (at least partly) addressed.

Christian

It's not what I would consider a good platform for MD,

To be fair, the numbers that we and some of our customers are measuring with KNC offload are comparable to, and in some cases significantly better than, offload to a Tesla K40c or K20. I think the same is true for NAMD, although I am not sure I have the latest GPU numbers. I should be presenting numbers soon that include production workloads and will send a link, but it should be easy to measure a couple of cases based on examples/intel/README.

though we have much higher hopes for the next generation

Knights Landing is a very exciting architecture...

- Mike

Hi

I didn't really want to bash Xeon Phi in general, I just want to manage expectations. In native mode (that's what we are mostly interested in, since both NERSC8 and Sandia's next machine Trinity will have self-hosted KNL), our experience (not just with LAMMPS) is that we can rarely match, let alone beat, the performance of a dual Ivy Bridge without going to explicit intrinsics programming, something we don't do here for maintainability/transferability reasons. So if you take LAMMPS as it comes and try to use the vanilla code, the OpenMP package, or the Kokkos package on Xeon Phi (in particular across multiple Xeon Phis) in native mode, you should expect performance more in the range of a single-socket 12-core Ivy Bridge processor.

For many, many use cases of LAMMPS we can't match dual Ivy Bridge performance with a K40 either, because too much data transfer goes on or necessary capabilities are not yet implemented/optimized. Essentially, GPUs are most useful if all the features you want to use can be run on the GPU. But in those cases a K40 can beat a dual Ivy Bridge significantly.

With regard to NAMD, the latest results I have seen were from a presentation earlier this year (google "NAMD ixpug"). In that case Titan was faster than Stampede, and in particular scaled much better. My discussions with the Gromacs developers point the same way: so far they have not been able to get benefits from Xeon Phi over spending the same power budget just on CPUs. So if anybody has updated information on success stories with Xeon Phi in some of the larger MD codes, I would be highly interested in it.

That said, my feeling is that we now need to optimize on Xeon Phi as much as possible, and then I expect to see a very significant jump with KNL (in particular, larger than the nominal peak-flop increase) due to a number of architectural improvements.

Christian

I agree w/ the strategy of optimizing native/symmetric performance in preparation for KNL, and that this is the way to go with Kokkos, but not that this is a useful way to assess heterogeneous models. For heterogeneous and hybrid machines, which can still be the case with KNL, I think an offload approach can potentially perform better than trying to run symmetrically on different chips; this has been demonstrated in LAMMPS, of course, for both GPU and Xeon Phi. I think it is important not to generalize observations based on specific implementations to the architecture.

With the new load balancing algorithms in LAMMPS, it might be possible to do interesting adjustments for "symmetric" runs using MPI on host and coprocessor.

I also agree that the entire situation will be much better with KNL.

For NAMD, I don't think you should compare parallel efficiency with an IB tree topology to a Gemini 3D torus. The plot with both on Stampede shows similar parallel efficiency. The latest publicly available numbers are here, for standard benchmarks with instructions to reproduce:

https://software.intel.com/en-us/articles/namd-for-intel-xeon-phi-coprocessor

Unfortunately this splits 23 PPN and 47 PPN into different plots, so you have to see which performs best at a given node count, but I think this is a power win at most of the points. Intel prefers to provide instructions for users to reproduce and compare with GPU rather than to provide GPU numbers. I find it difficult to believe, based on the numbers above, that XK7 node performance (w/ 1S Interlagos) would beat Stampede with 2S Xeon (since this is not a fair comparison). For LAMMPS, 2S Ivy Town + KNC on the 512K rhodo benchmark gives 1.92X the simulation rate of an XK7 node + GPU. Maybe this was an older version. As I said, I am not sure of the latest GPU numbers for NAMD, but the results in the code recipe look OK to me.

If we are talking about native performance, Gromacs is the best of the MD codes. For the 512K water benchmark with reaction field, KNC native performance is 1.03X a 2-socket E5-2697v2. I think this is a small win in terms of power. I have also seen good numbers for symmetric mode with Gromacs when using RF, but I think the team is still working on strategies that are general for parallel simulations. Not sure about power efficiency for GPU with Gromacs, but I am very skeptical.

- Mike

The current doc pages for doc/Section_accelerate.html, specifically doc/accelerate_kokkos.html, have specifics on how to build and run a problem with Kokkos Phi support. As Christian said, that is different from Phi support via the USER-INTEL package, which is what Makefile.intel is for. Also, pair style colloid is currently not supported by either KOKKOS or USER-INTEL, so it's not clear why you are trying that one.

Steve

FYI, there are some numbers here for LAMMPS on different Intel Xeon and Phi processors and also GPUs:

https://sites.google.com/site/wmbrown85/brown_hpcuf_54.pdf?attredirects=0

This talk has some slides discussing the optimizations that were done. I post it here since one user asked about this, it helps explain the changes that were made, and it may be of use for your work:

https://sites.google.com/site/wmbrown85/brown_lammps_vii_emmsb.pdf?attredirects=0

Best,

- Mike