USER-MEAM compilation

Dear Users,
I have compiled the "user-meam" package with "intel_cpu_intelmpi" on an SMP system. It compiled without any error. I ran tests with OMP_NUM_THREADS=1 and "mpirun -np …" without any error. In this case, all cores were used.

However, whenever I set OMP_NUM_THREADS=xx and ran lmp_intel_cpu_intelmpi without the mpirun command, only one core was used. Other packages such as "molecule" ran with OpenMP without any problem.

I think OpenMP on an SMP system is faster than MPI, so any help is appreciated.

Regards

David

> Dear Users,
> I have compiled the "user-meam" package with "intel_cpu_intelmpi" on an SMP system. It compiled without any error. I ran tests with OMP_NUM_THREADS=1 and "mpirun -np ..." without any error. In this case, all cores were used.

it is USER-MEAMC.

> However, whenever I set OMP_NUM_THREADS=xx and ran lmp_intel_cpu_intelmpi without the mpirun command, only one core was used. Other packages such as "molecule" ran with OpenMP without any problem.
>
> I think OpenMP on an SMP system is faster than MPI, so any help is appreciated.

a) you are thinking wrong. with domain-decomposed MD codes, using MPI
is usually as efficient or more efficient, especially if the software,
like LAMMPS, has been built from the ground up for efficient MPI
parallelization. in LAMMPS, multi-threading was added after the fact
and there are multiple performance issues.
b) in LAMMPS using multi-threading requires using specific
multi-threaded styles. there are three different packages with thread
support (and different requirements and strategies). please review
https://lammps.sandia.gov/doc/Section_accelerate.html for more
details.
c) because of b), simply setting OMP_NUM_THREADS doesn't do anything
by itself (see the example below this list).
d) there is no multi-threaded version of USER-MEAMC
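
for styles that do have threaded variants (e.g. from the USER-OMP
package), threading must be requested explicitly. a minimal sketch,
assuming a binary built with USER-OMP; "in.lj" and the rank/thread
counts are placeholders:

# 2 MPI ranks x 4 OpenMP threads each; -sf omp switches to the /omp
# style variants, -pk omp 4 sets the thread count per rank
env OMP_NUM_THREADS=4 mpirun -np 2 lmp_intel_cpu_intelmpi -sf omp -pk omp 4 -in in.lj

# or equivalently, inside the input script:
package omp 4
suffix omp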

axel.

Dear Axel,

>> Dear Users,
>> I have compiled the "user-meam" package with "intel_cpu_intelmpi" on an SMP system. It compiled without any error. I ran tests with OMP_NUM_THREADS=1 and "mpirun -np ..." without any error. In this case, all cores were used.
>
> it is USER-MEAMC.

It was a typo, sorry. I used USER-MEAMC.

>> However, whenever I set OMP_NUM_THREADS=xx and ran lmp_intel_cpu_intelmpi without the mpirun command, only one core was used. Other packages such as "molecule" ran with OpenMP without any problem.
>>
>> I think OpenMP on an SMP system is faster than MPI, so any help is appreciated.

> a) you are thinking wrong. with domain-decomposed MD codes, using MPI
> is usually as efficient or more efficient, especially if the software,
> like LAMMPS, has been built from the ground up for efficient MPI
> parallelization. in LAMMPS, multi-threading was added after the fact
> and there are multiple performance issues.

Thanks for the information about LAMMPS. With other codes such as LATTE (with intelmpi/mpiifort), I got faster times when I simply used the above strategy (setting OMP_NUM_THREADS to the total core count) on an SMP machine. I am not sure, but I think MPI is the best choice for clusters with distributed memory. At least for some other codes, OpenMP is better on shared memory systems.

> b) in LAMMPS using multi-threading requires using specific
> multi-threaded styles. there are three different packages with thread
> support (and different requirements and strategies). please review
> https://lammps.sandia.gov/doc/Section_accelerate.html for more
> details.

Thanks for the information.

> c) because of b), simply setting OMP_NUM_THREADS doesn't do anything.

> d) there is no multi-threaded version of USER-MEAMC

I guessed that USER-MEAMC doesn't use OpenMP, but I was not sure. Thank you.

> axel.

> Thanks for the information about LAMMPS. With other codes such as LATTE (with intelmpi/mpiifort), I got faster times when I simply used the above strategy (setting OMP_NUM_THREADS to the total core count) on an SMP machine. I am not sure, but I think MPI is the best choice for clusters with distributed memory. At least for some other codes, OpenMP is better on shared memory systems.

it *very* much depends on the physics and how it is translated to
software, and on how much skill and effort people have invested in
either parallelization scheme.

for trivial parallelization problems, there is little difference
between the two. but for complex problems, the situation can be quite
difficult. a replicated data classical MD code is much easier to
implement with MPI than a distributed data code; however, only the
latter will give you good strong scaling beyond a handful of
processors. writing an OpenMP parallel code is often easier, but there
are a number of performance challenges, some of which (false sharing,
race conditions) are not quite as obvious (see the sketch below).

i've used the case of a simple MD code for a case study lecture at
workshops on code optimization and parallelization. the material is
posted at: https://sites.google.com/site/akohlmey/software/ljmd
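
to illustrate the false sharing issue: a minimal C/OpenMP sketch (not
taken from the ljmd material; the array size, thread cap, and the toy
workload are made up for illustration). each thread accumulates into
its own element of a shared array, so neighboring elements sit on the
same cache line and the threads keep invalidating it for each other;
a thread-private accumulator avoids that and is typically much faster.

/* false-sharing demo; compile with e.g.: cc -O2 -fopenmp demo.c */
#include <stdio.h>
#include <omp.h>

#define N 100000000L

int main(void)
{
    int nthreads = omp_get_max_threads();
    if (nthreads > 64) {              /* demo assumes at most 64 threads */
        fprintf(stderr, "demo assumes <= 64 threads\n");
        return 1;
    }

    double sum[64];                   /* adjacent slots share cache lines */
    double total = 0.0;

    for (int i = 0; i < nthreads; ++i) sum[i] = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < N; ++i)
            sum[tid] += 1.0 / (double)(i + 1);  /* hot writes to a shared line */
    }
    for (int i = 0; i < nthreads; ++i) total += sum[i];
    printf("shared array: %.6f in %.3f s\n", total, omp_get_wtime() - t0);

    total = 0.0;
    t0 = omp_get_wtime();
    #pragma omp parallel
    {
        double local = 0.0;                     /* private: no cache-line ping-pong */
        #pragma omp for
        for (long i = 0; i < N; ++i)
            local += 1.0 / (double)(i + 1);
        #pragma omp atomic
        total += local;                         /* one synchronized update per thread */
    }
    printf("private sum : %.6f in %.3f s\n", total, omp_get_wtime() - t0);
    return 0;
}

the same accumulate-privately-then-reduce pattern is what a threaded
force kernel has to do for per-atom force arrays, and that reduction
is part of the overhead of multi-threading an MD code.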

axel.