Problems using MPI+OpenMP to run TIP4P

Dear LAMMPS users,

Since MPI+OpenMP can increase efficiency in certain cases, I did two runs for comparison, both using 8 cores on 1 node. The system is a water droplet confined between graphene slabs. The DREIDING force field and TIP4P are used for the graphene and water, respectively. The 1st run is MPI only (8 MPI tasks) and the 2nd is 2 MPI tasks with 4 OpenMP threads each.

While the MPI-only run finished without any difficulties, the 2nd one crashes due to out-of-range atoms. Usually this indicates something wrong with the initial structure or setup. However, since the MPI-only run works, that does not seem to be the case here.

Going from MPI-only to MPI+OMP, I made only the following changes:

  1. Added package omp 4 force/neigh in the in file.

  2. In the PBS file, replaced the normal MPI submission line with
    mpirun -x OMP_NUM_THREADS=4 -np 2 /home/565/mxm565/bin/lmp_5Aug13-f -sf omp -screen scr.log -in in

So I don't understand why I get this error.

A related question: I have set OMP_NUM_THREADS=4 explicitly, both in the terminal and in the PBS file, so why does LAMMPS still report 1 OpenMP thread at the beginning of the log file, as shown here:
LAMMPS (5 Aug 2013)
using 1 OpenMP thread(s) per MPI task
package omp *

All the files are attached, including the log files. A separate file, cmd.log, generated via lmp_5Aug13-f -help > cmd.log, lists the available commands.

Thanks for your help in advance.

Best

Ming

in-and-output-files.tar.gz (724 KB)

Dear LAMMPS users,

Since MPI+OpenMP can increase efficiency in certain cases, I did two runs
for comparison, both using 8 cores on 1 node. The system is a water
droplet confined between graphene slabs. The DREIDING force field and TIP4P
are used for the graphene and water, respectively. The 1st run is MPI only
(8 MPI tasks) and the 2nd is 2 MPI tasks with 4 OpenMP threads each.
  While the MPI-only run finished without any difficulties, the 2nd one
crashes due to out-of-range atoms. Usually this indicates something wrong
with the initial structure or setup. However, since the MPI-only run works,
that does not seem to be the case here.

no. this is due to you using temperature biasing/unbiasing, which is not
thread safe in the stock LAMMPS version.

  Going from MPI-only to MPI+OMP, I made only the following changes:
  1. Added package omp 4 force/neigh in the in file.
  2. In the PBS file, replaced the normal MPI submission line with
  mpirun -x OMP_NUM_THREADS=4 -np 2 /home/565/mxm565/bin/lmp_5Aug13-f -sf
omp -screen scr.log -in in
  So I don't understand why I get this error.

you have to either use LAMMPS-ICMS or insert the line

suffix off

before you define any fixes and computes.
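In input-file terms, the placement looks like this (a minimal sketch; the fix group and thermostat parameters are placeholders, not taken from the attached input):

```
package omp 4 force/neigh    # threading for pair/neighbor work

suffix off                   # must appear before any fix/compute definitions

# ... group and compute definitions ...

fix 1 all nvt temp 300.0 300.0 100.0   # resolves to plain fix nvt, not nvt/omp
```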

  A related question: I have set OMP_NUM_THREADS=4 explicitly, both in the
terminal and in the PBS file, so why does LAMMPS still report 1 OpenMP
thread at the beginning of the log file, as shown here:
  LAMMPS (5 Aug 2013)
  using 1 OpenMP thread(s) per MPI task
  package omp *

you have to talk to the people operating your cluster. this has nothing to
do with lammps, but everything to do with the local setup.

axel.

Hi Axel,

Thanks for your help.

  1. I added suffix off before all the compute and fix commands, but the out-of-range atoms problem still occurred.

  2. Regarding 'this is due to you using temperature biasing/unbiasing. this is not thread safe in the stock LAMMPS version': I don't quite understand this. First, what do you mean by temperature biasing/unbiasing? Second, if this refers to simulation at a finite temperature, does it mean that I can't run LAMMPS with threads at a finite temperature under certain conditions?

Best

Ming

Hi Axel,

Thanks for your help.
1. I added suffix off before all the compute and fix commands, but the
out-of-range atoms problem still occurred.

works for me. perhaps your LAMMPS executable is miscompiled.

2. Regarding 'this is due to you using temperature biasing/unbiasing. this
is not thread safe in the stock LAMMPS version': I don't quite understand.
First, what do you mean by temperature biasing/unbiasing?

you use: fix_modify 3 temp flow_comtemp

and: fix_modify 4 temp gra_temp

which instructs the thermostat/integrators to remove a bias before
thermostatting, stash it away, and put it back afterwards.
this entire process is not thread safe the way it is done in the stock
LAMMPS version.
that means you must *not* use it with fix nvt/omp; you *should* use fix
nvt instead.
however, if you use -suffix omp, you *will* get nvt/omp, unless you turn
suffix processing off.
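Concretely, the difference can be sketched like this (fix ID and group taken from the quoted lines above; the thermostat parameters are placeholders):

```
# with -sf/-suffix omp active, this line silently becomes fix nvt/omp,
# which is NOT safe together with a fix_modify temperature bias:
fix 3 flow nvt temp 300.0 300.0 100.0
fix_modify 3 temp flow_comtemp

# safe variant: disable suffix processing around the definition
suffix off
fix 3 flow nvt temp 300.0 300.0 100.0
fix_modify 3 temp flow_comtemp
suffix on    # /omp styles remain available for everything else
```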

Second, if this refers to simulation at a finite temperature, does it mean
that I can't run LAMMPS with threads at a finite temperature under certain
conditions?

this question makes no sense. so i assume the answer would be no.

axel.

Hi Axel,

I'm really happy to know that it works for you, as I see some hope of solving this. However, I don't understand why it didn't work for me. I added suffix off before the group definitions, i.e.


suffix off

# global group definitions

group gra type 3
group oxygen type 1
group hydrogen type 2
group flow type 1 2

I've also attached the revised input file here; if you could test it with your LAMMPS build, I would appreciate it a lot.

About the LAMMPS build: I don't understand what you mean by miscompiled. I used the Intel compiler (version 12.1.9.293) with openmpi/1.6.3; the CCFLAGS are listed below:

CCFLAGS = -fast -no-prec-sqrt -openmp
-funroll-loops -fstrict-aliasing -Wall -W -Wno-uninitialized

I've also used FFTW/3.3.3 with float precision (-DFFT_SINGLE). Although these flags seem aggressive (-fast), I used them because in one of your earlier replies about compiling with the Intel compiler (http://lammps.sandia.gov/threads/msg33228.html), you suggested the following, nearly identical, settings:

-O3 -xHOST -no-prec-div -no-prec-sqrt -fast-transcendentals -pc64
-ansi-alias -fno-rtti -fno-exceptions

Best

Ming

in (2.12 KB)

Hi, this is a follow-up to my previous message. I'm wondering whether anybody has been able to run LAMMPS using MPI+OpenMP with the TIP3P or TIP4P model?

Best
Ming

Hi, this is a follow-up to my previous message. I'm wondering whether
anybody has been able to run LAMMPS using MPI+OpenMP with the TIP3P or TIP4P model?

sure. i'm just re-checking my tests from here:
http://git.icms.temple.edu/git/?p=lammps-icms.git;a=tree;f=bench-accel/bench_tip4p-shake;hb=HEAD

and they seem to work fine.

axel.

Hi Axel,

I just went through the files you posted at your link. It seems that you were not using the OMP version of LAMMPS, as in all three log files there, the 2nd line shows

using 1 OpenMP thread(s) per MPI task

and the summary sections show lines like

Loop time of 13.4297 on 4 procs (4 MPI x 1 OpenMP) for 100 steps with 23814 atoms

which indicates that only one OpenMP thread per MPI task was used.

Best

Ming

because these are my _reference_ results. those were actually done
completely _without_ OpenMP.
i use those to compare any test calculations with, as the OpenMP runs need
to produce the same energies.

the line in the output doesn't mean that OpenMP is used. it only says that
1 thread is available.
you need to use more detailed internal profiling to know how much OpenMP is
used. the patch for that is in steve's inbox and already available in
LAMMPS-ICMS. if you download/compile today's LAMMPS-ICMS, you can add the
command "timers full" and then you'll get something that looks like this
(this was run on 8 nodes with 2x6-core CPU and the test system enlarged by
replicate 2 2 2):

mpirun -npernode 4 -x OMP_NUM_THREADS=3
~/compile/lammps-icms/src/lmp_owlsnest-omp -nocite -log none -in
in.tip4p-big -sf omp

[...]

Loop time of 7.83227 on 96 procs for 200 steps with 190512 atoms
299.8% CPU use with 32 MPI tasks x 3 OpenMP threads
Performance: 4.413 ns/day 5.439 hours/ns 25.535 timesteps/s

MPI task timings breakdown
Section | min time | avg time | max time |%varavg| %CPU | %total

Hi Axel,

This detailed profiling in the coming version of LAMMPS is really good!

By the way, regarding my question, I still have the following issues:

  1. In your log files, does the 2nd line still say something like

using 1 OpenMP thread(s) per MPI task

even in your first run, which used 3 OMP threads per task?

  2. Is there any information during the compilation stage that shows the OMP version is included? During my compilation, I did the following things to make sure I added the OMP version:

a) Added -openmp to CCFLAGS and LINKFLAGS (I used the Intel compiler).

b) Ran make add-user-OMP as the last step when enabling the OMP package.

c) Checked Makefile.package before running make

So can I be sure that the OMP version is included?

  3. The last one is a request. You suggested the problem could also be miscompilation, but recompiling LAMMPS is quite time-consuming for me (about 50 minutes), so I've attached my data files here. If possible, could you spend 2 minutes running them on your machine to see whether they work, so I can tell whether the problem is due to miscompilation?

Thanks a lot

Best
Ming

omp-mpi.tar.gz (720 KB)

Hi Axel,

This detailed profiling of the coming version of LAMMPS is really good!
By the way, regarding my question, I still have the following problems,

1. In your log files, does the 2nd line still say something like

       using 1 OpenMP thread(s) per MPI task

    even in your first run, which used 3 OMP threads per task?

no.

LAMMPS (12 Aug 2013-ICMS)
  using 3 OpenMP thread(s) per MPI task
package omp *
using multi-threaded neighbor list subroutines
prefer double precision OpenMP force kernels
units real
atom_style full
dimension 3
boundary p p p

2. Is there any information during the compilation stage that shows the
OMP version is included? During my compilation, I did the following things
to make sure I added the OMP version:

   a) Added -openmp to CCFLAGS and LINKFLAGS (I used the Intel compiler).
   b) Ran make add-user-OMP as the last step when enabling the OMP package.
   c) Checked Makefile.package before running make

   So can I be sure that the OMP version is included?

you can easily test this yourself. when you run lammps with the -h
flag, it shows you all included styles,
and you won't get any output mentioning OpenMP unless you have
compiled with OpenMP support.
most likely, there is something odd about your MPI installation, or
you have a different library than what i use, and thus the way you
determine job placement and enable threads on the compute nodes is
different. you have to read the corresponding documentation and/or
talk to the people running that machine. that is why i always test on
my desktop/laptop first. this way i know for a fact how it is set up
and should work.
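For example, a quick check along these lines (using the executable name from this thread; the grep is just a coarse filter):

```shell
# an OpenMP-enabled binary lists many /omp styles in its help output;
# a plain build lists none
./lmp_5Aug13-f -h | grep -c '/omp'
```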

3. The last one is a request. You suggested the problem could also be
miscompilation, but recompiling LAMMPS is quite time-consuming for me
(about 50 minutes), so I've attached my data files here. If possible, could
you spend 2 minutes running them on your machine to see whether they work,
so I can tell whether the problem is due to miscompilation?

no can do. i was lucky to find a few empty nodes for testing this
morning, since the cluster i am testing on is 6 timezones away. those
are all occupied now.

there are a gazillion things that can go wrong. not only the compiler.

do what every normal person does. pick a few of the benchmark test
inputs, starting with the lj benchmark input, and validate those.
then add more complications piece by piece and compare threads vs. no threads.
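A sketch of that workflow, assuming the standard bench/ directory shipped with LAMMPS (the paths, binary name, and core counts are examples, not taken from this thread):

```shell
cd lammps/bench
# reference run: MPI only, 8 tasks
mpirun -np 8 ../src/lmp_omp -in in.lj -log log.8x1
# same total core count, with threading enabled
mpirun -x OMP_NUM_THREADS=4 -np 2 ../src/lmp_omp -sf omp -in in.lj -log log.2x4
# compare the two logs; differences should be limited to timings,
# while the thermo (energy) columns should agree
diff log.8x1 log.2x4
```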

also, i would test on a desktop first. install only packages that you
absolutely need and then compilations will be quite fast.

axel.

re-compiling LAMMPS is quite time consuming (for me it takes about 50 minutes),

Why is that?

a) don’t build with packages you’re not using
b) use make -j to build in parallel

On my box (12 cores), using just the standard packages
(minus a few I don’t normally use), it takes 15 secs
to build LAMMPS with g++.
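For example, a trimmed build might look like this (the package list and machine target are illustrative, not Ming's actual configuration):

```shell
cd src
make no-all                                  # start with all optional packages disabled
make yes-kspace yes-molecule yes-user-omp    # enable only what the input needs
make -j 8 openmpi                            # parallel build with the chosen machine makefile
```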

Steve

Hi Axel,

Thanks for your quick reply. The head of the log file you showed answers my question. Also, I just ran LAMMPS with -h and could see all the OMP-related commands, which indicates that they were included in the compilation. I did contact the support staff of the supercomputer I'm using yesterday; their reply is shown below, and it is what I'm using now. However, I'll contact them again about this, as it doesn't seem to work.

If you want to try LAMMPS with a combination of OMP + MPI, it may make sense to use 1 MPI process per socket and let the cores run the OMP threads.
On raijin, mpirun should look like this (PBS script in bash):

export OMP_NUM_THREADS=8
mpirun --npersocket 1 -np $(( $PBS_NCPUS / 8 ))

Hi Steve,

Thanks for your instructions. I did include only the packages I use, and I've noticed that when compiling LAMMPS with the Intel compiler, builds were quite slow; it seems to be something related to a license issue. By the way, I've never tried the -j option for make before, but I'll give it a try.

Best
Ming

Hi,

This is a quick update. I followed your suggestion and used the melt example (rapid melt of a 3d LJ system) to test. The only input file, in, is attached.

On one node, I tried

  1. 8 MPI tasks + 1 thread per task: OK.

  2. Submitted using mpirun -x OMP_NUM_THREADS=8 --npersocket 1 -np 1 ~/bin/lmp_5Aug13-f -sf omp -screen scr.log -in in. It finished, but only one MPI task with 1 OpenMP thread was used. However, the log file shows that the OMP version was used:
    Last active /omp style is pair_style lj/cut/omp

  3. Added
    package omp 8 force/neigh
    to the input file, and I got the errors listed at the bottom.

So does this mean the problem is caused by the compilation of LAMMPS or by the OpenMPI setup on the supercomputer (since in the 2nd run, OMP was invoked)? If it is caused by the compilation, should I use less aggressive optimizations (I'm currently using -fast with the Intel compiler)?

Thanks

Best

Ming

[r2404:10727] *** Process received signal ***
[r2404:10727] Signal: Segmentation fault (11)
[r2404:10727] Signal code: Address not mapped (1)
[r2404:10727] Failing at address: 0x634f9000
[r2404:10727] [ 0] /lib64/libpthread.so.0(+0xf500) [0x7fa0e8b03500]
[r2404:10727] [ 1] /apps/openmpi/1.6.3/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x7b1) [0x7fa0e9fc8ee1]
[r2404:10727] [ 2] /apps/openmpi/1.6.3/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0xbf) [0x7fa0e9fc979f]
[r2404:10727] [ 3] /apps/openmpi/1.6.3/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0xb3) [0x7fa0e9fca3b3]
[r2404:10727] [ 4] /usr/lib64/libstdc++.so.6(_Znwm+0x1d) [0x7fa0ea54609d]
[r2404:10727] [ 5] /usr/lib64/libstdc++.so.6(_Znam+0x9) [0x7fa0ea5461b9]
[r2404:10727] [ 6] /home/565/mxm565/bin/lmp_5Aug13-f(_ZN9LAMMPS_NS6FixOMPC1EPNS_6LAMMPSEiPPc+0x645) [0xba6085]
[r2404:10727] [ 7] /home/565/mxm565/bin/lmp_5Aug13-f(_ZN9LAMMPS_NS6Modify11fix_creatorINS_6FixOMPEEEPNS_3FixEPNS_6LAMMPSEiPPc+0x47) [0xb52807]
[r2404:10727] [ 8] /home/565/mxm565/bin/lmp_5Aug13-f(ZN9LAMMPS_NS6Modify7add_fixEiPPcS1+0xb5b) [0xe6ee5b]
[r2404:10727] [ 9] /home/565/mxm565/bin/lmp_5Aug13-f(_ZN9LAMMPS_NS5Input15execute_commandEv+0x11ee) [0xe6d80e]
[r2404:10727] [10] /home/565/mxm565/bin/lmp_5Aug13-f(_ZN9LAMMPS_NS5Input4fileEv+0x15b) [0xb9d75b]
[r2404:10727] [11] /home/565/mxm565/bin/lmp_5Aug13-f(main+0xa0) [0xb7fea0]
[r2404:10727] [12] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7fa0e877fcdd]
[r2404:10727] [13] /home/565/mxm565/bin/lmp_5Aug13-f() [0x54abd9]
[r2404:10727] *** End of error message ***

in (464 Bytes)

Hi,

This is a quick update. I followed your suggestion and used the melt example
(rapid melt of a 3d LJ system) to test. The only input file, in, is attached.

On one node, I tried

1. 8 MPI tasks + 1 thread per task: OK.

2. Submitted using mpirun -x OMP_NUM_THREADS=8 --npersocket 1 -np 1
~/bin/lmp_5Aug13-f -sf omp -screen scr.log -in in. It finished, but only one
MPI task with 1 OpenMP thread was used. However, the log file shows that the
OMP version was used:
   Last active /omp style is pair_style lj/cut/omp

3. Added
   package omp 8 force/neigh
   to the input file, and I got the errors listed at the bottom.

   So does this mean the problem is caused by the compilation of LAMMPS or
by the OpenMPI setup on the supercomputer (since in the 2nd run, OMP was
invoked)? If it is caused by the compilation, should I use less aggressive
optimizations (I'm currently using -fast with the Intel compiler)?

you have a segmentation fault in the MPI library. not in LAMMPS.
i would *definitely* try compiling LAMMPS with GCC for a change.

axel.

Hi Axel,

[...]

3. The last one is a request. You suggested the problem could also be
miscompilation, but recompiling LAMMPS is quite time-consuming for me
(about 50 minutes), so I've attached my data files here. If possible, could
you spend 2 minutes running them on your machine to see whether they work,
so I can tell whether the problem is due to miscompilation.

here is what i got with your input on a different machine this
morning. i had 4 nodes with dual quad-core CPUs.

log.mpi-only-4x1:Loop time of 4.99471 on 4 procs for 200 steps with 10817 atoms
log.mpi-only-4x2:Loop time of 3.12021 on 8 procs for 200 steps with 10817 atoms
log.mpi-only-4x4:Loop time of 2.4714 on 16 procs for 200 steps with 10817 atoms
log.mpi-only-4x8:Loop time of 2.02534 on 32 procs for 200 steps with 10817 atoms

this is the time using MPI only with 4 nodes and -npernode 1,2,4,8

here is the detailed timing breakdown for the last run and you can see
that there still is a significant load imbalance
and that Kspace is dominant.

Loop time of 2.02534 on 32 procs for 200 steps with 10817 atoms
99.8% CPU use with 32 MPI tasks x 1 OpenMP threads
Performance: 17.064 ns/day 1.406 hours/ns 98.749 timesteps/s

MPI task timings breakdown
Section | min time | avg time | max time |%varavg| %CPU | %total

Hi Axel,

Thanks for such a detailed analysis. The good news is that I finally got the MPI+OMP parallelization working with help from the supercomputer staff; the previous problem was indeed due to miscompilation.

I've done some tests following your suggestions and they went well. The only problem I've met concerns increasing the cutoff to improve efficiency. Following your earlier suggestions, I used neighbor multi + communicate multi + neigh_modify one (if needed), but it turns out that the efficiency is lowered in all cases. In your opinion, have I done something wrong with this method?
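For reference, the settings described above correspond to input lines like these (the skin distance and neighbor limits are placeholders to experiment with, not values from the actual input):

```
neighbor     2.0 multi                  # size-dependent neighbor binning
communicate  multi                      # per-type communication cutoffs
neigh_modify one 3000 page 150000       # raise the per-atom neighbor limit if needed
```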

Best
Ming

Hi Axel,

Thanks for such a detailed analysis. The good news is that I finally got
the MPI+OMP parallelization working with help from the supercomputer staff;
the previous problem was indeed due to miscompilation.
  I've done some tests following your suggestions and they went well. The
only problem I've met concerns increasing the cutoff to improve efficiency.
Following your earlier suggestions, I used neighbor multi + communicate
multi + neigh_modify one (if needed), but it turns out that the efficiency
is lowered in all cases. In your opinion, have I done something wrong with
this method?

there are far too many subtle factors involved.
i don't think that the difference in cutoffs
is large enough to warrant the multi flags,
at least not at the node count you are testing.

it is all about the relative cost of operations and the relative
speedup from multi-threading.

your problem is extremely ugly, because you have to spend lots of time
on kspace, but for the most part you are just accumulating and
communicating zeroes there.

it also depends on the size of the simulated system. is this the final
size, or will you do larger systems?
there are two more possible things to try, but both will need some
programming...

axel.

Hi Axel,

In fact I've increased the efficiency a lot by following your suggestions,
and I'm quite happy with it. Thanks again for your many useful suggestions.
  For the cutoff: since there are 759 water molecules, as long as the
cutoff is big enough (5.0 nm) the neighbor count exceeds 2000 (the default
limit), but just as you suggest, this needs some experimentation, and I'll try.
  This is one of several systems I'm working on, and it would be nice if
the efficiency for these particularly 'ugly' systems could be increased. If
that requires some programming, then provided it could increase the
efficiency by say 30-40% and the programming itself is not too difficult, I
would like to give it a try; otherwise, I'll go with the present set-up.

Best
Ming