Error when running on multiple processors with pppm but not ewald

I am having difficulty running a simulation on several processors; my input script is provided below. When I run it on 4 processors with pppm, it works fine. When I run on 5 or 6 processors, I get the error below. With the ewald solver, however, I have no problems. The system size is 1296 atoms.

[ubuntu:47558] *** An error occurred in MPI_Waitany
[ubuntu:47558] *** on communicator MPI COMMUNICATOR 5 DUP FROM 0
[ubuntu:47558] *** MPI_ERR_TRUNCATE: message truncated
[ubuntu:47558] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

There is not enough information here.
- You don't provide the *real* error message: what you quote is only the error from the MPI library, and there is likely an error or warning preceding it.
- You don't say whether you have tested the latest version of LAMMPS.
- You only quote your input script, but don't provide the other files necessary to reproduce the error.

If I put the code following the "timestep" command into the rhodo benchmark input, it works fine with any number of processors I can test, which suggests that the problem is either with your system geometry or your force field.

Axel.

  1. There is no warning preceding that error message.
  2. The data file is of quartz, generated in Materials Studio (this data file has worked for all other potentials I have used).
  3. It also works with the ewald solver, just not with pppm.
  4. Why would the number of processors matter? It runs fine with 4 processors but not with 5. How does that indicate a problem with the geometry or the force field? I am not presuming to say it does not; I just do not understand why the number of processors matters. Running fine with 4 processors and not with 5 does not make sense to me, which is why I am asking for an expert's opinion.
  5. Could virtualization contribute to this error?
  6. I was using the January 2014 version of LAMMPS.

1) There is no warning preceding that error message.

But you could also have had output indicating an unstable time integration, or the output was on a remote processor.

2) The data file is of quartz, generated in Materials Studio (this data file has worked for all other potentials I have used).

So what? Without those files, nobody but you can reproduce the crash.

3) It also works with the ewald solver, just not with pppm.

So what?

4) Why would the number of processors matter? It runs fine with 4 processors but not with 5. How does that indicate a problem with the geometry or the force field? I am not presuming to say it does not; I just do not understand why the number of processors matters. Running fine with 4 processors and not with 5 does not make sense to me, which is why I am asking for an expert's opinion.

Again, you are making it extremely hard not to call you out for jumping to conclusions without knowing what you are talking about. If you spent some time reading about how LAMMPS is parallelized, it would make sense: LAMMPS divides the simulation box spatially among the MPI ranks (and PPPM additionally distributes its FFT grid across them), so changing the processor count changes the decomposition and the communication pattern, and a problem can surface with one partitioning but not another.
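
(For concreteness, a minimal sketch that was not part of the original thread: since LAMMPS assigns each MPI rank its own spatial sub-domain, the standard "processors" command can pin the grid, which lets you test whether a specific decomposition triggers the crash. The values below are illustrative and must multiply to the number of ranks given to mpirun.)

processors 5 1 1    # reproduce the automatic 5x1x1 grid; must precede read_data
# processors 1 5 1  # an alternative 5-rank decomposition to compare against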

5) Could virtualization contribute to this error?

Why?

6) I was using the January 2014 version of LAMMPS

So then use the current version and see if the error still exists.

Without knowing whether the issue is present in the current code, and without being able to reproduce it, nobody will spend a second thought on this. Unless you make a convincing case that this is a LAMMPS problem and not bad input, you'll be on your own.

Axel.

Another important diagnostic is how soon it happens, and whether the two runs (on different processor counts) are identical up to that point. Also, is the thermo output of the run that crashes stable up to that point?

Steve
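
(A hedged illustration of that diagnostic, using standard LAMMPS thermo commands; the choice of printed quantities is just an example. Per-step output makes an unstable run visible before the crash and makes logs from different processor counts easy to diff.)

thermo 1                                       # print thermodynamics every step
thermo_style custom step temp press pe etotal  # quantities to compare across runs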

Steve,

This discussion went (unnoticed by me) off-list. In summary, that particular example crashes during setup(), the LAMMPS version is old, and current LAMMPS works, so it is likely due to a bug we already fixed.

I tested this hypothesis and found the contrary.

So it most likely has to do with my hardware and is not a problem with LAMMPS. Thank you both for your help with this problem.

bjcowen@…33…1531…:~/Desktop/lammps-30Apr14/MyExamples/quartz/hexagonal/morse_pppm/STP$ mpirun -np 5 ~/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup < in.morse
LAMMPS (30 Apr 2014)
Reading data file ...
triclinic box = (-7.50597 -0.236377 -0.300289) to (21.972 25.2923 21.3205) with tilt (-14.739 0 0)
5 by 1 by 1 MPI processor grid
reading atoms ...
1296 atoms
Finding 1-2 1-3 1-4 neighbors ...
0 = max # of 1-2 neighbors
0 = max # of 1-3 neighbors
0 = max # of 1-4 neighbors
1 = max # of special neighbors
Changing box ...
triclinic box = (-7.50597 -0.236377 -0.300289) to (21.972 25.2923 21.3205) with tilt (-14.739 0 0)
WARNING: Resetting reneighboring criteria during minimization (../min.cpp:173)
PPPM initialization ...
G vector (1/distance) = 0.242374
grid = 40 16 32
stencil order = 5
estimated absolute RMS force accuracy = 1.61323e-05
estimated relative force accuracy = 1.12033e-06
using double precision FFTs
3d grid and FFT values/proc = 15249 4480
Setting up minimization ...
[ubuntu:12054] *** An error occurred in MPI_Waitany
[ubuntu:12054] *** on communicator MPI COMMUNICATOR 9 DUP FROM 0
[ubuntu:12054] *** MPI_ERR_TRUNCATE: message truncated
[ubuntu:12054] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)

Also of interest is the different error message for 6 processors:

LAMMPS (30 Apr 2014)
Reading data file ...
triclinic box = (-7.50597 -0.236377 -0.300289) to (21.972 25.2923 21.3205) with tilt (-14.739 0 0)
3 by 2 by 1 MPI processor grid
reading atoms ...
1296 atoms
Finding 1-2 1-3 1-4 neighbors ...
0 = max # of 1-2 neighbors
0 = max # of 1-3 neighbors
0 = max # of 1-4 neighbors
1 = max # of special neighbors
Changing box ...
triclinic box = (-7.50597 -0.236377 -0.300289) to (21.972 25.2923 21.3205) with tilt (-14.739 0 0)
WARNING: Resetting reneighboring criteria during minimization (../min.cpp:173)
PPPM initialization ...
G vector (1/distance) = 0.242374
grid = 40 16 32
stencil order = 5
estimated absolute RMS force accuracy = 1.61323e-05
estimated relative force accuracy = 1.12033e-06
using double precision FFTs
3d grid and FFT values/proc = 12870 3840
Setting up minimization ...
[ubuntu:12956] *** Process received signal ***
[ubuntu:12956] Signal: Segmentation fault (11)
[ubuntu:12956] Signal code: (128)
[ubuntu:12956] Failing at address: (nil)
[ubuntu:12956] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f300b0f7cb0]
[ubuntu:12956] [ 1] /usr/lib/libfftw.so.2(+0x2f282) [0x7f300c027282]
[ubuntu:12956] [ 2] /usr/lib/libfftw.so.2(fftw+0x150) [0x7f300c0274b0]
[ubuntu:12956] [ 3] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(fft_3d+0x1db) [0x54c573]
[ubuntu:12956] [ 4] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS5FFT3d7computeEPdS1_i+0x1b) [0x54d77b]
[ubuntu:12956] [ 5] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS4PPPM10poisson_ikEv+0x78) [0x7251aa]
[ubuntu:12956] [ 6] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS4PPPM7poissonEv+0x21) [0x71bb5f]
[ubuntu:12956] [ 7] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS4PPPM7computeEii+0x1db) [0x728f17]
[ubuntu:12956] [ 8] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS3Min5setupEv+0x43f) [0x5de409]
[ubuntu:12956] [ 9] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS8Minimize7commandEiPPc+0x1d1) [0x5e2f9f]
[ubuntu:12956] [10] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS5Input15command_creatorINS_8MinimizeEEEvPNS_6LAMMPSEiPPc+0x2e) [0x5cdef9]
[ubuntu:12956] [11] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS5Input15execute_commandEv+0xb37) [0x5cad09]
[ubuntu:12956] [12] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(_ZN9LAMMPS_NS5Input4fileEv+0x2ed) [0x5cb947]
[ubuntu:12956] [13] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup(main+0x46) [0x5d97e2]
[ubuntu:12956] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f300ad4976d]
[ubuntu:12956] [15] /home/bjcowen/Desktop/lammps-30Apr14/src/lmp_ubuntu_backup() [0x48dcf9]
[ubuntu:12956] *** End of error message ***

It crashes inside FFTW, and it is FFTW v2, which is horribly outdated and extremely error-prone if people don't know what they are doing.

Try compiling without FFTW support.

Are you suggesting changing:

CCFLAGS = -O -DFFT_FFTW -DLAMMPS_GZIP -DMPICH_IGNORE_CXX_SEEK

to

CCFLAGS = -O -DLAMMPS_GZIP -DMPICH_IGNORE_CXX_SEEK

http://lammps.sandia.gov/doc/Section_start.html#start_2
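
(That reading of the advice appears correct: without any -DFFT_* define, LAMMPS of that vintage compiles its bundled KISS FFT and needs no external FFT library. A sketch, assuming the three-variable layout of the stock src/MAKE makefiles; a given machine makefile may instead carry the define in CCFLAGS as quoted above.)

FFT_INC  =    # no -DFFT_* define, so the bundled KISS FFT is used
FFT_PATH =    # no FFT library search path needed
FFT_LIB  =    # no FFT library on the link line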

One of the errors I get when I compile is:

bjcowen@…1531…:~/Desktop/lammps-30Apr14/src$ make ubuntu_backup
make[1]: Entering directory `/home/bjcowen/Desktop/lammps-30Apr14/src/Obj_ubuntu_backup'
Makefile:48: angle_charmm.d: No such file or directory
Makefile:48: angle_cosine.d: No such file or directory
Makefile:48: angle_cosine_delta.d: No such file or directory
Makefile:48: angle_cosine_periodic.d: No such file or directory
Makefile:48: angle_cosine_squared.d: No such file or directory
Makefile:48: angle.d: No such file or directory
Makefile:48: angle_harmonic.d: No such file or directory

It does this for several files. However, LAMMPS still compiles. Do you know of an error in my makefile that would cause these messages?

Does make stop because of that? No! So it is not an error.

...and if you look closely, you'll see that make processes the files twice when compiling: the first round is actually outputting each xxx.cpp processed with -MM to xxx.d.

Axel
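
(In other words, a paraphrase rather than the literal contents of that Makefile: the build generates a per-source dependency file with a pattern rule along these lines, plus an "include" of the .d files at the line the messages point to. On a fresh build those files do not exist yet, so make prints the messages, generates the files, and restarts.)

%.d: %.cpp
	$(CC) $(CCFLAGS) -MM $< > $@   # first pass: emit dependencies, not objects

include $(OBJ:.o=.d)   # hypothetical form; warns harmlessly on the first pass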

I tried switching to fftw3 instead of what I was previously using, since the old library was outdated. I get the error below and the build stops. I downloaded fftw3 from the website. I have attached my makefile if anyone cares to look at it.

fft3d.o: In function `fft_3d':
fft3d.cpp:(.text+0x4c): undefined reference to `fftw_execute_dft'
fft3d.cpp:(.text+0x8a): undefined reference to `fftw_execute_dft'
fft3d.cpp:(.text+0xcb): undefined reference to `fftw_execute_dft'
fft3d.o: In function `fft_3d_create_plan':
fft3d.cpp:(.text+0xcf4): undefined reference to `fftw_plan_many_dft'
fft3d.cpp:(.text+0xd4f): undefined reference to `fftw_plan_many_dft'
fft3d.cpp:(.text+0xdb2): undefined reference to `fftw_plan_many_dft'
fft3d.cpp:(.text+0xe0d): undefined reference to `fftw_plan_many_dft'
fft3d.cpp:(.text+0xe73): undefined reference to `fftw_plan_many_dft'
fft3d.o:fft3d.cpp:(.text+0xed1): more undefined references to `fftw_plan_many_dft' follow
fft3d.o: In function `fft_1d_only':
fft3d.cpp:(.text+0xfb5): undefined reference to `fftw_execute_dft'
fft3d.cpp:(.text+0xfc4): undefined reference to `fftw_execute_dft'
fft3d.cpp:(.text+0xfd6): undefined reference to `fftw_execute_dft'
fft3d.cpp:(.text+0xfea): undefined reference to `fftw_execute_dft'
fft3d.cpp:(.text+0xffc): undefined reference to `fftw_execute_dft'
fft3d.o:fft3d.cpp:(.text+0x100e): more undefined references to `fftw_execute_dft' follow
collect2: ld returned 1 exit status
make[1]: *** [../lmp_ubuntu_backup] Error 1
make[1]: Leaving directory `/home/bjcowen/Desktop/lammps-30Apr14/src/Obj_ubuntu_backup'
make: *** [ubuntu_backup] Error 2

Makefile.ubuntu_parallel (815 Bytes)

Why don't you, for a change, just pay attention to the advice you are given?
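
(For the record, the "undefined reference" errors above are a link-time problem rather than a compile-time one: fft3d.cpp was evidently compiled with the FFTW3 code path enabled, which the undefined fftw_execute_dft and fftw_plan_many_dft symbols indicate, but no FFTW3 library was named on the link line. A hedged sketch of settings that would resolve it, assuming fftw3 was built from the fftw.org tarball into the default /usr/local prefix; adjust the paths to the actual install location.)

FFT_INC  = -DFFT_FFTW3 -I/usr/local/include   # select the FFTW3 code path
FFT_PATH = -L/usr/local/lib                   # where libfftw3 was installed
FFT_LIB  = -lfftw3                            # the library the linker could not find

Alternatively, following the advice actually given, leaving all three variables empty falls back to the bundled KISS FFT and sidesteps the FFTW question entirely.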