[lammps-users] Problems with pair_coeff: Too long of a cut?

I have a simulation box that, after using the replicate command, spans (-5.38609 -5.38609 -11.8495) to (37.7026 37.7026 11.8495) and has ~8900 particles. The box is periodic in x and y, but not z.

When I use r_cut = 19.5 for my Coulombic potential, I get the following error message (copied directly from the output):

Setting up run …
p0_10196: p4_error: interrupt SIGSEGV: 11
p3_10214: p4_error: interrupt SIGSEGV: 11
p2_10208: p4_error: interrupt SIGSEGV: 11
rm_l_2_10209: (0.417969) net_send: could not write to fd=5, errno = 32
p5_10226: p4_error: interrupt SIGx: 13
p4_10220: p4_error: interrupt SIGx: 13
p1_10202: p4_error: interrupt SIGSEGV: 11
rm_l_1_10203: (0.484375) net_send: could not write to fd=5, errno = 32
p1_10202: (0.484375) net_send: could not write to fd=5, errno = 32
p6_10232: p4_error: interrupt SIGx: 13
rm_l_3_10215: (0.359375) net_send: could not write to fd=5, errno = 32
p3_10214: (12.359375) net_send: could not write to fd=5, errno = 32
p2_10208: (12.421875) net_send: could not write to fd=5, errno = 32
p5_10226: (12.242188) net_send: could not write to fd=5, errno = 32
p4_10220: (12.300781) net_send: could not write to fd=5, errno = 32
p6_10232: (14.183594) net_send: could not write to fd=5, errno = 32

I also have the following specified, in case it is relevant: “neigh_modify every 1 delay 10 check yes page 500000 one 10000”

I am not sure where I am going wrong nor how to interpret the error message. Is this specific to my computer? I can provide additional information as needed.

I appreciate any help,

Brian

It also does not work until I set r_cut for the Coulombic potential as low as 9.5. I’m thinking the error is with my computer.

It also does not work until I set r_cut for the Coulombic potential as
low as 9.5. I'm thinking the error is with my computer.

... or with your input. have you tried running the same system
with just one processor? you seem to be using MPICH, which has
the unfortunate habit of buffering all i/o and thus cutting off error
messages if they happen on remote nodes.
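a quick way to compare, assuming your binary is called lmp_debian and your input file is in.Test (adjust both names to your own setup):

```shell
# serial run -- no MPI involved at all
./lmp_debian < in.Test

# the same input on 2 MPI tasks
mpirun -np 2 ./lmp_debian < in.Test
```

if the serial run finishes cleanly and only the MPI run crashes, that narrows the problem down considerably.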

if you want somebody to help you with this, please post a
(small, but complete) input, so that somebody can try to
reproduce it.

cheers,
   axel.

It does work with r_cut = 19.5 while running on just one processor, but not with two or more.

Here’s a trimmed down version of my input (both attached & copied below):

# Initialize simulation box

dimension 3
boundary p p s
units lj
atom_style charge

# Create geometry

lattice sc 0.8
region simbox block -5.0 5.0 -5.0 5.0 -11.0 11.0
create_box 5 simbox

# Create regions

region a_wall block INF INF INF INF -11.0 -11.0
region b_wall block INF INF INF INF 11.0 11.0
region a_charge block INF INF INF INF -10.0 -10.0
region b_charge block INF INF INF INF 10.0 10.0
region A_1 block -4.0 -4.0 INF -3.5 -9.0 9.0
region B_1 block -3.0 -3.0 INF -3.5 -9.0 9.0
region A_2 block -2.0 -2.0 INF -3.5 -9.0 9.0
region B_2 block -1.0 -1.0 INF -3.5 -9.0 9.0

# Insert particles

create_atoms 1 region a_charge
create_atoms 2 region b_charge
create_atoms 3 region a_wall
create_atoms 3 region b_wall
create_atoms 4 region A_1
create_atoms 4 region A_2
create_atoms 5 region B_1
create_atoms 5 region B_2

replicate 4 4 1

# Create groups

group a_charge type 1
group b_charge type 2
group walls type 3
group A type 4
group B type 5
group A_B type 4 5
group all type 1 2 3 4 5

# Set masses

mass 1 1.0
mass 2 1.0
mass 3 1.0
mass 4 1.28
mass 5 1.28

pair_style lj/cut/coul/cut 2.5 19.5
pair_coeff 1*2 * 0 0.01 0 19.5 # I get errors if r_cut_Coul is >10.472 when running on two or more processors.
pair_coeff 3 3 0.4 1.0 2.5 0
pair_coeff 4*5 4*5 0 0.1 0 0

neigh_modify every 1 delay 10 check yes page 500000 one 10000

# Set charges

set group A charge -1.1
set group B charge 1.1
set group walls charge 0.0
set group a_charge charge 0.0
set group b_charge charge 0.0

# Initialize velocities

velocity A_B create 2.2 13

# Misc

fix a_charges a_charge setforce 0.0 0.0 0.0
fix b_charges b_charge setforce 0.0 0.0 0.0
fix walls walls setforce 0.0 0.0 0.0
fix thermostat A_B langevin 2.2 2.2 75.0 3
fix timeintegration A_B nve
timestep 0.001
thermo_style custom step temp etotal cpu
thermo 1000

run 2000

Thanks again for any help,

Brian

in.Test (1.81 KB)

It does work with r_cut = 19.5 while running on just one processor, but not
with two or more.
Here's a trimmed down version of my input (both attached & copied below):

this input works for me with a recent version of lammps.
i tested it with up to 64 mpi tasks and had no problem.

cheers,
   axel.

Does this imply:

  • that my computer is not big enough to run a system with an r_cut greater than ~10.5 on multiple processors? To clarify, the code always runs fine when r_cut is less than 10.5. Also, it does run with r_cut as large as 19.5 when I use only one processor.

  • that my compiled version (lmp_debian) of the most recent version of LAMMPS is faulty?

  • that something is wrong with my “mpirun” and I need to re-download it?

Forgive me for my lack of computer terminology; I am very new to this and am trying to determine the source of this error, be it a hardware or software issue, and if it is a software issue, which component of my LAMMPS installation I need to fix.

Lastly, I am using an 8-processor Linux machine with 4 GB of memory per processor, if that helps answer any of my questions.

Thanks for your help thus far,

Brian

Does this imply:

that my computer is not big enough to run a system with an r_cut greater
than ~10.5 on multiple processors? To clarify, the code always runs fine

no. the memory requirements for this system with 4 mpi tasks are
less than 20MB per processor.

when r_cut is less than 10.5. Also, it does run with r_cut as large as 19.5
when I use only one processor.
that my compiled version (lmp_debian) of the most recent version of LAMMPS
is faulty?

that is a possibility (but unlikely). you could edit Makefile.debian
(in src/MAKE) and change the optimization flags from " -O " to
" -O -fno-strict-aliasing ", which is a safer setting for recent GNU
compilers that very aggressively assume strict c++ standard
compliance (which is not true for lammps).
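as a sketch, assuming the stock Makefile.debian where the optimization flag is a bare " -O " followed by a space (check your copy first, and note the clean target name may differ between LAMMPS versions):

```shell
cd src/MAKE
# keep a backup of the original makefile
cp Makefile.debian Makefile.debian.orig
# insert -fno-strict-aliasing right after the -O flag
sed -i 's/-O /-O -fno-strict-aliasing /' Makefile.debian
cd ..
# rebuild from scratch so every object file picks up the new flag
# (if your LAMMPS version lacks clean-all, delete the object
# directory for your machine by hand)
make clean-all
make debian
```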

that something is wrong with my "mpirun" and I need to re-download it?

that is another possibility, but it is also not overly likely. from
the error messages you posted, it looks like you are using MPICH-1,
which has been used a lot with LAMMPS and should work. if it is a
communication-related issue, then it is more likely that your
communication hardware doesn't work so well or its driver is unable
to handle a large load. i would check the kernel message buffer
(dmesg command) and see if there is anything unusual. contact a
local linux expert to help you with this if you don't know much
about it yourself. it is very difficult to debug these kinds of
problems remotely via e-mail.
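for example:

```shell
# show the most recent kernel messages
dmesg | tail -n 50

# look for typical hardware/driver complaints; "|| true" keeps the
# pipeline from failing when nothing matches
dmesg | grep -i -E 'error|fault|mce' || true
```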

the important message is that your input _does_ work in principle and
in parallel, so it is somewhat unlikely that LAMMPS is to blame (there
is still the possibility that it is related to using uninitialized
memory, which may be set to zero on my machine and differently on
yours; then it works on my machines by accident and not by
construction).

Forgive me for my lack of computer terminology, I am very new to this and I
am trying to determine the source of this error, be it a hardware or
software issue. And if a software issue, what component of my LAMMPS
installation do I need to fix.

intermittent errors are always the worst to debug. i would say contact
somebody who knows linux very well and can run some independent
communication benchmarks to validate the network, and try recompiling
with -fno-strict-aliasing to avoid the main cause of miscompiled
lammps binaries on recent linux machines.

Lastly, I am using an 8 processor Linux machine with 4 GBs of memory per
processor, if that helps answer any of my questions.

what a waste to run lammps on this. :wink:

cheers,
   axel.

With the help of a local Linux expert, I ran the memtest86+ memory test, the Dell hardware diagnostic test, and dmesg, all of which indicated there was no problem.

I edited Makefile.debian and re-compiled according to your suggestion, but the problem persists.

Are there other suggestions you have as to the source of this problem?

Thanks,

Brian

With the help of a local Linux expert, I ran the memtest86+ memory
test, the Dell hardware diagnostic test, and dmesg, which all
indicated there was no problem.
I edited Makefile.debian and recompiled according to your suggestion,
but the problem persists.

Are there other suggestions you have as to the source of this problem?

well, you didn't mention doing a network or parallel computing
stress test. that is the next step that i would try.

the other main difference between your setup and mine seems
to be that you use MPICH as the MPI library while i am using OpenMPI,
but that should not make a difference (famous last words).

axel.

This is all new information to me, so forgive my continual questioning:

Firstly, the computer I’m using is not a cluster, but one computer with a bunch of RAM and two quad-cores. Should I still do the network or parallel computing test? If so, do I need to download something to perform such a test?

I’ll also look into the MPI situation.

Thanks again,

Brian

This is all new information to me, so forgive my continual
questioning:

Firstly, the computer I'm using is not a cluster, but one computer
with a bunch of RAM and two quad-cores. Should I still do the network

if you are not running across a network, then it is indeed
strange that you cannot run your input. i suggest you grab
your linux expert again, compile your code with debug info
included, enable core dumps, make the code crash, and then
produce a stack trace to find out where exactly the program
crashes. alternatively, a run under valgrind might help,
too, but that is a bit more complicated.
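a possible sequence, assuming gdb and valgrind are installed and using the binary/input names from earlier in this thread:

```shell
# 1) rebuild with debug symbols: add "-g" to the compiler flags in
#    src/MAKE/Makefile.debian, then recompile

# 2) allow core files to be written in this shell
ulimit -c unlimited

# 3) reproduce the crash; a "core" (or core.<pid>) file should appear
#    in the working directory
mpirun -np 2 ./lmp_debian < in.Test

# 4) print a stack trace from the core file
gdb -batch -ex bt ./lmp_debian core

# alternative: run a single task under valgrind's memory checker
valgrind ./lmp_debian < in.Test
```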

or parallel computing test? If so, do I need to download something to
perform such a test?

there are some MPI benchmarks, for example, that you can download
and compile to test your MPI installation.
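one possible choice is the OSU micro-benchmarks (downloaded and compiled separately), which include simple point-to-point tests:

```shell
# latency and bandwidth between two MPI ranks
mpirun -np 2 ./osu_latency
mpirun -np 2 ./osu_bw
```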

cheers,
   axel.

One more thing:
I’m starting to look at / better understand the neighbor, neigh_modify, and communicate commands.

How much impact might these have on this error? Or are they only useful for optimizing how fast the code runs?

I ask this because I saw this in the neigh_modify documentation:
IMPORTANT NOTE: LAMMPS can crash without an error message if the number of neighbors for a single particle is larger than the page setting, which means it is much, much larger than the one setting.

I’m using the default settings for neighbor and have:
neigh_modify every 1 delay 10 check yes page 500000 one 10000
I currently do not use the communicate command.

why "delay 10"? there is not much gain from setting this number that
high, and only the risk of getting dangerous builds. if you suspect
problems, setting "delay 0" is the way to go. i ran the input you
posted on a few machines and it ran just fine, so there is no reason
why it should not work.
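i.e., for the line from your input, a delay-0 version would read (only the delay value changes):

```
neigh_modify every 1 delay 0 check yes page 500000 one 10000
```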

cheers,
   axel.

I’ve been looking on the LAMMPS website for how to edit the makefile to include debug info, to no avail. I also can’t tell from just looking at the makefile what to do.

How would I do this? Is there a place on the LAMMPS website that talks about this?

Also, what do you mean by “core dump?”

Thanks for all of your help,

Brian