[lammps-users] parallel job error

_zhenlong_li · May 2, 2008, 7:06pm

Dear all,

I have a problem with my parallel LAMMPS jobs on our cluster. Typically,
a parallel job runs for several minutes (less than half an hour) and then aborts
with p4_error messages like the ones below:

p4_17484: p4_error: : 1
rm_l_4_17486: (1.308594) net_send: could not write to fd=5, errno = 32
p3_28430: p4_error: : 1
rm_l_3_28432: (1.308594) net_send: could not write to fd=5, errno = 32
p4_17484: (1.308594) net_send: could not write to fd=5, errno = 32
p6_17916: p4_error: interrupt SIGx: 13
p0_14701: p4_error: interrupt SIGx: 13
p3_28430: (9.324219) net_send: could not write to fd=5, errno = 32
p6_17916: (11.328125) net_send: could not write to fd=5, errno = 32

For this job, I used 8 processors:
#PBS -l nodes=4:ppn=2

There are 81000 particles overall in the simulation box. Previously,
all parallel jobs for a similar, smaller system of 24000 particles ran smoothly
on the same cluster. So I suspect the error is related to the size of the
system, perhaps through failed communication between different CPUs.

Any comments on possible reasons for this kind of failure are appreciated.

Our cluster contains mainly Intel Pentium 4 Xeon processors and
Red Hat Enterprise Linux runs on all the nodes. Let me know if you need any
other information.

Thanks!

Zhenlong

akohlmey · May 2, 2008, 7:34pm

> Dear all,
>
> There is a problem with my parallel LAMMPS jobs on our cluster. Typically,
> the parallel job runs for several minutes (within half an hour) and then ceases
> with p4_error as below:

this looks like you have a broken machine in that cluster.

that can have many causes, but the messages indicate that
your job stops communicating. that can be because the job on
that specific machine crashed or because the network hardware
has a failure. this is definitely not a LAMMPS problem. you
have to check your machines.
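
The numbers in the log support this reading. On Linux, errno 32 is EPIPE (broken pipe) and signal 13 is SIGPIPE, which is what a process gets when it writes to a socket whose other end has already gone away. A minimal Python sketch for decoding them, assuming it is run on a Linux machine like the cluster nodes (the numbers are platform-specific); the `explain` helper and the shortened log excerpt are only for illustration:

```python
import errno
import os
import re
import signal

# Three lines copied from the error output in the original post.
LOG = """\
p4_17484: p4_error: : 1
rm_l_4_17486: (1.308594) net_send: could not write to fd=5, errno = 32
p6_17916: p4_error: interrupt SIGx: 13
"""


def explain(line):
    """Return a readable name for the errno or signal number in a log line, or None."""
    m = re.search(r"errno = (\d+)", line)
    if m:
        code = int(m.group(1))
        name = errno.errorcode.get(code, "unknown")
        return f"errno {code}: {name} ({os.strerror(code)})"
    m = re.search(r"interrupt SIGx: (\d+)", line)
    if m:
        num = int(m.group(1))
        try:
            name = signal.Signals(num).name
        except ValueError:
            name = "unknown"
        return f"signal {num}: {name}"
    return None


for line in LOG.splitlines():
    note = explain(line)
    if note:
        # The text before the first colon is the process tag, e.g. "p6_17916".
        print(line.split(":")[0], "->", note)
```

Both codes only say that a peer disappeared. They do not say which node failed first or why, so the per-node checks are still needed.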

cheers,
axel.