[lammps-users] MPI error when running LAMMPS


Hi everyone,
I installed LAMMPS on my cluster (built with Rocks Cluster) and also installed MPICH2; I use NFS to share the LAMMPS and MPICH2 files at /share/apps.
But when I run a job with LAMMPS, an error occurs at step 3700. It happens at step 3700 every time.
The cluster has 4 nodes; each node has four cores, 12 GB of memory, and a 250 GB hard disk.
The errors are the following:
Step Temp con_T newton sample PotEng TotEng pe_44
3593 255.61938 292.65297 293.03206 293 -5312368.3 -5271997 -4850306.9
3600 255.75621 294.19217 293.0624 293.15684 -5312400.6 -5272007.7 -4850339.6
3650 255.92391 292.6931 293.40939 293.34906 -5312306 -5271886.6 -4850242
3700 255.17754 291.62481 292.57333 292.49354 -5312281.6 -5271980.1 -4850218.2
rank 18 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
exit status of rank 18: killed by signal 11
rank 17 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
exit status of rank 17: killed by signal 9
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fffd6c03a5c, status0x7fffd6c03a40) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(174)…: MPI_Send(buf=0x2ad9c9799010, count=8259, MPI_DOUBLE, dest=9, tag=0, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fffdf75460c, status0x7fffdf7545f0) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(174)…: MPI_Send(buf=0x2b3051520010, count=8274, MPI_DOUBLE, dest=10, tag=0, MPI_COMM_WORLD) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fff88640f4c, status0x7fff88640f30) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fff840f699c, status0x7fff840f6980) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fff586a8afc, status0x7fff586a8ae0) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fffe1d9be1c, status0x7fffe1d9be00) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fff4212bbfc, status0x7fff4212bbe0) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fff268eeb3c, status0x7fff268eeb20) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fff50d16fdc, status0x7fff50d16fc0) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
Fatal error in MPI_Wait: Other MPI error, error stack:
MPI_Wait(157)…: MPI_Wait(request=0x7fffab413c5c, status0x7fffab413c40) failed
MPIDI_CH3I_Progress(150)…:
MPID_nem_mpich2_blocking_recv(948):
MPID_nem_tcp_connpoll(1720)…:
state_commrdy_handler(1556)…:
MPID_nem_tcp_recv_handler(1446)…: socket closed
rank 47 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
exit status of rank 47: return code 1
rank 45 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
exit status of rank 45: return code 1
rank 61 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
exit status of rank 61: return code 1
rank 60 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
exit status of rank 60: return code 1
rank 12 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
exit status of rank 12: killed by signal 9

How can I solve this problem?
Another problem is:
I can't use "mpdboot -n X -f mpd.hosts" to start the mpd ring; the error is:
[lammps@…2130… ~]$ mpdboot -n X -f mpd.hosts
mpdboot_lammps.hpc.org (handle_mpd_output 415): failed to connect to mpd on lammps

but I can use:
mpd &
mpdtrace -l (on the frontend)
mpd -h lammps.hpc.org -p XXXX & (on each compute node)
to start the daemons. Why does mpdboot fail, and how can I solve this problem?
Thank you!

I don't know - I would print out thermo output every step and see if the
system is behaving badly near the point where it crashes. Can you run
the same problem on 1 proc, or 2 procs?

Steve
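
A minimal sketch of what that could look like, assuming a generic input script name (in.lammps) and executable name (lmp_mpi) that are not taken from this thread:

    # in the LAMMPS input script: print thermo output every single timestep
    thermo 1

    # rerun the same problem on 1 and on 2 processors and compare
    mpiexec -n 1 ./lmp_mpi -in in.lammps
    mpiexec -n 2 ./lmp_mpi -in in.lammps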

Also, can you save a restart file right before
the crash (e.g. at step 3593 or closer), restart
from it, and see whether it still crashes in (nearly) the
same place?

Steve
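
A minimal sketch of that workflow, with hypothetical file names and an example step number:

    # in the original input script: write a restart file every 10 steps
    # as the run approaches the crash
    restart 10 tmp.restart.*

    # in a new input script: read back a restart written just before the
    # failure (3690 is only an example), redefine the fixes, computes, and
    # thermo settings as in the original script, and run past step 3700
    read_restart tmp.restart.3690
    run 100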

2010/10/1 Steve Plimpton <[email protected]>

Hi everyone,
I installed LAMMPS on my cluster (built with Rocks Cluster) and also installed MPICH2; I use NFS to share the LAMMPS and MPICH2 files at /share/apps.
But when I run a job with LAMMPS, an error occurs at step 3700. It happens at step 3700 every time.
The cluster has 4 nodes; each node has four cores, 12 GB of memory, and a 250 GB hard disk.
The errors are the following:
Step Temp con_T newton sample PotEng TotEng pe_44
    3593 255.61938 292.65297 293.03206 293 -5312368.3 -5271997 -4850306.9
    3600 255.75621 294.19217 293.0624 293.15684 -5312400.6 -5272007.7 -4850339.6
    3650 255.92391 292.6931 293.40939 293.34906 -5312306 -5271886.6 -4850242
    3700 255.17754 291.62481 292.57333 292.49354 -5312281.6 -5271980.1 -4850218.2
rank 18 in job 3 lammps.hpc.org_60253 caused collective abort of all ranks
  exit status of rank 18: killed by signal 11

Signal 11 is a segmentation fault. That typically happens
when your simulation goes badly wrong.
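
In addition to per-step thermo output, here is a sketch of one way to watch for a blow-up before step 3700; the output interval, dump file name, and choice of thermo keywords are arbitrary:

    # monitor the maximum per-atom force component and the pressure,
    # which usually explode when the dynamics go bad
    thermo_style custom step temp press fmax etotal

    # dump coordinates often enough that the last frames before the
    # crash can be inspected
    dump watch all atom 10 dump.before_crash.lammpstrj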

[...]

Another problem is:
I can't use "mpdboot -n X -f mpd.hosts" to start the mpd ring; the error is:
[[email protected]... ~]$ mpdboot -n X -f mpd.hosts
mpdboot_lammps.hpc.org (handle_mpd_output 415): failed to connect to mpd on lammps

This is an MPICH question and should be sent to the
corresponding mailing list(s).

axel.
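
For anyone who hits the same mpdboot failure, a rough sketch of the usual MPICH2 mpd setup; the compute-node hostnames and the secret word are placeholders, and the exact configuration should be checked against the MPICH2 installer's guide:

    # mpd.hosts on the frontend: one compute-node hostname per line
    compute-0-0
    compute-0-1
    compute-0-2

    # every node needs a ~/.mpd.conf that only the user can read and write;
    # depending on the MPICH2 version the key is MPD_SECRETWORD= or secretword=
    echo "MPD_SECRETWORD=change_me" > ~/.mpd.conf
    chmod 600 ~/.mpd.conf

    # start one mpd on the local host plus one per host in mpd.hosts,
    # then verify the ring
    mpdboot -n 4 -f mpd.hosts
    mpdtrace -l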