Problems running an instance of LAMMPS (built as a library) with MPI

Hello,

I have LAMMPS built as a library on a cluster that uses SLURM as the scheduling system. I wrote a Python script that uses LAMMPS together with mpi4py. The script works perfectly on my Ubuntu laptop. When I try to run the same script on the cluster with srun, however, I get the following message (in the case below, with only two procs):

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(674).................:
MPID_Init(195)........................: channel initialization failed
MPIDI_CH3_Init(65)....................:
MPID_nem_init_ckpt(766)...............:
MPIDI_CH3I_Seg_commit(372)............:
MPIU_SHMW_Hnd_deserialize(328)........:
MPIU_SHMW_Seg_open(904)...............:
MPIU_SHMW_Seg_create_attach_templ(660): open failed - No such file or directory
srun: error: n022: task 1: Exited with exit code 1
srun: First task exited 30s ago
srun: task 0: running
srun: task 1: exited abnormally
srun: Terminating job step 120283.0
slurmstepd: error: *** STEP 120283.0 CANCELLED AT 2018-01-26T18:14:22 ***
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: n022: task 0: Killed
srun: error: Timed out waiting for job step to complete
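
For reference, the job step is launched essentially like this (the script name here is just a placeholder for my actual script):

srun -n 2 python test_lammps.py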

From this I understand that the script works fine on the first node, but it crashes on the second.

After commenting out sections of my Python code to isolate the source of the problem, I was left with only these two lines:

from lammps import lammps

lmp=lammps()

and I see that it crashes at the lmp=lammps() call. Has anyone run into the same problem in a similar situation? LAMMPS compiled as a standalone executable (i.e. without mode=shlib) works normally when I launch it with srun. Any hint or help is more than welcome.
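
For completeness, here is a minimal sketch of the same test that initializes MPI through mpi4py first and passes the communicator to LAMMPS explicitly; as far as I can tell the lammps() constructor accepts a comm argument for this, but I have not verified whether it changes the behaviour in my case:

from mpi4py import MPI       # importing mpi4py initializes MPI
from lammps import lammps

comm = MPI.COMM_WORLD        # communicator spanning all srun tasks
lmp = lammps(comm=comm)      # create the LAMMPS instance on that communicator
lmp.close()                  # shut the LAMMPS instance down cleanly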

Thanks in advance,

Roberto


This seems to be coming from deep within the MPI installation on your machine. It could simply mean that one of the nodes the Python version was running on has a hardware issue, and that when you ran the standalone version you happened to run on a different set of nodes. I would suggest contacting your local cluster sysadmins to help investigate.

Axel.