MPI malfunction

Hi LAMMPS users,

I’ve been using LAMMPS on a cluster that runs SLURM on Linux, and MPI parallel runs worked fine until the cluster went down for maintenance (a software update). Since the maintenance, jobs no longer run in parallel: I assign several nodes to a job, but each MPI task starts its own run on a 1 by 1 by 1 MPI processor grid, and they write duplicate output to the same file (see below).

working directory = /wrk/lnl5/vis_nemd/273K_125mol/2.0e-7
SLURM_SUBMIT_HOST = hercules.hpc.nist.gov
SLURM_JOBID=153461
SLURM_JOB_NODELIST=h[310-311]
SLURM_NNODES=2
SLURM_NTASKS=16
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
LAMMPS (12 Dec 2018)
using 2 OpenMP thread(s) per MPI task
Reading data file …
Reading data file …
Reading data file …
Reading data file …
Reading data file …
Reading data file …
Reading data file …
Reading data file …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …
triclinic box = (15.043 15.043 15.043) to (64.957 64.957 64.957) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms …

The LAMMPS version is 12 Dec 2018, and I load the mpi/openmpi-x86_64 and intel modules for the run. I also tried other MPI modules (mpich-x86_64, mpich2-x86_64, mvapich-x86_64), but the job still does not run in parallel. Enclosed are the SLURM output (in this case I only ran a very few steps) and the job submission script for your reference.

I’ve tried quite a few possible solutions I found on the Internet, but none of them worked. I hope someone here can give me some suggestions.

Thanks,
Lingnan

slurm-153461.out (37.9 KB)

sbjob1.sh (531 Bytes)

Greetings,

You should talk to a system administrator or representative for your supercomputer to help you with that. My two cents, however: you probably have an issue with using mpiexec instead of srun (or vice versa), and there may also be build options the system requires of you. Nobody here can typically help with issues specific to MPI or your system’s operation; that is something you need to ask your system admin about and hope they can help :slightly_smiling_face:
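
For illustration, the two launch styles look roughly like this in a batch script (the binary name lmp_mpi and the input file name are placeholders, not necessarily what your system provides):

    # launch via SLURM's own launcher
    # (some sites need an --mpi=pmi2 or --mpi=pmix option here)
    srun lmp_mpi -in in.lammps

    # ... or via the MPI runtime's launcher
    mpirun -np $SLURM_NTASKS lmp_mpi -in in.lammps

    # quick sanity check: the launcher and the MPI library the binary
    # links against should come from the same MPI installation
    mpirun --version
    ldd $(which lmp_mpi) | grep -i mpi

Whichever launcher the site supports, pairing an MPI runtime with a binary built against a different MPI is a classic way to end up with N independent serial runs, which matches the duplicated 1 by 1 by 1 output you are seeing.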

Sincerely,
Adrian Diaz

Hi Adrian,

Thanks for your suggestion. I worked with the system operations team and solved the problem by rebuilding LAMMPS (5 Jun 2019) together with a new OpenMPI runtime (v4.0.1). I now use ‘mpirun’ to launch LAMMPS.
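
In case it helps anyone else, the launch line is essentially of the form (the binary and input file names here are just placeholders):

    mpirun -np $SLURM_NTASKS lmp -in in.lammps

A correctly launched parallel run prints a single LAMMPS banner and a line such as “4 by 2 by 2 MPI processor grid” (the exact decomposition depends on the box and the task count), rather than many copies of “1 by 1 by 1” as in the log above.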

Best,
Lingnan