Problem running LAMMPS with SLURM

Hi LAMMPS users,

At CIEMAT we have a cluster with 8 nodes, each node having 1 K80 and 8 procs. I am having some problems running LAMMPS with SLURM. Instead of mpirun we use srun to launch LAMMPS. When I submit a job with, for instance, 8 procs, the job is not shared among the 8 procs but is launched 8 times on 1 proc.

The script I use to request 8 procs and 1 K80 (2 x K40) is the following:

#!/bin/bash

#SBATCH --job-name=LAMMPS_test
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --partition=gpu
#SBATCH --gres=gpu:kepler:2
#SBATCH --time=01:00:00

module load mvapich2 cuda

srun lammps -sf gpu -pk gpu 2 -in in.input

exit 0

Then I submit the job with sbatch myscript.sh.
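As a sanity check of the allocation itself, something like the following inside the same batch script (just a sketch) should confirm that SLURM really hands out 8 tasks:

srun hostname          # should print 8 lines, one per task
echo $SLURM_NTASKS     # should print 8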

Below you can see the beginning of the log file. Clearly, the job is repeated 8 times on 1 proc each, instead of running on a 2 by 2 by 2 MPI processor grid.

LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
  1 by 1 by 1 MPI processor grid

Any idea how to solve this?

Many thanks in advance.
Regards,
Christophe

Any idea how to solve this?

That means you have either compiled LAMMPS without MPI support, or you are using a different MPI library than the one you use for the parallel run.

Try a small MPI example program and see if you get it to work; if not, contact your local user support.
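For example, something along these lines (just a sketch; it assumes mpicc from the same mvapich2 module is on your PATH) should print one line per rank when launched the same way you launch LAMMPS:

cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);  /* a healthy 8-task run: size == 8 */
    MPI_Finalize();
    return 0;
}
EOF
mpicc hello_mpi.c -o hello_mpi
srun ./hello_mpi

If every task reports "rank 0 of 1", the launcher and the MPI library are not talking to each other, which is exactly the symptom you see with LAMMPS.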

axel.

Dear Axel,

Any idea how to solve this?

That means you have either compiled LAMMPS without MPI support, or you are using a different MPI library than the one you use for the parallel run.

I have compiled LAMMPS with MVAPICH2 and the GPU and Voronoi packages, as usual. When I run it with mpirun (on 1 node with 8 procs) instead of srun, there is no problem.
However, if I use mpirun on more than 2 nodes, I get a different error…
I will ask the administrator to check whether the library is the same.
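In the meantime, a couple of quick checks I can run myself (a sketch; it assumes the executable is named lammps and is on my PATH):

# which MPI library the binary was actually linked against
ldd $(which lammps) | grep -i mpi

# which MPI/PMI plugins this SLURM installation offers to srun
srun --mpi=list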

Try a small MPI example program and see if you get it to work; if not, contact your local user support.

Ok, thanks for the advice. I will try.

Christophe