Hi LAMMPS users,
At CIEMAT we have a cluster with 8 nodes, each node having 1 K80 and 8 procs. I am having problems running LAMMPS under SLURM. Instead of mpirun we use srun to launch LAMMPS. When I submit a job with, for instance, 8 procs, the job is not distributed across the 8 procs; instead, 8 independent copies are launched, each on 1 proc.
The script I use to request 8 procs and 1 K80 (a dual-GPU board, so 2 GPUs) is the following:
#!/bin/bash
#SBATCH --job-name=LAMMPS_test
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --partition=gpu
#SBATCH --gres=gpu:kepler:2
#SBATCH --time=01:00:00
module load mvapich2 cuda
srun lammps -sf gpu -pk gpu 2 -in in.input
exit 0
Then I submit the job with sbatch myscript.sh.
Below is the beginning of the log file. Clearly, the job is run 8 times, each copy on 1 proc, instead of once on a 2 by 2 by 2 MPI processor grid.
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
LAMMPS (14 May 2016)
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
Lattice spacing in x,y,z = 2.8553 2.8553 2.8553
Created orthogonal box = (0 0 0) to (182.739 182.739 182.739)
1 by 1 by 1 MPI processor grid
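The symptom can also be checked mechanically. Below is a minimal sketch, assuming the job output has been saved to a file; the function name `count_serial_grids` and the file name in the usage comment are made up for illustration and are not part of LAMMPS or SLURM. For the log above it should report 8, whereas a correctly distributed run would be expected to print a single grid line (such as 2 by 2 by 2) and none of these.

```shell
#!/bin/sh
# Hypothetical helper: count the log lines that report a
# "1 by 1 by 1 MPI processor grid", i.e. a single-process LAMMPS run.
count_serial_grids() {
    # The (^|[^0-9]) prefix avoids matching grids such as "11 by 1 by 1".
    # grep -c prints the number of matching lines; "|| true" keeps the
    # function from failing when nothing matches (grep then exits with 1).
    grep -c -E '(^|[^0-9])1 by 1 by 1 MPI processor grid' "$1" || true
}

# Usage (the file name is an assumption; use your own job output file):
# count_serial_grids slurm-12345.out
```

A count equal to the number of requested tasks suggests that each task started its own independent copy of LAMMPS rather than joining one MPI run.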
Any idea how to solve this?
Many thanks in advance.
Saludos,
Christophe