Slow Speed When Utilizing Multiple Nodes

Dear All,

I am using the latest version of LAMMPS on the Vega cluster (CentOS). When I utilize a single node (16 processors), the simulations run smoothly with good speed, but they become very slow when I utilize 2 nodes (32 processors). My simulation box size is 80*80*60 angstrom^3.

I checked the number of iterations completed in 1 hour. Roughly it is around:

  1. Vega cluster (1 node) - 210000 [4 by 2 by 2 MPI processor grid]
  2. Vega cluster (2 nodes) - 48000 [4 by 4 by 2 MPI processor grid]

Can someone tell me what the problem is and how to resolve it?

not without knowing more about the hardware and whether it is properly
set up and operated.
for LAMMPS to show good parallel scaling, especially with many CPU
cores per node, you need a properly working low-latency, high-speed
interconnect and an MPI library that is properly set up for it.
also, it has to be guaranteed that you have exclusive access to the
reserved nodes and that there are no parasitic calculations running on
them, or leftovers from previous jobs whose reservation has expired.

axel.
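As a rough illustration of the exclusive-access point above, one could check the reserved nodes from within the SLURM allocation for stray compute processes before starting a run; this is only a sketch, and the exact srun options available depend on the site's SLURM configuration:

     # one task per node: print the hostname and the busiest processes
     srun --ntasks-per-node=1 bash -c 'hostname; ps -eo pcpu,user,comm --sort=-pcpu | head -5'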

Dear Sir,

Thanks for your reply. I communicated with the system administrator on our campus; he confirmed that each node has 2 sockets x 8 cores = 16 cores, all interconnected by a high-speed InfiniBand switch. He also confirmed that I have exclusive access to all the reserved nodes and that there are no leftovers from previous jobs.

The log file shows a 4 by 4 by 2 MPI processor grid when using 2 nodes (32 cores) and a 4 by 2 by 2 MPI processor grid for a single node (16 cores), after reading the data file. I submitted the job through the SLURM scheduler with the following command in the script file:

mpirun -np 32 -machinefile $MACHINEFILE /opt/Lammps/lammps-10Aug15/src/lmp_mpi < sample.in

I do not understand how the simulation speed can drop to roughly single-processor speed when using multiple nodes, while the simulations run smoothly on a single node.
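For comparison, here is a minimal sketch of a SLURM batch script that requests two full, exclusive nodes and starts one MPI rank per physical core; the job name, time limit, and use of $SLURM_NTASKS are assumptions, and with a SLURM-aware MPI library the scheduler's node list is usually picked up automatically, so an explicit machinefile is not needed:

     #!/bin/bash
     #SBATCH --job-name=lmp-scaling
     #SBATCH --nodes=2               # two full nodes
     #SBATCH --ntasks-per-node=16    # one MPI rank per physical core
     #SBATCH --exclusive             # no other jobs on these nodes
     #SBATCH --time=01:00:00

     # with a SLURM-aware MPI, the allocation is inherited automatically
     mpirun -np $SLURM_NTASKS /opt/Lammps/lammps-10Aug15/src/lmp_mpi < sample.in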

Maybe I missed this: how many atoms are in your system? Do you scale it up with the number of processors used? I could imagine that if the number of atoms per processor is too low, most of the time is spent communicating, no? You can probably see if this happens from the timings LAMMPS prints out at the end of a simulation.
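As a sketch of that check: the end of a LAMMPS log reports the total loop time and a breakdown of where the time went (the exact labels differ between LAMMPS versions), so after a short run on 1 node and on 2 nodes one can compare how large the communication share becomes; the default log file name log.lammps is assumed here:

     # show the loop time and the per-category timing breakdown
     grep -E "^(Loop time|Pair|Bond|Neigh|Comm|Outpt|Output|Modify|Other)" log.lammps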

you missed one point: whether the MPI library you are using for your
compilation of LAMMPS is actually making use of that fast
interconnect.
there also would be the question of whether the 16 cores per node are
real cores or whether the compute nodes have hyperthreading enabled.
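Two quick checks along these lines, as a sketch (this assumes Open MPI and the standard InfiniBand diagnostic tools are installed on the compute nodes; the commands available vary by site):

     # does the MPI build know about an InfiniBand transport? (Open MPI)
     ompi_info | grep -i btl

     # are the 16 cores physical cores, or 8 cores with 2 hardware threads each?
     lscpu | grep -E "^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\))"

     # is the InfiniBand port on the node actually up?
     ibstat | grep -i state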

in general, this is very difficult to debug from remote. you have to
work on this with your local HPC experts. rather than using your own
input, you would be better off using the benchmark inputs bundled with
LAMMPS and trying to reproduce the scaling benchmark data presented here:
http://lammps.sandia.gov/bench.html#lj

it is very unlikely that this is a LAMMPS issue, so it has to be
either a problem with your machine setup, your compilation/installation,
or your input. you can eliminate the last item by using the
inputs in the bench directory. the others are something that we cannot
know from remote.

axel.
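A minimal sketch of such a scaling test with the bundled Lennard-Jones benchmark (bench/in.lj in the LAMMPS source tree); the paths follow the installation directory used above and are assumptions, and the runs should be made inside the same SLURM reservation as the production jobs:

     cd /opt/Lammps/lammps-10Aug15/bench

     # 1 node, 16 cores
     mpirun -np 16 ../src/lmp_mpi < in.lj > log.1node

     # 2 nodes, 32 cores, using the same machinefile as before
     mpirun -np 32 -machinefile $MACHINEFILE ../src/lmp_mpi < in.lj > log.2node

     # compare the total wall time of the two runs
     grep "Loop time" log.1node log.2node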

Dear Sir,

My system contains 24000 atoms and I scale it up with the number of processors.

Right now I do not have satisfactory answers to the questions raised by Dr. Axel. I'll work with the benchmark inputs and update later.

Thanks for your valuable time.