MPI performance of LAMMPS with Open MPI 1.6.5

Dear all LAMMPS users,

Recently I installed LAMMPS (10Aug15) on our lab cluster with the default option (make mpi).

The spec of the cluster is as follows. It has three nodes, and each node consists of:

CPU : E5-2690v2 3.0GHz, 25MB, 8.0GT/s QPI, Turbo, 10C, 130W (x2)

RAM : 96GB (6 x 16GB) 1600 MHz, Low Volt, Dual Ranked RDIMMs

HDD : 250GB 7.2K RPM SATA 2.5" Hot Plug Hard Drive (x2)

Network : Broadcom 5720 Quad Port 1GbE NIC

OS : Centos 12.2.x86_64

In short, each node has 20 cores, so the total number of cores is 60 (3 x 20).

When I tested a simple N = 32000 LJ liquid simulation, I ran into a performance problem.

I tested with 20 and 40 cores using Open MPI 1.6.5.

The CPU time for the 40-core simulation was about 210 sec, while the 20-core run took only 27 sec.

In other words, the 40-core simulation is about 8 times slower than the 20-core one.

This seems strange to me.

When I check the benchmark for the N = 32000 LJ simulation at http://lammps.sandia.gov/bench/lj_cluster.html, the 40-core simulation is faster than the 20-core one.

Do I have to add more command options when submitting the job?

I would be grateful for any comments.

Thank you !

Regards,
Seongmin Jeong
Graduate Student

I am not sure if the timings there are for a scaled system or not. If the communication between nodes is slow, it can slow down the simulation considerably, especially if each node is not doing much work. What happens if you scale the system so that each node or core simulates about 32000 atoms?
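
For example, the in.lj script in the LAMMPS bench directory builds (if I remember correctly) a 20x20x20 fcc lattice with 32000 atoms and scales the box by the index variables x, y, and z, so a weak-scaling comparison could look something like this ('hosts' is a placeholder machine file; adjust to your setup):

  # 32000 atoms on 20 cores (1 node)
  mpirun -np 20 lmp_mpi -in in.lj

  # 64000 atoms on 40 cores (2 nodes), same atoms per core
  mpirun -np 40 -hostfile hosts lmp_mpi -var x 2 -in in.lj

If the time per step still rises sharply when the per-core load is constant, the extra cost is coming from inter-node communication.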

> OS : Centos 12.2.x86_64
CentOS 12??? They have only released up to version 7.x so far.

> In other words, the 40-core simulation is about 8 times slower than the
> 20-core one. This seems strange to me.

There is nothing strange here. The problem is that you have TCP/IP networking over gigabit ethernet, which has very high latencies (in the milliseconds). The computation in classical MD is fast, and after each step you need to communicate, so communication latency dominates parallel scaling. With a high-speed/low-latency interconnect, latencies are orders of magnitude smaller (single-digit microseconds), which is what you need for good performance and good parallel scaling. When you run on a single node only, there is a shortcut and you don't have to go through the network layer at all.

Also, as a rule of thumb, keep in mind that the latency requirements scale roughly with the number of cores per node: with 20 cores per node, the situation is about 20 times worse than with a single processor per node.
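
To put rough, purely illustrative numbers on it: with 32000 atoms on 40 cores, each MPI rank owns only 800 atoms, so the per-step force computation takes on the order of tens of microseconds, while each of the few exchanges needed per step can cost from roughly a hundred microseconds up to milliseconds over TCP on gigabit ethernet. In that regime the run spends most of its wall time waiting on the network, which is consistent with the ~8x slowdown you measured.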

To reduce the latency issues, you can use hybrid MPI+OpenMP, but it will in no way match the performance of a high-speed interconnect. (With 20 cores per node, you need to use MPI+OpenMP on those machines anyway to unleash their full potential: 2-4 threads per MPI task and, respectively, 10 or 5 MPI tasks per node.)
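
For example, assuming your LAMMPS binary was built with the USER-OMP package (make yes-user-omp in src/ before compiling; the default "make mpi" build does not include it), a hybrid launch could look roughly like this sketch. The process-binding flags vary between Open MPI versions, so check mpirun --help for yours:

  # 2 OpenMP threads per MPI task, 10 tasks per 20-core node, 3 nodes
  export OMP_NUM_THREADS=2
  mpirun -np 30 -npernode 10 --bind-to-socket lmp_mpi -sf omp -pk omp 2 -in in.lj

Here -sf omp selects the multi-threaded styles and -pk omp 2 sets 2 threads per MPI task.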

axel.