Question about LAMMPS run in parallel - Comm time

Hi, dear LAMMPS users,

My general question is: can LAMMPS sometimes take even longer when run in parallel?

Here is my case: I have a 40*40*40 simulation box filled with polyethylene (CH_2 units); the total number of C and H atoms is 6646. I am running the script on an HPC cluster that has 164 nodes, each with 2 hexa-core processors. After defining the potentials and so on, here is my script:

velocity all create 10.0 78456
fix 1 all npt temp 10.0 600.0 100 x 10 10 1000 y 10 10 1000
fix 2 all wall/reflect zlo EDGE zhi EDGE
thermo_style custom step press pxx pyy pzz temp pe c_1
thermo 5000
timestep 0.5
run 100000

When I used just one node with 12 processors, the timing output was as below (about 13 minutes):

Loop time of 798.338 on 12 procs for 100000 steps with 6646 atoms
Pair  time (%) = 417.286 (52.2694)
Bond  time (%) = 188.789 (23.6478)
Neigh time (%) = 14.5956 (1.82825)
Comm  time (%) = 165.235 (20.6973)
Outpt time (%) = 0.031987 (0.0040067)
Other time (%) = 12.4002 (1.55326)

When I used 4 nodes with 48 processors, the output was (about 3 hours):

Loop time of 11587.1 on 48 procs for 100000 steps with 6646 atoms
Pair  time (%) = 108.273 (0.934429)
Bond  time (%) = 47.5267 (0.410169)
Neigh time (%) = 3.6316 (0.0313417)
Comm  time (%) = 7654.78 (66.0629)
Outpt time (%) = 0.0280074 (0.000241712)
Other time (%) = 3772.87 (32.5609)

Then I ran it again and got this (about 1.5 hours):

Loop time of 5223.37 on 48 procs for 100000 steps with 6646 atoms
Pair  time (%) = 108.588 (2.07888)
Bond  time (%) = 47.3986 (0.907433)
Neigh time (%) = 3.63351 (0.0695625)
Comm  time (%) = 2946.81 (56.4158)
Outpt time (%) = 0.0278377 (0.000532945)
Other time (%) = 2116.92 (40.5278)

As you can see, the 4-node runs used only about a quarter of the time on the Pair, Bond, and Neigh terms compared with the 1-node run, but the Comm time increased dramatically and varies between runs. I have attached the output file from each of these cases in case you need more information. So my questions are:

(1) Is this increase in Comm time normal when LAMMPS runs in parallel?

(2) What could be the cause of this problem? Could it be that the HPC is not set up properly?

(3) How can I debug or solve it?

Any advice would be greatly appreciated.

Yueqi

1node.txt (3.88 KB)

4node1.txt (4.2 KB)

4node2.txt (4.18 KB)


(1) Is this increase in Comm time normal when LAMMPS runs in parallel?

Depends on what kind of HPC system you have.

(2) What could be the cause of this problem? Could it be that the HPC is not set up properly?

At 6646 atoms, you have only 554 "owned" atoms per MPI task even on a single node (6646 / 12 ≈ 554). Even assuming perfect load distribution, this is not a lot. It is unlikely that you can scale this much farther unless you have a very, *very* good interconnect. Your multi-node performance hints that you don't have that, or that your MPI compilation does not support it.

(3) How can I debug or solve it?

Look at the hardware specs of your cluster. If it has "only" gigabit ethernet, you have no chance. If you compiled your own MPI library, you may need to compile it with support for the faster interconnect, if available, or use the system-provided MPI instead. But even then, I would not expect much speedup.
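
For example, with an Open MPI build (this is only a sketch; the exact output format differs between versions) one can check whether the InfiniBand byte-transfer layer was compiled in:

ompi_info | grep -i btl    # look for an "openib" entry; if it is missing, that MPI build has no InfiniBand support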

You may see some speedup using USER-OMP with two or three OpenMP threads instead of going all-MPI, but you have to know how to do this, and ideally you also need to set processor affinity properly. In general, for a simple atomic potential, it is unlikely that you will see much scaling once you are down to about 500 atoms per CPU core, unless you can run inside a single node.
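
A minimal sketch of such a hybrid MPI+OpenMP launch, assuming a LAMMPS binary built with USER-OMP and launched with Open MPI; the binary name, input file name, and binding flags are placeholders that depend on your build and MPI version, and older LAMMPS versions may also need a "package omp" line near the top of the input script:

export OMP_NUM_THREADS=3    # 3 OpenMP threads per MPI task
mpirun -np 16 -npernode 4 --bind-to-core lmp_openmpi -sf omp -in in.polyethylene    # 4 tasks/node * 3 threads = 12 cores/node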

axel.

Hi, Axel,

First, I will skip the USER-OMP option for now.

I don't understand what you mean by "only" gigabit; does it mean 1 Gbps?
I checked the HPC spec, and the interconnect is InfiniBand, 40 Gbps.
(1) So with this interconnect, can I get a better speedup?

The HPC spec file can be found here:
http://www.cereo.wsu.edu/bioearth/docs/meetings/20110620_WSU_HPC_details.pdf
And we have the following MPI modules available on the HPC:
mpich2/1.3.2p1-intel
openmpi/1.4.3_intel
openmpi/1.4.5_intel-sp1.9.293
openmpi/1.6_intel-sp1.9.923
openmpi/1.6_intel_11.1.075
(2) With the above spec, which MPI would you recommend?

(3) Do you suggest that I just use one node, since I should not expect much speedup at ~500 atoms per core? I actually run larger systems with 100000 atoms, so how many atoms per core is most efficient in a parallel run?

Thank you in advance.
Yueqi

First, I will skip the USER-OMP option for now.

Your choice. You may still use the USER-OMP code without OpenMP enabled, since it includes some general optimizations similar to, and in some cases beyond, the OPT package.

I don't understand what you mean by "only" gigabit; does it mean 1 Gbps? I checked the HPC spec, and the interconnect is InfiniBand, 40 Gbps. (1) So with this interconnect, can I get a better speedup?

I don't know; this is too machine-specific. I don't know your HPC support people, and I don't know how LAMMPS was compiled. There are a multitude of possible reasons, and there is little that one can do to debug this via e-mail.
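
One thing that can be checked from the user side, assuming Open MPI, is whether the InfiniBand transport is actually selected at run time. Restricting the transports on the command line makes the job fail instead of silently falling back to TCP (the binary and input names below are placeholders):

mpirun -np 48 --mca btl openib,sm,self lmp_openmpi -in in.polyethylene    # allow only InfiniBand, shared-memory, and self transports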

(2) With the above spec, which MPI would you recommend?

Talk to your HPC people; they have to know.

(3) Do you suggest that I just use one node, since I should not expect much speedup at ~500 atoms per core? I actually run larger systems with 100000 atoms, so how many atoms per core is most efficient in a parallel run?

Again, it depends on too many things. That is what benchmarks, and knowing what you are doing, are needed for. It is impossible to recommend from remote. Even on the machine that I manage, the best choice depends on the individual hardware of the compute nodes.

If you want the most efficient setup, you have to learn about *all* the details and have good insight into the hardware. With today's hardware, little things can result in big differences. The times when a simple recommendation was valid are long gone (more than 10 years at least).
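
As a rough illustration of such a benchmark, one can run a shortened version of the same input on an increasing number of cores and compare the reported Loop time; the binary and input names here are placeholders:

for n in 12 24 48 96; do
    mpirun -np $n lmp_openmpi -in in.benchmark -log log.$n    # in.benchmark: a short run, e.g. a few thousand steps
    grep "Loop time" log.$n                                   # compare measured Loop time against ideal scaling
done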

axel.