[lammps-users] Question about the communication time

Dear Lammps users,

I found the following for the same system.

If I use

pair_style lj/cut/coul/long 12.0
kspace_style pppm 1.0e-4
kspace_modify order 4

The comm time (%) = 82.32 (1.58896)

If I just use
pair_style lj/cut/coul/cut 12.0

The comm time (%) = 3276.9 (73.0941)

I am wondering why the comm time changes so dramatically. If this is the case, shifting from pppm to the direct pairwise calculation cannot save much time.

Thank you very much for any advice. Have a great weekend.

Best
Yajie

these numbers don't make any sense unless something
else has happened outside of what lammps is doing.

have both calculations been run under _exactly_
the same (reproducible) boundary conditions?

do you have exclusive access to the compute nodes?

what kind of interconnect do you have?

axel.

Hi Axel,

Thanks for your reply.

Yes, exactly the same PBC, simulation box, system, and computing processors. No, I cannot access the compute nodes. And I am not sure what you mean by the interconnect.

Could you please give me more information about this? I can check further with the person in charge of the machine. What are the main factors that influence the communication time? Thanks.

Best
Yajie

> Yes, exactly the same PBC, simulation box, system, and computing processors.
> No, I cannot access the compute nodes.

that is not what my question was about. rather, i wanted to know
if your job is the only calculation running on those nodes.
one way to get this kind of timing is when some _other_
job or process is still consuming bandwidth or cpu time.

> And I am not sure what you mean by the interconnect.

what kind of network do you use for parallel computing
(gigabit ethernet, myrinet, infiniband, shared memory)?

overall, this looks like an issue with your hardware
(perhaps one of the nodes in the cutoff-only calculation
has a defective network card and your MPI library has
dropped you to a slower communication mode).
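
one quick check, in case the osu micro-benchmarks happen to be
installed on your cluster (the node names and the openmpi-style
mpirun call below are only placeholders for your setup), is to
measure the point-to-point bandwidth between pairs of nodes and
look for a link that is much slower than the others:

# bandwidth between two specific compute nodes; repeat for other node pairs
mpirun -np 2 -host node01,node02 osu_bw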

> Could you please give me more information about this? I can check further
> with the person in charge of the machine. What are the main factors that
> influence the communication time? Thanks.

there are many details that can matter, many of which depend
on how the machine you are using is set up,
maintained and operated. others relate to how you have
compiled and use lammps.

without any knowledge of the details, it is very difficult
to track these things down. the best way is to first talk
with your local support staff and (re-)run some more
tests to see if the timings you are getting are reproducible
and then try to track down the cause.

a simple test that you can do up front would be a
strong scaling benchmark, to identify at which node
count this communication time increase happens.
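
a minimal sketch of such a test (the executable name, input file, and
rank counts below are only placeholders for your setup) is to run the
identical input at increasing processor counts and see where the
communication fraction blows up:

# same input deck, increasing number of MPI ranks
for n in 1 2 4 8 16 32; do
  mpirun -np $n lmp_mpi -in in.system -log log.scaling.$n
done
# compare the "... time (%)" breakdown across the resulting logs
grep "time (%)" log.scaling.*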

cheers,
    axel.

Hi Axel,

Thank you very much for your information. Yes, my job is the only one running on those nodes. I will test more according to your suggestion.

Have a great weekend.

Best
Yajie

If FFTs are very inefficient on your box, it's possible they
slow things up and the resulting timings on different processors
can be skewed due to some processors waiting for others
to finish. Are the total timings consistent with a short-range
vs long-range calculation? If so, I wouldn't worry about the
timing breakdown.
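
A quick way to make that comparison (the log file names below are just
placeholders) is to pull the total loop time and the timing breakdown
out of the two runs side by side:

# total wall time and per-category breakdown for the two runs
grep "Loop time" log.coul_long log.coul_cut
grep "time (%)" log.coul_long log.coul_cut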

Steve

Hi Steve,

Thanks a lot for your reply.

For both the total timing and "Pair time", the direct short-range lj/cut/coul/cut is only about 20% faster than the long-range lj/cut/coul/long. Do you think that's OK? I thought the short-range summation should have been much faster, because it would not do the k-space calculation. But the short-range summation spends dramatically more time on "Comm time" than the long-range one.

If FFTs are not inefficient, is there any way to make it better in LAMMPS?

I would greatly appreciate your suggestions.

Best
Yajie

yajie,

> For both the total timing and "Pair time", the direct short-range
> lj/cut/coul/cut is only about 20% faster than the long-range
> lj/cut/coul/long. Do you think that's OK? I thought the short-range summation

that may or may not be reasonable. what fraction of time is spent
on the kspace part is _very_ dependent on the individual simulation
system, the number and type of nodes/processors/network that you
use and how well that machine is managed or operated.

> should have been much faster, because it would not do the k-space
> calculation. But the short-range summation spends dramatically more
> time on "Comm time" than the long-range one.

that may also be an indication of a load balance issue. the way
timings are accumulated in lammps assumes that the load is
balanced evenly between nodes. to get more accurate timings,
you have to add synchronizations to the timing code, which will
hinder parallel performance.

> If FFTs are not inefficient, is there any way to make it better in LAMMPS?

it is not the FFTs by themselves that are the issue, but the communication
associated with doing them in 3d and in parallel. i am certain that
steve misread your original post, so his comment doesn't really
apply. please note the double negative in your statement. ;-)

> I would greatly appreciate your suggestions.

before we can have a meaningful discussion of what is going on,
you need to provide more detail about the hardware you are running
on, how you are running your job and what kind of calculation you
are running. what might also be good would be a comparison with
calculations on other machines. if you don't want to make your input
available, there are a number of benchmark inputs available
with published performance data. you should run those on your machine
and compare the performance numbers.

http://lammps.sandia.gov/bench.html
http://sites.google.com/site/akohlmey/software/lammps-benchmarks
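
for example, with two of the standard inputs from the bench directory of
the lammps distribution (the binary name and rank count below are only
placeholders; in.rhodo exercises pppm, while in.lj is a cutoff-only run):

# run the standard benchmarks on the same nodes and compare the resulting
# timings with the published numbers for similar hardware
mpirun -np 8 lmp_mpi < in.lj    > out.lj
mpirun -np 8 lmp_mpi < in.rhodo > out.rhodo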

cheers,
    axel.

Hi Axel,

Thanks a lot for your comments. I will check more.

Best
Yajie