Bug in LAMMPS timer?

I think I may have found a bug in the timer routine. I did some bug
hunting and it seems that some communication time may be mislabeled as
kspace time.

I have a system where I noticed the reported kspace time:

1. Doesn't scale well with number of processors
2. Depends on the cutoff radius (even though *everything* else is
held constant)

I ran an LJ system with kspace_style ewald/n on 1 processor and got
the following:

Loop time of 31.3689 on 1 procs for 100 steps with 10000 atoms
Pair  time (%) = 20.8518 (66.4728)
Kspce time (%) = 8.28053 (26.3973)
Neigh time (%) = 2.19111 (6.98499)
Comm  time (%) = 0.0381899 (0.121745)
Outpt time (%) = 0.00127912 (0.00407766)
Other time (%) = 0.00599742 (0.019119)

I then ran the same system with 2 processors and the pair time
decreased, but the kspace time stayed almost the same (which doesn't
make any sense):

Loop time of 20.758 on 2 procs for 100 steps with 10000 atoms
Pair  time (%) = 10.6957 (51.5259)
Kspce time (%) = 8.1981 (39.4937)
Neigh time (%) = 1.1067 (5.33144)
Comm  time (%) = 0.752884 (3.62695)
Outpt time (%) = 0.000813723 (0.00392004)
Other time (%) = 0.00376761 (0.0181501)

I then commented out the following line in ewald_n.cpp:

//MPI_Allreduce(cek_local, cek_global, 2*n, MPI_DOUBLE, MPI_SUM, world);

and I got

Loop time of 20.8964 on 2 procs for 100 steps with 10000 atoms
Pair  time (%) = 10.5278 (50.3811)
Kspce time (%) = 4.1784 (19.9958)
Neigh time (%) = 1.10506 (5.28826)
Comm  time (%) = 5.08053 (24.3129)
Outpt time (%) = 0.000882268 (0.0042221)
Other time (%) = 0.0037092 (0.0177504)

This makes more sense. This run took approximately the same total time
as the previous 2-processor run, but the communication time is higher
and the kspace time is lower.

I then ran the same system with 2 processors and "kspace_modify
compute no" and got:

Loop time of 15.4689 on 2 procs for 100 steps with 10000 atoms
Pair  time (%) = 10.5201 (68.0083)
Kspce time (%) = 0 (0)
Neigh time (%) = 1.10849 (7.16592)
Comm  time (%) = 3.83572 (24.7963)
Outpt time (%) = 0.000850201 (0.0054962)
Other time (%) = 0.00370133 (0.0239276)

The communication time is also higher than in the second run. A
similar trend is seen for 8 processors.

My test system is attached and I can provide more evidence if needed.

Stan

in.lj_cold (1.08 KB)

in.lj_hot (389 Bytes)

stan,

two general comments.

1) communication that is part of a force style
    is counted as part of that style and not as
    part of communication. see verlet.cpp

2) by default, the timings collected by LAMMPS
   are not synchronized and thus subject to errors
   when you have load imbalances. you can synchronize
   them by uncommenting the MPI_Barrier() call in
   Timer::stamp() of timer.cpp. the call you commented
   out is an implicit synchronization, forcing wait time
   to be counted in kspace that would otherwise be
   accounted for in communication. see verlet.cpp.
   synchronization makes timings more consistent,
   but also reduces performance.

in short, timings in LAMMPS can be off by quite a
bit, due to the way they are measured.

axel.

An additional comment on Axel's point (2).
There are several places in verlet.cpp where
accumulated time from one category can bleed
over into another category due to load imbalance
and the lack of synchronization. Communication is
one such common sink for timing, since procs
have to wait for others to catch up due to
the implicit syncing in the communication operation.
Hence load-imbalance time can be printed as
comm time.

Steve