large "comm" and "other" time

Hi fellow LAMMPS users,
I have a general (hopefully not vague) question regarding the “comm” and “other” timings:
could a spatially uneven neighbor-list distribution affect the MPI communication and the overall performance?

Recently I’ve been running simulations where only a small, localized portion of the atoms is active and the rest remain “frozen”. Although the pair time is reduced, I see worse performance in the “comm” and “other” timings compared to the exact same input with all atoms’ interactions turned on. All the fixes and computes are the same except for one extra fix nve/noforce for the frozen part, which I think hardly makes any difference.

My suspicion is that the highly uneven distribution of neighbors makes the communication between procs a lot more cumbersome, but that does not quite explain the “other” time.
Any thoughts or suggestions?

Thanks!

> Hi fellow LAMMPS users,
> I have a general (hopefully not vague) question regarding the "comm" and
> "other" timings:
> could a spatially uneven neighbor-list distribution affect the MPI
> communication and the overall performance?
>
> Recently I've been running simulations where only a small, localized
> portion of the atoms is active and the rest remain "frozen". Although the
> pair time is reduced, I see worse performance in the "comm" and "other"
> timings compared to the exact same input with all atoms' interactions
> turned on. All the fixes and computes are the same except for one extra
> fix nve/noforce for the frozen part, which I think hardly makes any
> difference.

sorry, but that doesn't make too much sense.
the fix only changes how those atoms are time
integrated; it does *not* change the neighbor
lists. all interactions are *still* computed.
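
to illustrate (the region and group names below are made up, not taken
from your input), a setup like this still keeps the frozen atoms in the
neighbor lists and in the pair computation:

  # illustrative only: split atoms into an "active" and a "frozen" group
  region   hot sphere 0.0 0.0 0.0 10.0
  group    active region hot
  group    frozen subtract all active

  fix      1 active nve          # integrate the active atoms normally
  fix      2 frozen nve/noforce  # frozen: positions updated, velocities not
  # neither fix touches the neighbor list; pairs involving frozen atoms
  # are still built and computed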

> My suspicion is that the highly uneven distribution of neighbors makes the
> communication between procs a lot more cumbersome, but that does not quite
> explain the "other" time.
> Any thoughts or suggestions?

it is not always easy to tell where such differences come from.
but there could be a problem with your communication hardware, e.g. MPI
communication falling back from a low-latency, high-performance
interconnect to (high-latency) TCP/IP due to a failed link.

other than that, you could indeed have a load-balance problem, but that
would be due to the particle *distribution*, not due to ignoring some of
their interactions with fix nve/noforce. this can be addressed by
manually setting the processor grid and by using the load-balancing
framework (cf. the processors and balance commands).
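
for example (only a sketch; the dimension and the numbers are
placeholders that depend on your geometry and processor count):

  # divide the domain only along x, so sub-domain sizes can adapt to
  # where the active atoms are concentrated
  processors * 1 1

  # static rebalance before the run: shift cuts along x, up to 10
  # iterations, stop when within 10% of perfect balance
  balance    1.1 shift x 10 1.1

  # or rebalance periodically every 1000 steps during the run
  fix        lb all balance 1000 1.1 shift x 10 1.1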

in any case, it is difficult to discuss these issues without having a
(small but) representative example demonstrating the differences you see,
so that somebody can confirm this independently.

axel.

Hi Axel,
One thing I haven’t made clear is that I did exclude the neighbor-list calculation among the “frozen” atoms. The strange part is that the comparison run (all interactions turned on), with the same particle distribution, has much less “comm” and “other” time.
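
For reference, an exclusion like that is typically set up along these
lines (the group name here is only illustrative):

  # skip building neighbor pairs where both atoms are in the frozen group
  neigh_modify exclude group frozen frozen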

Do you think it would be necessary to specifically list all the fixes and computes to demonstrate this issue?

Thanks,

> Hi Axel,
> One thing I haven't made clear is that I did exclude the neighbor-list
> calculation among the "frozen" atoms. The strange part is that the
> comparison run (all interactions turned on), with the same particle
> distribution, has much less "comm" and "other" time.

not strange at all. the timer for "pair" is stopped right after that
part of the calculation is finished on each node individually.
however, the communication is a synchronization point.
if you have a load imbalance, e.g. through leaving out some
interactions that don't contribute, the processes will wait
while the comm timer is running.

> Do you think it would be necessary to specifically list all the fixes and
> computes to demonstrate this issue?

no. worse, i don't think that we have a good handle on this, since the
load-balancing algorithm is based entirely on particle counts. there is
currently no real detection of wait times that are due to load imbalance.
if you want a more accurate accounting of the time spent in the various
phases of an MD step, you'd have to add more synchronization points, but
that would make LAMMPS run less efficiently.
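
as a side note: if your LAMMPS version already has the timer command, it
can add exactly those synchronization points, at the cost of extra MPI
barriers, e.g.

  # synchronize MPI ranks before each timed section so waiting time is
  # attributed to the phase that causes it
  timer sync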

axel.