Questions on communication implementation details in LAMMPS

Dear LAMMPS Developers and Users,
While going through the LAMMPS source code (version 10 Aug 2015), a few things caught my attention. My questions about them are as follows:

  • The communication steps in LAMMPS seem to use the standard communication method (as seen in the functions in comm_brick.cpp), where each processor communicates with all 6 of its stencil neighbors. LAMMPS does not seem to implement neutral territory methods (http://people.csail.mit.edu/rondror/papers/bowers_SciDAC_05.pdf) as implemented by most MD packages. Some of these methods not only reduce communication to only 3 stencil neighbors (as in the eighth shell method) and improve load balancing, but have also been shown to scale asymptotically better than the standard communication model. Is there a reason for this, or am I missing something?
  • In the same file (comm_brick.cpp), the comments for the function CommBrick::exchange state:
    " atoms will be lost if not inside a stencil proc’s box
    can happen if atom moves outside of non-periodic bounary
    or if atom moves more than one proc away "
    Why can’t we account for atoms that have moved more than 1 processor away, maybe by using multiple communication steps, similar to what is done in the CommBrick::forward_comm function?
  • Most MD packages fail to scale to higher core counts because communication becomes a bottleneck, and my runs of the LAMMPS benchmarks show the same behavior (in both the strong and weak scaling sense). The best automatic topology-aware mapping available in LAMMPS appears to be the numa style of grid mapping of processors, which reduces off-node communication (as described in the LAMMPS documentation). Does LAMMPS plan to use a better topology-aware mapping of MPI ranks to physical processors, one that takes into account the proximity of nodes to each other and possibly also the traffic in the underlying network (perhaps by conducting MPI latency and bandwidth tests or by using MPI topologies, e.g. http://www.mcs.anl.gov/~balaji/pubs/2011/eurompi/eurompi11.mpi-topology.pdf)? This process-to-processor mapping could also be changed dynamically. Do such implementations exist for LAMMPS, or have I overlooked them?
    I have tried searching the mailing list archives for these questions. If I have missed something, please direct me to the appropriate link or material.
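
To make the second question concrete, here is a minimal sketch of what I mean by multiple communication steps. It is not LAMMPS code: it models a 1D periodic ring of equal-width processor boxes in plain C++ without MPI, and the names Decomposition, exchange_pass and migrate are hypothetical. Each pass hands a misplaced atom to one nearest neighbor, and passes repeat until nothing moves, so an atom that is several boxes away still reaches its owner instead of being lost.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D model: nprocs equal-width boxes tile the periodic
// interval [0, length). Not the LAMMPS data layout.
struct Decomposition {
    int nprocs;
    double length;

    double width() const { return length / nprocs; }

    // Index of the box that contains x; x must already be wrapped.
    int owner_of(double x) const {
        assert(x >= 0.0 && x < length);
        int t = static_cast<int>(x / width());
        return t < nprocs ? t : nprocs - 1;  // guard against rounding
    }
};

// One exchange pass: every misplaced atom moves exactly one box
// toward its owner, taking the shorter way around the ring.
// Returns the number of atoms that were handed to a neighbor.
std::size_t exchange_pass(const Decomposition& d,
                          std::vector<std::vector<double>>& atoms) {
    std::vector<std::vector<double>> next(d.nprocs);
    std::size_t sent = 0;
    for (int p = 0; p < d.nprocs; ++p) {
        for (double x : atoms[p]) {
            const int target = d.owner_of(x);
            if (target == p) {
                next[p].push_back(x);  // already in the right box
                continue;
            }
            const int forward = (target - p + d.nprocs) % d.nprocs;
            const int dest = (forward <= d.nprocs - forward)
                                 ? (p + 1) % d.nprocs
                                 : (p + d.nprocs - 1) % d.nprocs;
            next[dest].push_back(x);
            ++sent;
        }
    }
    atoms.swap(next);
    return sent;
}

// Repeat passes until no atom is misplaced; returns the pass count,
// which equals the largest number of hops any atom needed.
int migrate(const Decomposition& d,
            std::vector<std::vector<double>>& atoms) {
    int passes = 0;
    while (exchange_pass(d, atoms) > 0) ++passes;
    return passes;
}
```

In a real MPI code, deciding that no atom is misplaced anywhere would need a global reduction (for example MPI_Allreduce) after every pass, which may be one cost of this approach; I am assuming that, not stating how LAMMPS would do it.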

Thank you in advance.

Comments below.

Steve

Thank you for clarification…