[lammps-users] parallel runs on non-uniform density systems (Langevin coarse-graining)

I am interested in doing implicit-solvent (Langevin dynamics) sims in parallel. I would like to simulate thousands of beads for millions of timesteps or more. However, my system is inherently non-uniform in density: voids separate the higher-density regions where the molecules are. I have found poor scalability, with jobs in fact running slower in parallel than on a single processor. Monitoring individual CPU usage confirms that in parallel runs each CPU operates well below 100%, with every processor at the same load (e.g. 16%). My feeling is that when the parallel algorithm divvies up the work, most of the beads end up on one or two processors, the other processors contribute nothing, and the whole run is slowed by forcing communication among a number of otherwise idle processors. Can anyone confirm or dispute my intuition about what is happening? I realize I have offered few details, but I would be happy to provide any relevant ones on request (LAM-MPI, EM64T dual quad-core processors, -ssi rpi tcp, etc.).
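The intuition above can be checked with a toy model. Below is a minimal Python sketch, assuming a plain uniform spatial decomposition (the box is cut into equal-size bricks, one per processor; this is the general idea, not LAMMPS's actual code) and a hypothetical system of two dense clusters in an otherwise empty box. The function names are made up for illustration.

```python
import random


def spatial_decomposition_counts(positions, box, grid):
    """Count beads owned by each processor when the box is cut into
    equal-size bricks, one brick per processor."""
    nx, ny, nz = grid
    counts = [0] * (nx * ny * nz)
    for x, y, z in positions:
        # Brick index along each axis; min() keeps a bead sitting exactly
        # on the upper box face inside the last brick.
        ix = min(int(x / box[0] * nx), nx - 1)
        iy = min(int(y / box[1] * ny), ny - 1)
        iz = min(int(z / box[2] * nz), nz - 1)
        counts[(ix * ny + iy) * nz + iz] += 1
    return counts


def imbalance(counts):
    """Busiest processor's load divided by the mean load (1.0 = perfect)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


def clustered_beads(n, centers, sigma, box, seed=0):
    """Hypothetical test system: beads scattered around a few dense centers,
    with empty space elsewhere. Coordinates are clamped into the box."""
    rng = random.Random(seed)
    beads = []
    for _ in range(n):
        center = rng.choice(centers)
        beads.append(tuple(
            min(max(rng.gauss(c, sigma), 0.0), length)
            for c, length in zip(center, box)
        ))
    return beads


if __name__ == "__main__":
    box = (100.0, 100.0, 100.0)
    beads = clustered_beads(4000, [(20.0, 20.0, 20.0), (30.0, 25.0, 20.0)], 3.0, box)
    # 2 x 2 x 2 grid = 8 processors, e.g. one dual quad-core node.
    counts = spatial_decomposition_counts(beads, box, (2, 2, 2))
    print("beads per processor:", counts)
    print("imbalance (max/mean):", imbalance(counts))
```

With both clusters inside one brick, one processor owns essentially every bead while the other seven own none, yet all eight still take part in every communication step.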

I know that LAMMPS uses spatial-decomposition algorithms. Should I be looking for atom- or force-decomposition algorithms instead? I'm sure this problem has come up before, as Langevin dynamics sims without explicit solvent are common in polymers. However, in my case I am interested in using many more coarse-grained atoms than is usual, hence the desire to run in parallel.

How many processors, how many particles? You are correct
that non-uniform density will be a problem for LAMMPS. The
processors command might be helpful depending on your
geometry; it lets you specify the layout of the processor grid.
But if you have only a few thousand particles, then you won't
be able to use lots of procs efficiently no matter what you do.
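How much a different processor layout can help depends on how the dense regions are arranged. The Python sketch below assumes equal-size bricks per processor and a hypothetical thin film (spread in x and y, confined near z = 30); it enumerates every Px x Py x Pz grid for 8 processors and reports how many beads the busiest processor owns. The helper names are made up; a layout chosen this way would be requested in a LAMMPS input script with a command of the form `processors Px Py Pz`.

```python
import random


def grid_layouts(nprocs):
    """Every (px, py, pz) processor grid whose product is nprocs."""
    layouts = []
    for px in range(1, nprocs + 1):
        if nprocs % px:
            continue
        rest = nprocs // px
        for py in range(1, rest + 1):
            if rest % py == 0:
                layouts.append((px, py, rest // py))
    return layouts


def brick_counts(positions, box, grid):
    """Beads per processor for equal-size bricks laid out as grid."""
    nx, ny, nz = grid
    counts = [0] * (nx * ny * nz)
    for x, y, z in positions:
        ix = min(int(x / box[0] * nx), nx - 1)
        iy = min(int(y / box[1] * ny), ny - 1)
        iz = min(int(z / box[2] * nz), nz - 1)
        counts[(ix * ny + iy) * nz + iz] += 1
    return counts


def best_layout(positions, box, nprocs):
    """Layout whose busiest processor holds the fewest beads."""
    return min(grid_layouts(nprocs),
               key=lambda g: max(brick_counts(positions, box, g)))


if __name__ == "__main__":
    rng = random.Random(1)
    box = (100.0, 100.0, 100.0)
    # Hypothetical thin film: uniform in x and y, Gaussian in z around 30.
    film = [(rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0),
             min(max(rng.gauss(30.0, 2.0), 0.0), 100.0)) for _ in range(4000)]
    for g in grid_layouts(8):
        print(g, "max beads on one proc:", max(brick_counts(film, box, g)))
    print("best layout:", best_layout(film, box, 8))
```

For this geometry any grid with Pz = 1 spreads the film evenly, while cutting along z leaves whole layers of processors empty. Even the best case here is only about 500 beads per processor, which is the reply's point: with a few thousand particles there is little work to hide the communication behind.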

You could try atom or force decomposition, but I know of no
full-featured code that does it.
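For readers weighing the two alternatives, here is a minimal Python sketch of how each one assigns work, independent of where the atoms sit in space. It is an illustration only, not functionality of LAMMPS or any other package; the function names are made up, the force-decomposition part assumes a square processor count, and it ignores Newton's-third-law savings.

```python
import math


def atom_decomposition(natoms, nprocs):
    """Atom decomposition: split atom indices into nprocs contiguous blocks
    regardless of where the atoms are, so every processor owns (nearly) the
    same number of atoms. The price is that each processor needs the
    positions of all atoms every step."""
    base, extra = divmod(natoms, nprocs)
    blocks, start = [], 0
    for p in range(nprocs):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def force_decomposition(natoms, nprocs):
    """Force decomposition: view the natoms x natoms pair-force matrix as a
    q x q grid of sub-blocks (nprocs = q * q). Processor (i, j) computes the
    interactions between atom block i and atom block j, so it only needs the
    positions of those two blocks."""
    q = math.isqrt(nprocs)
    if q * q != nprocs:
        raise ValueError("this sketch assumes a square processor count")
    blocks = atom_decomposition(natoms, q)
    return {(i, j): (blocks[i], blocks[j]) for i in range(q) for j in range(q)}


if __name__ == "__main__":
    owned = atom_decomposition(4000, 8)
    print("atoms per processor:", [len(b) for b in owned])
    pairs = force_decomposition(4000, 16)
    rows, cols = pairs[(1, 2)]
    print("processor (1, 2) needs", len(rows) + len(cols), "positions")
```

Both schemes balance the atom count by construction, so voids no longer leave processors idle; the trade-off is that each processor handles positions from across the whole system rather than only from its spatial neighbors.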