I am interested in doing implicit-solvent (Langevin dynamics) sims in parallel. I would like to simulate thousands of beads for millions-plus of timesteps. However, my system is inherently non-uniform in density: there are voids separating higher-density regions where the molecules exist. I have found poor scalability; in fact, jobs run slower in parallel than they did on a single processor. Monitoring individual CPU usage confirms that in parallel runs each CPU operates at much less than 100%, with every processor under the same load (e.g., 16%).

My feeling is that when the parallel algorithm divvies up the work, most of the beads end up on one or two processors, with the other processors contributing nothing and the whole run being slowed by forced communication among otherwise idle processors. Can anyone support or dispute my intuition about what is happening? I understand that I have offered little in the way of details, but I would be happy to provide any relevant pieces if asked (LAM-MPI, EM64T dual quad-core processors, -ssi rpi tcp, etc.).
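To make my intuition concrete, here is a quick stand-alone sketch (not my actual data; the coordinates, box size, and 2x2x2 processor grid are all made up for illustration) of how a uniform spatial decomposition distributes beads when the system is two dense clusters separated by a void:

```python
import numpy as np

# Hypothetical bead coordinates: two dense clusters near opposite corners
# of a [0, 100]^3 box, with empty space between them.
rng = np.random.default_rng(0)
beads = np.vstack([
    rng.normal(loc=10.0, scale=1.0, size=(5000, 3)),
    rng.normal(loc=90.0, scale=1.0, size=(5000, 3)),
])

# Mimic a uniform 2x2x2 spatial decomposition: each "processor" owns one
# equal-volume octant of the box, regardless of how many beads are in it.
nx = ny = nz = 2
edges = np.linspace(0.0, 100.0, nx + 1)

ix = np.clip(np.searchsorted(edges, beads[:, 0], side="right") - 1, 0, nx - 1)
iy = np.clip(np.searchsorted(edges, beads[:, 1], side="right") - 1, 0, ny - 1)
iz = np.clip(np.searchsorted(edges, beads[:, 2], side="right") - 1, 0, nz - 1)
rank = (ix * ny + iy) * nz + iz

counts = np.bincount(rank, minlength=nx * ny * nz)
print("beads per rank:", counts)                      # two ranks get everything
print("imbalance factor:", counts.max() / counts.mean())
```

With all the beads piled into two of the eight subdomains, the busiest rank holds 4x the average load, so six of the eight "processors" would sit idle while still participating in communication, which matches the low, uniform CPU utilization I am seeing.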
I know that LAMMPS uses a spatial-decomposition algorithm. Should I be looking for atom- or force-decomposition algorithms instead? I’m sure this problem has come up before, as Langevin dynamics (no explicit solvent) sims are common in polymer work. However, in my case I am interested in using many more coarse-grained atoms than is usual, hence the desire to run in parallel.
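For what it's worth, I've also seen mention of load-balancing commands in newer LAMMPS versions that re-cut the spatial decomposition by per-processor atom count rather than by equal volume. A sketch of what I have in mind is below (command names are from the LAMMPS docs as I understand them; the threshold and interval values are just illustrative). Is this the right direction, or do I really need a different decomposition altogether?

```
# Use non-uniform (tiled) subdomains instead of a regular brick grid.
comm_style      tiled

# Static rebalance before the run: recursively bisect the box so each
# processor holds roughly the same number of atoms (imbalance threshold 1.1).
balance         1.1 rcb

# Re-balance every 1000 steps during the run as beads move around.
fix             lb all balance 1000 1.1 rcb
```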