Right now, I’m trying to split an initial group of processors into smaller groups and run a separate instance of LAMMPS on each group. I am also running ‘blocks’ of timesteps, using a loop over run X commands, and doing some additional work with various output parameters between loop iterations.
My code is set up similarly to the simple.cpp example in the COUPLE examples directory: I assign each processor a group number according to some scheme, split the communicator with MPI_Comm_split, create a separate LAMMPS object on each new local communicator, read in the corresponding start file, and then send a series of run commands.
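As an illustration of the group-numbering step described above, here is a minimal sketch that assumes contiguous, equal-sized groups. The names `assign_group` and `GroupAssignment` are hypothetical, and the actual scheme used in the code may well differ; the sketch only computes the color and key values that would be handed to MPI_Comm_split.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper type: the two integers MPI_Comm_split expects.
struct GroupAssignment {
    int color;  // group index; ranks sharing a color end up in one communicator
    int key;    // ordering of the rank inside its new communicator
};

// Assumed scheme: world_size ranks are cut into ngroups contiguous blocks
// of equal size. Any other mapping would work as long as every rank in a
// group computes the same color.
GroupAssignment assign_group(int world_rank, int world_size, int ngroups) {
    if (ngroups <= 0 || world_size <= 0 || world_size % ngroups != 0)
        throw std::invalid_argument("world_size must be a positive multiple of ngroups");
    if (world_rank < 0 || world_rank >= world_size)
        throw std::invalid_argument("world_rank out of range");

    const int procs_per_group = world_size / ngroups;
    const int color = world_rank / procs_per_group;
    const int key = world_rank % procs_per_group;
    assert(color < ngroups);  // every rank must land in a valid group
    return GroupAssignment{color, key};
}
```

In an MPI program the result would typically be used as `MPI_Comm_split(MPI_COMM_WORLD, a.color, a.key, &local_comm)`, with `local_comm` then passed to the LAMMPS constructor as simple.cpp does. Under this scheme, 2 processors and 1 group give both ranks color 0, so the split communicator should contain exactly the same ranks as MPI_COMM_WORLD. Printing the color and local rank from each process is a cheap way to confirm that the split matches what was intended.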
The issue that I’m running into is that my code is mysteriously slow when doing this. I have a single-group test version of the code that uses the entire set of processors without any splits and runs all the steps at once instead of looping over blocks. That version runs in a little over a minute for the parameters of my simple test set.
However, with an equivalent processor decomposition (2 total procs in 1 group vs. 2 procs without grouping) and the MPI_Comm_split approach, my code doesn’t finish even after 10-15 minutes. I do have output showing that the timesteps are progressing, so as far as I can tell it isn’t a simple deadlock. My best guess is that I’m somehow splitting the system improperly and forcing different instances of LAMMPS to compete for the limited resources on my test machine. Even so, I’m not sure why this would happen or how to prove that this is what’s causing the trouble, so I’d appreciate suggestions or ideas if anyone has run into this sort of problem before. In particular, I’m wondering if I’m missing some aspect of opening multiple LAMMPS instances, since the example file only divides the universe of processors into a LAMMPS set and a non-LAMMPS set.