Load Balancing with Partitions on GPUs

Hi,

I’m running a replica exchange molecular dynamics (REMD) simulation in LAMMPS. I’ve been testing with two replicas, so my simulation uses two partitions. The run command looks like this:

mpirun -np 2 lmp_mpi -sf gpu -pk gpu 2 -partition 2x1 -in input_file.in

My node has two GPUs. I would like to run one partition on each GPU. However, I’m unsure how to force LAMMPS to split the partitions between the GPUs.

Whenever I run, nvidia-smi shows that both partitions are running on the same GPU.

Thanks,
Nick

This is currently not possible. The code in the GPU package that initializes the GPUs is not aware of multi-partition runs and thus will enumerate GPUs separately for each “world” communicator of each partition.

Here is a suggestion for a (rather hack-ish) workaround for your specific case.
If you create a shell script named lmp_wrap containing the code below:

#!/bin/sh
# pin each MPI rank to one GPU via its node-local rank id
if [ -n "${MPI_LOCALRANKID}" ]
then
        # MPICH / Intel MPI (Hydra launcher)
        export CUDA_VISIBLE_DEVICES=${MPI_LOCALRANKID}
elif [ -n "${OMPI_COMM_WORLD_LOCAL_RANK}" ]
then
        # Open MPI
        export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
else
        echo "Unsupported MPI library" >&2
        exit 1
fi

exec lmp_mpi "$@"

Then, after making the script executable (chmod +x lmp_wrap) and placing it somewhere in your PATH, you should be able to do:
mpirun -np 2 lmp_wrap -sf gpu -pk gpu 1 -partition 2x1 -in input_file.in

Each LAMMPS process should then “see” only one GPU, but each will “see” a different one.
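
If you want to verify the mapping before a full run, a quick sanity check (assuming Open MPI or an MPICH/Intel-MPI style launcher that exports one of the two variables above) is to have each rank print what the wrapper would export:

mpirun -np 2 sh -c 'rank=${OMPI_COMM_WORLD_LOCAL_RANK:-$MPI_LOCALRANKID}; echo "local rank $rank -> CUDA_VISIBLE_DEVICES=$rank"'

You should see two lines, one per rank, each pointing at a different device id.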

P.S.: this kind of hack will work as long as you are using only 1 MPI rank per GPU, and it should also work with multi-node jobs.
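
If you ever do need more MPI ranks than GPUs on a node, one possible (untested) generalization is to assign GPUs round-robin by local rank. This is only a sketch; GPUS_PER_NODE is a hypothetical variable you would set to match your hardware:

#!/bin/sh
# sketch only: round-robin mapping of local MPI ranks onto GPUs
GPUS_PER_NODE=${GPUS_PER_NODE:-2}   # assumed: number of GPUs per node
if [ -n "${MPI_LOCALRANKID}" ]
then
        LOCAL_RANK=${MPI_LOCALRANKID}
elif [ -n "${OMPI_COMM_WORLD_LOCAL_RANK}" ]
then
        LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK}
else
        echo "Unsupported MPI library" >&2
        exit 1
fi
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % GPUS_PER_NODE ))
exec lmp_mpi "$@"

Whether oversubscribing a GPU with multiple ranks actually helps performance is a separate question.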

This worked well; the partitions are now split between the two GPUs. However, I’m only getting about 6% GPU usage and ~40 timesteps/second, compared to the ~136 timesteps/second I get when running the same simulation without partitions (i.e., non-replica MD on one GPU). I’m guessing the bottleneck is somewhere else, possibly in MPI. Will post progress if I have any.

Thanks Axel!