Load Balancing with Partitions on GPUs

Hi,

I’m running a replica exchange molecular dynamics (REMD) simulation in LAMMPS. I’ve been testing with two replicas, so my simulation uses two partitions. The run command looks like this:

mpirun -np 2 lmp_mpi -sf gpu -pk gpu 2 -partition 2x1 -in input_file.in

My node has two GPUs. I would like to run one partition on each GPU. However, I’m unsure how to force LAMMPS to split the partitions between the GPUs.

Whenever I run, nvidia-smi shows that both partitions are running on the same GPU.

Thanks,
Nick

This is currently not possible. The code in the GPU package that initializes the GPUs is not aware of multi-partition runs and thus will enumerate GPUs separately for each “world” communicator of each partition.

Here is a suggestion for a (rather hack-ish) workaround for your specific case.
If you create a shell script named lmp_wrap containing the code below:

#!/bin/sh
# pin each MPI rank to one GPU via its node-local rank id
if [ -n "${MPI_LOCALRANKID}" ]
then
        # MPICH / Intel MPI (Hydra launcher)
        export CUDA_VISIBLE_DEVICES=${MPI_LOCALRANKID}
elif [ -n "${OMPI_COMM_WORLD_LOCAL_RANK}" ]
then
        # Open MPI
        export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
else
        echo "Unsupported MPI library" >&2
        exit 1
fi

exec lmp_mpi "$@"

Then, after making the script executable (chmod +x lmp_wrap) and placing it somewhere in your PATH, you should be able to do:
mpirun -np 2 lmp_wrap -sf gpu -pk gpu 1 -partition 2x1 -in input_file.in

Each LAMMPS process should then “see” only one GPU, but each will “see” a different one.
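
If you want to verify the mapping before a full run, a quick sanity check (assuming Open MPI or an MPICH/Intel-MPI style launcher that exports one of the two variables above) is to have each rank print what the wrapper would export:

mpirun -np 2 sh -c 'rank=${OMPI_COMM_WORLD_LOCAL_RANK:-$MPI_LOCALRANKID}; echo "local rank $rank -> CUDA_VISIBLE_DEVICES=$rank"'

You should see two lines, one per rank, each pointing at a different device id.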

P.S.: this kind of hack will work as long as you are using only 1 MPI rank per GPU, and it should also work with multi-node jobs.
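
If you ever do need more MPI ranks than GPUs on a node, one possible (untested) generalization is to assign GPUs round-robin by local rank. This is only a sketch; GPUS_PER_NODE is a hypothetical variable you would set to match your hardware:

#!/bin/sh
# sketch only: round-robin mapping of local MPI ranks onto GPUs
GPUS_PER_NODE=${GPUS_PER_NODE:-2}   # assumed: number of GPUs per node
if [ -n "${MPI_LOCALRANKID}" ]
then
        LOCAL_RANK=${MPI_LOCALRANKID}
elif [ -n "${OMPI_COMM_WORLD_LOCAL_RANK}" ]
then
        LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK}
else
        echo "Unsupported MPI library" >&2
        exit 1
fi
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % GPUS_PER_NODE ))
exec lmp_mpi "$@"

Whether oversubscribing a GPU with multiple ranks actually helps performance is a separate question.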

This worked well; the partitions are now split between the two GPUs. However, I’m only getting about 6% GPU usage and ~40 timesteps/second, compared to the ~136 timesteps/second I get when running the same simulation without partitions (i.e., non-replica MD on one GPU). I’m guessing the bottleneck is somewhere else, possibly in MPI. Will post progress if I have any.

Thanks Axel!