Setting up MPI communicator groups for the LAMMPS python wrapper

Hello everyone,

I’m using LAMMPS version 17 Apr 2024 on a Linux machine with Python 3.12.3. I’m writing a script that takes parameter ranges from a user and initiates simulations with random values drawn from those ranges - these are what I intend to pass to the LAMMPS object.

I’m a beginner to MPI programming, so I’m looking for a general direction here to get started with my search.

I have a Python script that splits MPI.COMM_WORLD into individual communicators, one for each MPI process, with one instance of LAMMPS initialised for every slave process.

    rank = MPI.COMM_WORLD.Get_rank()
    comm = MPI.COMM_WORLD
    status = MPI.Status()
    if rank == 0:
        color = MPI.UNDEFINED
    else:
        color = rank

    print(f"Colour of process {rank} is {color}")
    com_split = comm.Split(color, key=rank)

The call to the LAMMPS function is then done on the slave ranks as:

    obj_recv = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
    return_values = Child_init(position=obj_recv[1], file_path=obj_recv[2], comm=com_split, kwargs=discrete_dict)

I have a fundamental question on splitting communicators and passing information to individual groups.

I would like to group the slave processes and pass each group to LAMMPS so that I can run each simulation on multiple processes (let’s say 4). I have no trouble passing the groups to LAMMPS as communicators and getting it to initialise. (In this case I’m communicating the information as detailed above, only to ‘process 0’ in each group.) However, when the initialised LAMMPS instance tries to read a file, it freezes.

My question is, how would I have to go about setting up the communicators to pass this information to the LAMMPS instances? Should I create groups and multiple intercommunicators to pass the object to each group? Where can I find guidance on dealing with MPI communicators?
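(For reference, inside my Child_init helper the group communicator is handed to the LAMMPS constructor roughly like the simplified sketch below; the real function does more than this.)

    from lammps import lammps

    # simplified sketch: every rank of the group communicator calls the
    # constructor collectively, so the simulation runs in parallel on that group
    lmp = lammps(comm=com_split)
    lmp.file(file_path)   # read the input script on all ranks of the group
    lmp.close()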

Thank you

There is very little tutorial material for mpi4py, but you can easily use tutorials for C or Fortran and adapt them once you understand how the names of the MPI functions are mapped to the mpi4py classes.
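For example, the mapping looks roughly like this (just an illustration of the naming convention, not taken from any particular tutorial):

    from mpi4py import MPI

    # C/Fortran name              ->  mpi4py equivalent (methods on the communicator object)
    # MPI_Comm_rank(comm, &rank)  ->  rank = comm.Get_rank()
    # MPI_Comm_size(comm, &size)  ->  size = comm.Get_size()
    # MPI_Comm_split(...)         ->  newcomm = comm.Split(color, key)
    # MPI_Send / MPI_Recv         ->  comm.Send()/comm.Recv() for buffers,
    #                                 comm.send()/comm.recv() for picklable Python objects
    # MPI_Bcast                   ->  comm.Bcast(buf, root=0) / comm.bcast(obj, root=0)
    comm = MPI.COMM_WORLD
    print(f"rank {comm.Get_rank()} of {comm.Get_size()}")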

Here is a simple example that sets up a manager/worker system with global rank 0 dispatching data to the sub-communicators. For that the global communicator is split into intracommunicators for the LAMMPS instances (with global rank 0 not starting one). Then it also creates a list of intercommunicators between global rank 0 and rank 0 of each of the split communicators.

The intercommunicators are then used to send a string from global rank 0 to local rank 0, and local rank 0 then broadcasts it within its intracommunicator.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

procs_per_mpi = 4
ncomms = int(size / procs_per_mpi)

#print("ncomms = ", ncomms)

if rank == 0:
    color = MPI.UNDEFINED
else:
    color = int((rank - 1) / procs_per_mpi)

# generate communicators for LAMMPS instances
lmpcomm = comm.Split(color)
lmprank = MPI.UNDEFINED
if rank > 0:
    lmprank = lmpcomm.Get_rank()
#    print("global rank %d  color %d local rank %d" % (rank, color, lmprank))

# generate communicators for sending data from global rank 0 to local rank 0
intercomm = []
for i in range(0,ncomms):
    intercolor = MPI.UNDEFINED
    if lmprank == 0 and color == i:
        intercolor = 1
    if rank == 0:
        intercolor = 1
    intercomm.append(comm.Split(intercolor))

irank = MPI.UNDEFINED
for icomm in intercomm[0:ncomms]:
    if icomm != MPI.COMM_NULL:
        irank = icomm.Get_rank()
        isize = icomm.Get_size()
        #print("global rank %d  local rank %d local size %d" % (rank, irank, isize))

# send strings to rank 0 of groups then broadcast within group
if rank == 0:
    files = ['file1', 'file2', 'file3', 'file4']
    for i in range(0,ncomms):
        val = intercomm[i].send(files[i], dest = 1)
else:
    if intercomm[color] != MPI.COMM_NULL:
        val = intercomm[color].recv(source = 0)
    else:
        val = ''
    val = lmpcomm.bcast(val, root = 0)
    print("global rank %d  local rank %d  val = %s" % (rank,lmprank, val))

If I run the command: mpirun -np 16 python split_comms.py | sort -n -k 3 I get the output:

global rank 1  local rank 0  val = file1
global rank 2  local rank 1  val = file1
global rank 3  local rank 2  val = file1
global rank 4  local rank 3  val = file1
global rank 5  local rank 0  val = file2
global rank 6  local rank 1  val = file2
global rank 7  local rank 2  val = file2
global rank 8  local rank 3  val = file2
global rank 9  local rank 0  val = file3
global rank 10  local rank 1  val = file3
global rank 11  local rank 2  val = file3
global rank 12  local rank 3  val = file3
global rank 13  local rank 0  val = file4
global rank 14  local rank 1  val = file4
global rank 15  local rank 2  val = file4

which confirms that it is doing what it is supposed to do.
Using this little demo as a minimal example, you should be able to implement what you want by adapting and expanding it.
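To attach LAMMPS to this, the split communicator can be passed directly to the python wrapper on the worker ranks, for example like this (a sketch, assuming the string val received above is the name of a LAMMPS input script):

    from lammps import lammps

    if rank > 0:
        # all ranks of a group call the constructor collectively on lmpcomm,
        # so each group runs one LAMMPS instance on procs_per_mpi processes
        lmp = lammps(comm=lmpcomm)
        lmp.file(val)     # the file name received/broadcast above
        lmp.close()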

Hello @akohlmey, thanks a ton for the prompt response and for writing an example of how to set up inter- and intracommunicators.

I ported the communicator initialisation and management into my code and tested the intercommunicators with random integers, and the communication between the master and slave processes is sound.

However, when I try to instantiate a LAMMPS object, my terminal gets flooded with errors (despite LAMMPS running successfully in the background).

This is what my main function looks like:

Here, the variable com_split is the variable lmpcomm from your example. I retained the existing name from my script because it is referenced in many places.

    if rank == 0:
        main(intercomm)
    else:
        #print(f"global rank {rank}, colour {color}, local rank {lmprank}")
        while (1): 
            obj_recv = ()
            #print(f"evalulating function from rank = {rank}")
            if intercomm[color] != MPI.COMM_NULL:
                obj_recv = intercomm[color].recv(source = 0, tag=MPI.ANY_TAG, status=status)
                print(f"received particle at {color} succesfully")
                tag = status.Get_tag()
                if tag == 200:
                    print("deathsignal")
                    intercomm[color].Free()
                    break
                
            obj_recv = com_split.bcast(obj_recv, root = 0)
            dielectric_consts = [70] 
            densities = [2.115] 
            discrete_dict = {dielectric_consts[i] : densities[i] for i in range(len(dielectric_consts))}
            ret_cost = Child_init(position=obj_recv[1], file_path=obj_recv[2], comm = com_split, kwargs=discrete_dict)
            if lmprank == 0:
                obj_to_send = (obj_recv[0], ret_cost)
                if intercomm[color] != MPI.COMM_NULL:
                    intercomm[color].send(obj_to_send, dest=0)
                    

I tried calling LAMMPS only from rank 0 of the communicator by moving the calling code into the final if block, in this manner:

            if lmprank == 0:
                dielectric_consts = [70]
                densities = [2.115] 
                discrete_dict = {dielectric_consts[i] : densities[i] for i in range(len(dielectric_consts))}
                ret_cost = Child_init(position=obj_recv[1], file_path=obj_recv[2], comm = com_split, kwargs=discrete_dict)
                obj_to_send = (obj_recv[0], ret_cost)
                
                
                if intercomm[color] != MPI.COMM_NULL:
                    intercomm[color].send(obj_to_send, dest=0)

In this case, LAMMPS froze upon trying to read a file, just as I detailed in the initial post.

This is what a part of my terminal looks like, with the error:

Setting up cg style minimization ...
  Unit style    : micro
  Current step  : 8711
Per MPI rank memory allocation (min/avg/max) = 154.2 | 154.2 | 154.3 Mbytes
   Step          Temp          E_pair         E_mol          TotEng         Press          Volume    
      8711   0              501.80123      0              501.80123      801.69008      6.1585288    

[sc-calc34:110629] Read -1, expected 452064, errno = 14
[sc-calc34:110629] Read -1, expected 78344, errno = 14
[sc-calc34:110629] Read -1, expected 75344, errno = 14
[sc-calc34:110629] Read -1, expected 78344, errno = 14
[sc-calc34:110629] Read -1, expected 75344, errno = 14
[sc-calc34:110629] Read -1, expected 13520, errno = 14
[sc-calc34:110629] Read -1, expected 17200, errno = 14
[sc-calc34:110629] Read -1, expected 1449472, errno = 14
[sc-calc34:110629] Read -1, expected 1407976, errno = 14
[sc-calc34:110629] Read -1, expected 467616, errno = 14
[sc-calc34:110629] Read -1, expected 452736, errno = 14
[sc-calc34:110629] Read -1, expected 77936, errno = 14
[sc-calc34:110629] Read -1, expected 75456, errno = 14

I’d appreciate any insight, thank you!

Sorry, but I have no time to debug your code.

I have provided you an example to get you started, but the tedious task of debugging your work is your job. There is not enough information provided anyway.

errno 14 is EFAULT ("Bad address"), i.e. the read hit an invalid memory address rather than a missing file. Double check how the buffers and communicators are being passed around.

Thank you for your assistance in getting me started, @akohlmey. I’ve determined the source of the problem.

Yes, I was initially lost because the program worked well before I parallelised it with MPI.

The issue was that I was passing communicator groups to a function, which then instantiated a child process using the multiprocess library with a pipe back to the parent process. As a result, one set of these processes was trying to access addresses that no longer existed.

This brings me to my next question. Is there any way for me to handle MPI abort signals and segmentation faults (thrown due to neighbour list overflows and invalid memory access)?

I know these aren’t desirable, but because of the automation script the simulation sometimes receives inputs that cause one of these faults, and they bring the entire interpreter down.

If you are using the latest feature releases, or compile LAMMPS with -DLAMMPS_EXCEPTIONS=on, you can catch the errors (most C++ exceptions are caught by the library interface and rethrown as Python exceptions). However, there is no clean mechanism for exception handling across MPI, so this cannot work in a meaningful way unless you disable MPI for LAMMPS and compile a serial library and executable. After an exception the MPI communicator(s) are in an undefined state, and MPI stipulates that you can initialize and finalize MPI exactly once during a run, so you would have to stop the executable and start over entirely.
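With such a build, an error raised inside a LAMMPS command can be caught on the Python side along these lines (a sketch for the serial, non-MPI case; the input file name is made up):

    from lammps import lammps

    lmp = lammps()                 # serial (non-MPI) LAMMPS library
    try:
        lmp.file("in.broken")      # hypothetical input script that triggers a LAMMPS error
    except Exception as err:
        print("LAMMPS aborted with:", err)
    finally:
        lmp.close()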

It is almost never a good idea to combine MPI and the multiprocessing module in Python.

Yes, I catch normal crashes this way, but neighbour list overflows and segmentation faults aren’t handled by this exception handler.

Yes, this makes sense. Would spawning child processes using MPI_Comm_spawn be a good workaround for this, since the spawned processes would have their own MPI_COMM_WORLD and ideally shouldn’t propagate an MPI_Abort to all ranks?

Yes, I’ve come to accept this, and I’m running LAMMPS with just OpenMP threading. However, since I use a granular pair style, the acceleration I get from OpenMP over serial execution is meagre: there is a 200-300% difference between running each individual instance with 4 MPI ranks vs 4 OpenMP threads (CPU utilisation in the latter is only 150%). Hence, I’d like to work around this if there is a possibility. Otherwise I’ll stick to the crash-safe approach of running it in a child process, as I do right now.

Thank you for your insight!

Neither should happen when conservative choices are made.

I cannot recommend this.

Granular pair styles are the worst to accelerate (also for GPUs) and specifically the neighbor list build is not multi-threaded in this case.

Yes, I’m aware. However, my use case requires such choices to be made and handled, so it is a compromise I have to make.

Noted, I’ll pass on the MPI grouping.

Interesting. I’ll read up on this when I get the time. Thank you again

I think we get a decent speedup for the chute benchmark with the KOKKOS package on a GPU, and the neighbor list build is multi-threaded in that case. But not all the styles you are using may be ported yet.
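For reference, that benchmark (bench/in.chute in the LAMMPS distribution) can be run on a single GPU with something along the lines of: mpirun -np 1 lmp -in in.chute -k on g 1 -sf kk (adjust the executable name and input path to your installation).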

I just checked, and I can confirm. With 1 MPI task and CUDA acceleration provided by an NVIDIA RTX ADA 3500 GPU, I observe a significant acceleration from using KOKKOS over pure MPI execution.

However, my system uses a hybrid atom style and a hybrid pair style, neither of which seems to be compatible with KOKKOS as it is implemented right now.

Hybrid pair styles work with Kokkos; hybrid atom styles work as well but may not give optimal performance. This should be better in the future.

You’re right, I don’t know what I did wrong last time that made LAMMPS crash with the error: KOKKOS doesn’t support hybrid atom styles.
It works now with my system - I’m benchmarking the acceleration I get with KOKKOS vs other methods.