How to use MPI in fix external

Dear LAMMPS developers and users:

This question is an extension of an earlier post: How to set the value of per-atom variables via python? (LAMMPS Development, Materials Science Community Discourse)

LAMMPS version: 20240829
System: CentOS 7
Python: 3.10.15

Following Axel Kohlmeyer’s advice, I now use fix external to acquire the dynamic external force. There is no problem in serial mode, but when I use MPI for parallel computing, several problems appear. The program that computes the external force must run in serial because it needs the properties of all the atoms in the system. All the quantities requested or scattered in the callback function have been defined in the LAMMPS input script (so the simulation runs normally in serial). Below are some simple Python callback functions I have tried; a sketch of how they are registered with fix external comes first, for context.
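Roughly, the registration looks like this (a minimal sketch with placeholder names; the fix ID 'ext', the file name in.test, and the run length are not my actual values, and the input script is assumed to contain a line like fix ext all external pf/callback 1 1):

from mpi4py import MPI          # importing mpi4py initializes MPI before LAMMPS is created
from lammps import lammps

lmp = lammps()                  # one LAMMPS instance per MPI rank
lmp.file('in.test')             # input script that defines fix 'ext' (external pf/callback)

def callback(caller, step, nlocal, tag, x, fext):
    # fill fext (per-processor, length nlocal) with the external forces
    # for this rank's local atoms; the actual computation is omitted here
    pass

lmp.set_fix_external_callback('ext', callback, lmp)
lmp.command('run 1000')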

from ctypes import c_double

import numpy as np
from mpi4py import MPI
from lammps import LMP_STYLE_GLOBAL, LMP_STYLE_LOCAL, LMP_TYPE_SCALAR, LMP_TYPE_ARRAY

comm = MPI.COMM_WORLD
me = comm.Get_rank()

# F1
def callback(caller, step, nlocal, tag, x, fext):
    comm.Barrier()
    if me == 0:
        foo = caller.numpy.extract_compute('foo', LMP_STYLE_GLOBAL, LMP_TYPE_SCALAR)

I ran mpirun -np 2 python3 test.py in the terminal, but the simulation could not proceed: there was no further output after the time step was printed.

A second callback function F2:

def callback(caller, step, nlocal, tag, x, fext):
    comm.Barrier()
    if me == 0:
        n = caller.get_natoms()
        b_info = caller.numpy.extract_compute('bInfo', LMP_STYLE_LOCAL, LMP_TYPE_ARRAY)

For F2, mpirun -np 2 or -np 4 actually ran very well (surprisingly), but -np 6 or more processors raised this error:

Exception ignored on calling ctypes callback function: <function lammps.set_fix_external_callback.<locals>.callback_wrapper at 0x7fa204c9a290>
Traceback (most recent call last):
  File "lammps/core.py", line 2236, in callback_wrapper
    callback(caller, ntimestep, nlocal, tag, x, f)
  File "test.py", line 24, in callback
    b_info = caller.numpy.extract_compute('bInfo', LMP_STYLE_LOCAL, LMP_TYPE_ARRAY)
  File "lammps/numpy_wrapper.py", line 186, in extract_compute
    return self.darray(value, nrows, ncols)
  File "lammps/numpy_wrapper.py", line 495, in darray
    ptr = cast(raw_ptr[0], POINTER(c_double * nelem * dim))
ValueError: NULL pointer access

A third callback function, F3:

def callback(caller, step, nlocal, tag, x, fext):
    comm.Barrier()
    if me == 0:
        var = np.zeros(10)
        caller.scatter_atoms('d_strain', 1, 1, (var.size * c_double)(*var))

For F3, the problem is the same as with F1: there was no further output after the time step was printed.

A fourth callback function, F4, removes the if me == 0: guard:

def callback(caller, step, nlocal, tag, x, fext):
    comm.Barrier()
    foo = caller.numpy.extract_compute('foo', LMP_STYLE_GLOBAL, LMP_TYPE_SCALAR)
    n = caller.get_natoms()
    b_info = caller.numpy.extract_compute('bInfo', LMP_STYLE_LOCAL, LMP_TYPE_ARRAY)
    var = np.zeros(10)
    caller.scatter_atoms('d_strain', 1, 1, (var.size * c_double)(*var))

The simulation ran well with -np 2, but the results were not as expected. With -np 4 or more, the problem is the same as with F1 and F3.

If comm.Barrier() was removed, the problem was the same as with F1 and F3, regardless of the number of processors.

Another question I am pondering: the fext argument is per-processor, but the external forces I compute are for the atoms on all the processors…

That is a bad design for any feature that is to be used in conjunction with LAMMPS, as it will massively interfere with parallel efficiency. In LAMMPS, all per-atom data is distributed across subdomains, which is what allows LAMMPS to scale to very large systems and very many atoms. Even topology data is associated with atoms and then propagated accordingly. Any kind of global property will require explicit parallel programming and communication to collect it from all subdomains.

Thus to make such a feature work, significant parallel programming needs to be added where you collect and distribute data as needed.

From the statements above and the fact that fext is of length nlocal, it should be obvious that fext is allocated and must be written to on a per-subdomain basis and should only contain the forces of the corresponding local atoms. If your data is global, you need to add MPI calls to distribute it properly, e.g. via a broadcast of an array of length natoms and then using each atom's tag property to look up its values in the broadcast array for the local atoms.
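Schematically, that broadcast-and-lookup pattern can look like the following (an illustrative, untested sketch; compute_global_forces() is only a placeholder for whatever serial code produces the per-atom forces, and contiguous atom IDs 1..natoms are assumed):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
me = comm.Get_rank()

def callback(caller, step, nlocal, tag, x, fext):
    # collect coordinates and tags from every subdomain on rank 0
    data = comm.gather((np.asarray(x), np.asarray(tag)), root=0)
    f_global = None
    if me == 0:
        natoms = int(caller.get_natoms())
        x_global = np.zeros((natoms, 3))
        for x_loc, tag_loc in data:
            x_global[tag_loc - 1] = x_loc            # order rows by 0-based atom tag
        f_global = compute_global_forces(x_global)   # placeholder: serial force computation
    # broadcast the natoms-long force array, then let each rank pick
    # out the rows belonging to its own local atoms via their tags
    f_global = comm.bcast(f_global, root=0)
    fext[:] = f_global[np.asarray(tag) - 1]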


Whoops, this is not easy work. It will take some time to study parallel programming. Anyway, thanks for such a quick reply.

After a day of debugging, the coupling now runs correctly under MPI, and the computation is much faster.

In case someone else wants to carry out a similar coupling, the callback functions I used are listed below:

def callback1(caller, step, nlocal, tag, x, fext):
    # gather each rank's coordinates together with the atom tags onto rank 0
    x_all = comm.gather(np.hstack((x, tag[:, np.newaxis])), root=0)
    res = None
    fs = None
    if me == 0:
        # 0-based global indices of every rank's local atoms, in gather order
        tags = [x_tag[:, -1].astype(int) - 1 for x_tag in x_all]
        x_all = np.vstack(x_all)
        res = ...
        f = np.zeros((caller.get_natoms(), 3))   # global force array (actual computation omitted)
        fs = [f[t] for t in tags]                # split back into per-rank chunks
    res = comm.bcast(res, root=0)
    fext[:] = comm.scatter(fs, root=0)
    comm.Barrier()
    # n and idC are module-level variables defined elsewhere in the driver script
    caller.scatter_atoms_subset('d_res', 1, 1, n, idC, (n * c_double)(*res))

This callback function also works for me:

def callback2(caller, step, nlocal, tag, x, fext):
    # gather the wall-atom coordinates (n_wall, id_wallC defined elsewhere) and the local tags
    x_all = caller.gather_atoms_subset('x', 1, 3, n_wall, id_wallC)
    tags = comm.gather(tag - 1, root=0)
    res = None
    fs = None
    if me == 0:
        # x_all and tags are ready to use
        res = ...
        f = np.zeros((caller.get_natoms(), 3))   # global force array (actual computation omitted)
        fs = [f[t] for t in tags]
    res = comm.bcast(res, root=0)
    fext[:] = comm.scatter(fs, root=0)
    comm.Barrier()
    caller.scatter_atoms_subset('d_res', 1, 1, n, idC, (n * c_double)(*res))
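
One detail needed to reproduce this: the scatter of 'd_res' only works because the input script defines a matching custom per-atom property, and the fix external itself must also be defined there. Roughly (a sketch with placeholder fix IDs; the actual system setup is omitted):

from lammps import lammps

lmp = lammps()
# ... units, atom definitions, pair style, groups, computes, etc. omitted ...
lmp.commands_string("""
fix store all property/atom d_res        # per-atom double vector accessed as 'd_res'
fix drive all external pf/callback 1 1   # fix external that invokes the Python callback
""")
lmp.set_fix_external_callback('drive', callback1, lmp)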

Here is the performance comparison:

            -np 1            -np 8
callback1   0.2368 sec/step  0.0535 sec/step
callback2   0.2347 sec/step  0.0516 sec/step