Simulation gets slower after multiple loops using a custom compute

Hi!

I’m developing a custom compute command that calculates atom displacements using a neighbor list read from an external file. The file lists the global IDs of the central atoms and their neighbors. The displacement measures how far each central atom is from the center of its neighboring atoms.
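In formula form (my notation), the displacement of central atom i with neighbor set N(i) is

d_i = (1/|N(i)|) \sum_{j \in N(i)} (x_i - x_j) = x_i - (1/|N(i)|) \sum_{j \in N(i)} x_j,

where each difference x_i - x_j is evaluated under the minimum-image convention, exactly as the code below computes it.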

In my custom MD simulations, I run multiple loops until a certain condition is met. Here is the key part of my input script:

variable        j loop ${nloop}
label           loop_mark
 
... # some variable definitions

thermo_style    custom step temp etotal pe ke vol lx ly lz
run             20

... # some variable definitions

if "some condition" then "jump  SELF break"
next            j
jump            SELF loop_mark

label           break

I’ve noticed that each loop takes longer to complete over time.

In my implementation, I call atom->map_init() in ComputeCustomDisp::init(), and when looping over all central atoms in ComputeCustomDisp::compute_peratom(), I use atom->map(global_id) to look up each local index on the fly:

void ComputeCustomDisp::init()
{
    // read the neighbor list file, initialize atom->map()
    read_file();
    atom->map_init();
}
void ComputeCustomDisp::compute_peratom() 
{   

    // check number of atoms
    if (atom->nmax > nmax) {
        memory->destroy(array_atom);
        nmax = atom->nmax;
        memory->create(array_atom, nmax, size_peratom_cols, "custom_disp:array_atom");
    }

    // reset the array_atom to 0
    for (int i = 0; i < atom->nlocal + atom->nghost; i++) {
        array_atom[i][0] = 0.0;
        array_atom[i][1] = 0.0;
        array_atom[i][2] = 0.0;
    }

    // pointer to the atom coordinates
    double **x = atom->x;
    
    // loop over all central atoms
    for (size_t i = 0; i < central_id.size(); i++) {
        int central_global_id = central_id[i];
        int central_local_id = atom->map(central_global_id);

        if (central_local_id < 0 || central_local_id >= atom->nlocal) continue; // skip if the central atom isn't owned by this processor

        // loop over all neighbors of the central atom
        double dx=0, dy=0, dz=0;
        double tmpx, tmpy, tmpz;
        int neighbor_count = 0;
        for (size_t j = 0; j < neighbor_id[i].size(); j++) {
            int neighbor_global_id = neighbor_id[i][j];
            int neighbor_local_id = atom->map(neighbor_global_id);

            if (neighbor_local_id < 0) {
                error->warning(FLERR, "Cannot find the local index of the neighbor atom");
                continue;
            }

            tmpx = x[central_local_id][0] - x[neighbor_local_id][0];
            tmpy = x[central_local_id][1] - x[neighbor_local_id][1];
            tmpz = x[central_local_id][2] - x[neighbor_local_id][2];
            domain->minimum_image(tmpx, tmpy, tmpz);
            dx += tmpx;
            dy += tmpy;
            dz += tmpz;
            neighbor_count++;
        }
        if (neighbor_count == 0) continue; // guard against division by zero if no neighbor was found on this processor
        dx /= neighbor_count;
        dy /= neighbor_count;
        dz /= neighbor_count;

        array_atom[central_local_id][0] = dx;
        array_atom[central_local_id][1] = dy;
        array_atom[central_local_id][2] = dz;
    }
}

To work around this, I stored the local indices in a vector during ComputeCustomDisp::init() and used them directly in ComputeCustomDisp::compute_peratom(), which seemed to solve the problem.
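Concretely, the workaround looks roughly like this (a minimal sketch; central_local and neighbor_local are member vectors I added for caching, and the cached indices only stay valid as long as the owning processor of each atom never changes):

void ComputeCustomDisp::init()
{
    read_file();
    atom->map_init();

    // cache the global-to-local index lookups once, instead of
    // calling atom->map() for every atom on every evaluation
    central_local.clear();
    neighbor_local.clear();
    for (size_t i = 0; i < central_id.size(); i++) {
        central_local.push_back(atom->map(central_id[i]));
        std::vector<int> locals;
        for (size_t j = 0; j < neighbor_id[i].size(); j++)
            locals.push_back(atom->map(neighbor_id[i][j]));
        neighbor_local.push_back(locals);
    }
}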

According to LAMMPS’s domain decomposition scheme, each processor owns the atoms in a certain region of the simulation box. Are atoms near a domain boundary reassigned to different processors as they move across it, or do they stay attached to a fixed processor throughout the simulation?

Could you offer any insight into why the original implementation slows the simulation down? If the set of atoms owned by a processor changes dynamically, my new solution will fail; is there a more effective way to handle this?

Best regards,
Denan

In general, you need to use a profiler to figure out which step in your code is the slowest. See this thread for an example of using profiler output to systematically optimise code:

Also note that LAMMPS provides a compute com/chunk that calculates the centre of mass of a “chunk” of atoms. If you can work out the chunking syntax (for example, by giving each group of atoms the same molecule ID), you can calculate the centre of mass of each group, “spread” that value back onto each atom, and then calculate each atom’s distance from its own chunk’s COM (which gives you each central atom’s displacement in the process).
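For instance, an untested sketch (assuming each central atom and its neighbours share a molecule ID, and glossing over how wrapped coordinates behave across periodic boundaries):

compute  cid   all chunk/atom molecule
compute  com   all com/chunk cid
compute  comat all chunk/spread/atom cid c_com[1] c_com[2] c_com[3]
variable dx    atom x-c_comat[1]
variable dy    atom y-c_comat[2]
variable dz    atom z-c_comat[3]
dump     1     all custom 100 disp.dump id v_dx v_dy v_dz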

Since you can write correct C++ code (which is more than I could when I started!), you may find it easier to copy and modify chunk computes to suit your needs instead of starting from scratch.


Please see the Programmer Guide section of the LAMMPS manual (or the recent LAMMPS paper). E.g. these sections:


I’ve read the documentation provided by @akohlmey, and it has been very helpful.

Following @srtee’s suggestion, I also ran a profiler and discovered that ComputeCustomDisp::init() is called at the start of every run command inside my loop, while the destructor is never called. Each time init() is invoked, read_file() reads the user-provided file and appends its contents to a vector, so the vector storing the neighbor list keeps growing. Since compute_peratom() iterates over every element of that vector, the simulation gets slower with each loop.

At first I thought init() would only be called once at the very beginning of the simulation, because the developer guide says that a compute’s init() performs one-time setup.

Thanks for your kind help!

It is a one-time initialization, but that is per run. The canonical place for a one-time initialization for the lifetime of a class is the constructor.
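For this compute, that would look roughly like the following sketch (peratom_flag and size_peratom_cols are standard Compute members; checking the atom map in init() is the usual idiom for a per-run requirement):

// constructor: executed exactly once, when the compute is defined in the input script
ComputeCustomDisp::ComputeCustomDisp(LAMMPS *lmp, int narg, char **arg) :
    Compute(lmp, narg, arg)
{
    peratom_flag = 1;
    size_peratom_cols = 3;
    read_file();    // one-time read of the neighbor list file
}

// init(): executed at the beginning of every run command
void ComputeCustomDisp::init()
{
    // only per-run checks belong here, e.g. verifying that an atom
    // map exists so atom->map() works in compute_peratom()
    if (atom->map_style == Atom::MAP_NONE)
        error->all(FLERR, "Compute custom/disp requires an atom map");
}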