Performance of fix atom/swap

Has anyone benchmarked the performance of “fix atom/swap”? I’m using it a lot, probably in a mode that’s different from what it’s intended for, because I’m doing pure MC, so there aren’t large numbers of MD steps between swap steps. I have a couple of other custom MC fixes (for positions and cell DOFs), and that code runs fine, but when I start spending about 20% of my moves on atom swaps, it slows down dramatically (by a factor of about 10, even though only 20% of my moves are swap moves). The other thing that’s odd is that it seems to get slower as the run progresses.

I’m going to investigate further, but has anyone ever checked the performance of atom/swap?

In a somewhat related question, can anyone explain why fix_atom_swap.cpp calls the full sequence of x2lamda, pbc, exchange, borders, lamda2x, pre_neighbor, and neighbor-build when it detects unequal cutoffs? Since it doesn’t move atoms, why all the calls to functions that have to do with assignments of atoms to processors? I would have thought that at most it would need a neighbor list rebuild, if those lists are dependent on specific atom-pair cutoffs.
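For reference, the sequence I’m referring to looks roughly like this (paraphrased, so the exact calls may differ between LAMMPS versions):

```
// roughly the logic in fix_atom_swap.cpp when unequal cutoffs are detected
if (unequal_cutoffs) {
  if (domain->triclinic) domain->x2lamda(atom->nlocal);
  domain->pbc();
  comm->exchange();
  comm->borders();
  if (domain->triclinic) domain->lamda2x(atom->nlocal + atom->nghost);
  if (modify->n_pre_neighbor) modify->pre_neighbor();
  neighbor->build(1);
}
```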

This is because LAMMPS is a parallel code using domain decomposition. When you change atom types that have different cutoffs, this can also affect which atoms are considered ghost atoms (when using multi-cutoff communication), so you need all of those steps to ensure that all possible neighbors have been communicated and their properties updated before the neighbor lists are rebuilt.

Bottom line: turning LAMMPS into an MC code is going to cause all kinds of performance issues, exactly because of the domain decomposition. There is not much you can do about it except to implement your own MC “driver” that does not do domain decomposition but uses the same data structures as LAMMPS, so you can “recycle” LAMMPS code where domain decomposition is not relevant.

Thanks for the explanation. It’s not necessarily a showstopper for me, because I’d be happy to run without using any MPI parallelization in LAMMPS, which would presumably make many of those function calls much faster. I’m investigating the slowdown further, because even if LAMMPS does have to do all these things, the slowdown seems excessive. If I have more specific info, I’ll follow up.

That is not what I mean. Even if you compile LAMMPS without an MPI library, LAMMPS will substitute a stub library. LAMMPS is very much designed around using MPI and you cannot avoid it, even in a serial compilation: LAMMPS will always collect data into communication buffers and use copies of atoms to implement periodic boundaries. That is why LAMMPS is not subject to the minimum image convention. When you write an MD or MC code that does not use domain decomposition and where relying on the minimum image convention is not an issue, many of the steps that LAMMPS takes to handle domain decomposition efficiently across large numbers of MPI ranks can be avoided or done differently. E.g. lots of data does not have to be tied to individual atoms but could be stored in global lists.

I suggest you have a look at the new LAMMPS paper, where the core design of LAMMPS is explained.

All I meant was that I’d expect the overhead to be smaller (not zero), because the MPI stub library would probably be faster than actual MPI communication, perhaps to the point that it would be fast enough for me. That’s all.

I don’t think so. In the serial case there is no communication happening either way, just copying of buffers, if that. I would expect any decent MPI library to do mostly the same in the “no-communication” single-rank scenario. I would only expect a significant difference in the MPI_Init() function, which is a one-time operation.

Good point - I didn’t really mean to emphasize MPI stubs vs. actual MPI. I was really contrasting single-process runs (which don’t require actual interprocess communication, just buffer copying, whether with MPI stubs or a well-optimized MPI) with multi-process MPI runs.

In any case, I’m currently trying to measure just how much slower atom/swap steps are compared to MD steps or my own position/cell MC steps (which are fast enough for me, whether or not they are a lot slower than MD steps). My preliminary impression is that it’s a very large factor, but I’m trying to measure it more carefully.

Looks like the presence of the “fix atom/swap” steps causes all other LAMMPS steps to slow down. My code in general slows down a bit as it progresses, but I think that’s just because the system density goes up and so does the number of neighbors. However, when I start using atom/swap steps, all the other steps slow down a lot (by as much as a factor of 10, and it looks like it continues to get worse).

Does anyone have suggestions on how to figure out what within LAMMPS is slowing down, e.g. using gprof or some other profiler?

In particular, gprof’s total time is much less than the actual walltime when I use atom/swap steps, but it’s pretty close when I don’t. As a result, I’m not sure where the time is being spent.

gprof is dependent on having instrumented code (and you won’t get info about non-instrumented or inlined code).

As a first step I usually use perf with the record and then the report option. You can also observe running processes with perf top.

Thanks. For gprof I compiled both LAMMPS as a library and my own code with -pg, and also switched to using a static library, but it wasn’t enough to account for the missing time.

I’m playing with operf now, and I’ll try perf next. operf is reporting a lot of samples in int_mallinfo (the top routine, at 60% of the total samples), but I can’t get any meaningful call-graph info about it. I’m assuming that it’s being called many more times, rather than each call being much slower, but that’s just speculation. I don’t see any sign that it can be called by LAMMPS explicitly except through specific informational commands (which I don’t see atom/swap calling), but maybe I missed something.

perf is giving the same results (nearly 60% of the time in int_mallinfo), but I can’t figure out what might be calling it.

malloc()?

Can you produce a minimal input that triggers the large time usage of int_mallinfo but doesn’t require any custom software?

I tried to reproduce the increasing time usage with pure LAMMPS, but haven’t succeeded so far. I’ll play with it a bit more, if nothing else to figure out whether there’s still a bug in the other custom fixes I created (which are being interspersed with atom/swap in my normal usage, but not in the pure LAMMPS test). I’ll post more if/when I figure anything out.

I was actually able to get my own atom swapping fix to work, once I figured out that I need to use things like force_reneighbor and next_reneighbor. It’s definitely more limited than atom/swap, but it’s sufficient for my use case, at least for now.
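In case it’s useful to anyone else, the reneighboring bookkeeping I had to get right is roughly the following. This is only a stripped-down sketch, not my actual fix: the class name MyMCSwap is a placeholder, the swap/acceptance logic is omitted, and nevery would normally be parsed from the fix arguments.

```
// Sketch of the reneighboring bookkeeping a custom MC swap fix needs.
#include "fix.h"
#include "update.h"

using namespace LAMMPS_NS;
using namespace FixConst;

class MyMCSwap : public Fix {
 public:
  MyMCSwap(class LAMMPS *lmp, int narg, char **arg) : Fix(lmp, narg, arg) {
    // pre_exchange() is only invoked on reneighboring steps, so the fix must
    // force reneighboring and schedule the first swap step itself
    force_reneighbor = 1;
    next_reneighbor = update->ntimestep + nevery;
  }

  int setmask() override { return PRE_EXCHANGE; }

  void pre_exchange() override {
    // ... attempt swap(s), accept/reject via Metropolis, update atom types ...
    // schedule the next swap step so pre_exchange() gets called again
    next_reneighbor = update->ntimestep + nevery;
  }
};
```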