I am currently running hybrid GCMC/MD simulations to calculate nitrogen adsorption isotherms in carbonaceous frameworks. My setup utilizes fix gcmc with the mol keyword to insert N_2 molecules, alongside pair_style lj/cut/coul/long and kspace_style pppm.
I have encountered a severe performance bottleneck when attempting to accelerate these runs using the GPU package. Because fix gcmc mol requires excluding the intramolecular non-bonded interactions within the newly inserted molecules, it effectively creates a neigh_modify exclude condition. As a result, LAMMPS does not allow the neighbor list to be built on the GPU.
The only workaround I have found is to force the neighbor list build on the CPU using package gpu 1 neigh no. However, this introduces constant memory transfer overhead between the CPU host and the GPU device. Ultimately, this communication bottleneck makes the GPU-accelerated hybrid setup significantly slower than running it purely on CPUs.
I would like to ask the community:
Is there any known workaround, undocumented package gpu setting, or kspace_style alternative that allows neighbor lists to remain on the GPU when utilizing molecular templates in GCMC?
Is this a fundamental limitation of handling intramolecular exclusions and dynamic particle counts on the device, or are there plans to support this in future releases?
Are there any alternative computational strategies you would recommend to mitigate this communication overhead and speed up molecular grand canonical ensembles on GPU nodes?
(Note: I am currently using the 2025 version of LAMMPS).
Any insights or advice on optimizing these runs would be greatly appreciated.
@AndersonArboledaLamu Most of your questions cannot be easily answered since there is too much context missing. For example:
What kind of GPU/CPU hardware do you have?
How large is your system and what cutoffs do you use (how many atoms, what density)?
How much GPU acceleration do you get when running plain MD without fix gcmc? How is the acceleration if you turn off the GPU neighbor lists for this case?
Have you confirmed that you must use a long-range electrostatics solver?
The main issue is non-trivial to resolve. The neighbor list code in the GPU package is optimized for speed and not for complexity (CPUs are better at that, anyway) and thus it is missing this feature.
Running hybrid MC/MD simulations in a code like LAMMPS that is built around domain decomposition is never going to be efficient, and that is even worse when using long-range Coulomb since you need to re-compute the whole reciprocal space contribution for every MC move. Since you need to do the MC steps infrequently, you could think breaking your run into many small chunks and turn off fix gcmc (and GPU neighbor lists on) while you run MD and then run an MC iteration on the CPU and switch back to running MD on the GPU.
Thanks for getting back to me and explaining the underlying limitations of the GPU package regarding neighbor lists and complexity. To answer your questions:
Hardware: I am using the resources of the Materials and Molecular Modelling (MMM) Hub. For GPU runs, I use nodes with dual AMD EPYC 7543 32-core CPUs and an Nvidia A100 (40GB). For pure CPU scaling, I use Intel Xeon Gold 6248 (2.50GHz) nodes.
System Size & Performance Baseline: My baseline framework has a density of ~1 g/cc. For a smaller system (8,000 atoms, 14.0 Å cutoff), running pure CPU on 120 cores gives me about 25 ns/day. However, when I scale up to a massive 80,000 atom system and increase the number of cores to 960, it drops to a painful 2 ns/day. Normally, both set up with the smallest and largest system provide the same performance in ns/day.
Interestingly, when I run similar adsorption simulations using Argon instead of Nitrogen, a single A100 GPU + 1 CPU core gives me around 150 ns/day. Because Argon is monatomic, it doesn’t trigger the fix gcmc mol intramolecular exclusions, meaning the GPU neighbor lists work perfectly. The bottleneck only appears when I switch to N2 molecules and the neighbor lists are forced back to the CPU.
Long-Range Electrostatics: I definitely need the long-range solver. I tested pure cutoffs, but the sharp truncation artificially tanks the GCMC insertion acceptance ratio. The PPPM solver is required to get realistic thermodynamics for this framework.
Your suggestion to break up the run is actually a great idea. I could write a loop in my input script that turns off fix gcmc (and turns GPU neighbor lists back on) for a block of pure MD, and then toggles back for the MC iterations.
I have attached my current scripts so you can see the setup. Do you have any specific recommendations on the best way to script that MD/MC loop to keep the overhead as low as possible?
This is about 66 atoms per MPI rank. That is well beyond the strong scaling limit of LAMMPS for CPUs. On a GPU you need 10,000s of atoms per GPU for decent utilization.
When I turn off fix gcmc and other useless stuff like load balancing and tiled decomposition etc. in your example input deck and run on my local machine (a mini PC with an 8-core mobile CPU)
I get the following numbers:
log.1:Loop time of 15.3312 on 1 procs for 10000 steps with 8378 atoms
log.1-Performance: 56.356 ns/day, 0.426 hours/ns, 652.266 timesteps/s, 5.465 Matom-step/s
--
log.2:Loop time of 8.96525 on 2 procs for 10000 steps with 8378 atoms
log.2-Performance: 96.372 ns/day, 0.249 hours/ns, 1115.418 timesteps/s, 9.345 Matom-step/s
--
log.4:Loop time of 5.49531 on 4 procs for 10000 steps with 8378 atoms
log.4-Performance: 157.225 ns/day, 0.153 hours/ns, 1819.733 timesteps/s, 15.246 Matom-step/s
--
log.8:Loop time of 3.79417 on 8 procs for 10000 steps with 8378 atoms
log.8-Performance: 227.718 ns/day, 0.105 hours/ns, 2635.623 timesteps/s, 22.081 Matom-step/s
You can see that already here the parallel scaling is not ideal. Already with 8 processors it is down to about 50% parallel efficiency. This one is particularly challenging, since you exclude all pairwise interactions of the framework atoms, so your performance is effectively only the PPPM performance (which means it is completely pointless to use the GPU package for this).
That is still considered a tiny system that can run well on a single 32-core CPU.
A massive system would have millions, if not billions of atoms.
Because both are similarly far into the limit of strong parallel scaling.
I would expect that at this level, you would get better performance with fewer CPU cores and also I would try to use a hybrid OpenMP/MPI approach to reduce the communication bandwidth demand. With such small systems and large number of processors your individual subdomains will mostly consist of ghost atoms and thus there communication time is likely substantial. Also the 3d-FFTs of a PPPM long-range solver will have a lot of cost at this setup.
This is an incorrect interpretation. The major difference is that a simulation with a pair-wise additive cutoff potential allows a massively more efficient way to implement MC insertions or removals or displacements. LAMMPS has an option to compute interactions for single pairs with the Pair::single() method. Thus you only need to look at all interacting pairs and you can sum over the interactions and directly compute how the energy changes. With a non-pairwise additive potential, you need to do a full evaluation of the interactions, including the costly long-range solver. This needs the exclusions since you need to do a before and after computation and because of neighbor list rebuilds, you have to have both the before and after atoms/molecules in the geometry and then just selectively exclude them from the computation.
With the GPU package you have the additional complication that the GPU acceleration is predominantly available for the non-bonded pair wise computations.
There are compensated smooth coulomb variants like /wolf or /dsf. Also, you can afford to crank up the cutoff, especially when using a GPU (you need to increase the neighbor list page size for that, though).
As already mentioned, before even looking at GCMC, you need to get good performance without. I would be helpful to do this with a system that is at least partially filled with N2 molecules. There are too many details in your input that don’t make sense to me to list and explain them all. You need to talk to somebody (local!) that knows about parallel performance and benchmarking and can tutor you. I don’t have the time to give you a personal training on this. If you search through the archives, you may find some older posts here where I and others are discussing similar things.
Could you share the script you modified to run this simulation using MPI? I’d really appreciate it, as I haven’t managed to achieve the performance you’ve achieved.
I’m developing a script to implement your approach of turning off the GCMC fix when running MD. So far, it seems to speed up the simulation incredibly when using a 1:1 CPU-to-GPU ratio. However, I’d like to explore the use of hybrid MPI and OpenMPI.
I truly appreciate your help with this matter and the performance breakdown you provided. Following your advice, I split the simulation into two separate scripts to isolate the GPU-accelerated MD from the CPU-bound GCMC. The pure MD script runs great (around 50 ns/day).
For the GCMC step (2000 exchanges), I scaled the framework execution down to 10 cores to avoid the strong-scaling penalties you mentioned. However, the 3D-FFTs for the PPPM solver are still taking an excessive amount of time during the full_energy evaluations, between 25–60 seconds for the GCMC step, and should be call around 600 times during the entered MD-GCMD.
You previously suggested trying smooth coulomb variants like /wolf or /dsf, or increasing the straight cutoff to avoid PPPM entirely. I benchmarked both options, but ran into severe limitations with this specific framework:
Straight Cutoff (24 Å): Dropping PPPM and using a large straight cutoff along with lj/cut/coul/cut, my equilibrium nitrogen insertions exactly in half (100 molecules vs. the 200 captured by PPPM), indicating that the infinite long-range lattice tails are physically necessary to capture the correct loading.
Smooth Variants (coul/wolf and coul/dsf): When I switched to Wolf and DSF to avoid the K-space mesh, the GCMC step slowed down drastically, exploding to 300 seconds per step. Because fix gcmc forces a full_energy evaluation for molecule templates, calculating the damped pair-by-pair interactions in real-space for every trial position choked the CPUs much worse than the PPPM grid assignments.
Since the physics require an accurate long-range description like PPPM, and the GCMC molecular exclusions lock me out of the GPU package for this style, do you have any recommendations on how to handle the PPPM grid communication or mesh settings more efficiently on the CPU side? I have experimented with OpenMP to reduce communication bandwidth, but the runtime remains heavily choked in the K-space steps.
I would really appreciate any further insights you might have.
There is not much else that is worth thinking about.
since your systems are relatively small, you could try using ewald instead of pppm, the support for OpenMP should be a bit better, but the worse scaling can easily eat up this benefit
you could look for a different MC code that may have better performance or GPU acceleration than LAMMPS. LAMMPS was not designed for this kind of task.
The real “killer” is that you cancel the most of the pairwise interactions. Since you worry about accuracy, you should be aware that applying neighbor list exclusions does not apply to the Kspace calculations. Those force and energy contributions are always computed and thus your setup is actually not very accurate because of the exclusions.