GPU performance bottleneck: fix gcmc mol forcing CPU neighbor list builds

Dear all,

I am currently running hybrid GCMC/MD simulations to calculate nitrogen adsorption isotherms in carbonaceous frameworks. My setup utilizes fix gcmc with the mol keyword to insert N_2 molecules, alongside pair_style lj/cut/coul/long and kspace_style pppm.

I have encountered a severe performance bottleneck when attempting to accelerate these runs using the GPU package. Because fix gcmc mol requires excluding the intramolecular non-bonded interactions within the newly inserted molecules, it effectively creates a neigh_modify exclude condition. As a result, LAMMPS does not allow the neighbor list to be built on the GPU.

The only workaround I have found is to force the neighbor list build on the CPU using package gpu 1 neigh no. However, this introduces constant memory transfer overhead between the CPU host and the GPU device. Ultimately, this communication bottleneck makes the GPU-accelerated hybrid setup significantly slower than running it purely on CPUs.

I would like to ask the community:

  1. Is there any known workaround, undocumented package gpu setting, or kspace_style alternative that allows neighbor lists to remain on the GPU when utilizing molecular templates in GCMC?

  2. Is this a fundamental limitation of handling intramolecular exclusions and dynamic particle counts on the device, or are there plans to support this in future releases?

  3. Are there any alternative computational strategies you would recommend to mitigate this communication overhead and speed up molecular grand canonical ensembles on GPU nodes?

(Note: I am currently using the 2025 version of LAMMPS).

Any insights or advice on optimizing these runs would be greatly appreciated.

Best regards,

Anderson Arboleda

PhD Candidate in Physics

King’s College London

@AndersonArboledaLamu Most of your questions cannot be easily answered since there is too much context missing. For example:

  • What kind of GPU/CPU hardware do you have?
  • How large is your system and what cutoffs do you use (how many atoms, what density)?
  • How much GPU acceleration do you get when running plain MD without fix gcmc? How is the acceleration if you turn off the GPU neighbor lists for this case?
  • Have you confirmed that you must use a long-range electrostatics solver?

The main issue is non-trivial to resolve. The neighbor list code in the GPU package is optimized for speed and not for complexity (CPUs are better at that, anyway) and thus it is missing this feature.
Running hybrid MC/MD simulations in a code like LAMMPS that is built around domain decomposition is never going to be efficient, and that is even worse when using long-range Coulomb since you need to re-compute the whole reciprocal space contribution for every MC move. Since you need to do the MC steps infrequently, you could think breaking your run into many small chunks and turn off fix gcmc (and GPU neighbor lists on) while you run MD and then run an MC iteration on the CPU and switch back to running MD on the GPU.

Hi @akohlmey,

Thanks for getting back to me and explaining the underlying limitations of the GPU package regarding neighbor lists and complexity. To answer your questions:

Hardware: I am using the resources of the Materials and Molecular Modelling (MMM) Hub. For GPU runs, I use nodes with dual AMD EPYC 7543 32-core CPUs and an Nvidia A100 (40GB). For pure CPU scaling, I use Intel Xeon Gold 6248 (2.50GHz) nodes.

System Size & Performance Baseline: My baseline framework has a density of ~1 g/cc. For a smaller system (8,000 atoms, 14.0 Å cutoff), running pure CPU on 120 cores gives me about 25 ns/day. However, when I scale up to a massive 80,000 atom system and increase the number of cores to 960, it drops to a painful 2 ns/day. Normally, both set up with the smallest and largest system provide the same performance in ns/day.

Interestingly, when I run similar adsorption simulations using Argon instead of Nitrogen, a single A100 GPU + 1 CPU core gives me around 150 ns/day. Because Argon is monatomic, it doesn’t trigger the fix gcmc mol intramolecular exclusions, meaning the GPU neighbor lists work perfectly. The bottleneck only appears when I switch to N2​ molecules and the neighbor lists are forced back to the CPU.

Long-Range Electrostatics: I definitely need the long-range solver. I tested pure cutoffs, but the sharp truncation artificially tanks the GCMC insertion acceptance ratio. The PPPM solver is required to get realistic thermodynamics for this framework.

Your suggestion to break up the run is actually a great idea. I could write a loop in my input script that turns off fix gcmc (and turns GPU neighbor lists back on) for a block of pure MD, and then toggles back for the MC iterations.

I have attached my current scripts so you can see the setup. Do you have any specific recommendations on the best way to script that MD/MC loop to keep the overhead as low as possible?

Thanks again for the insights!

GCMC.inp (5.9 KB)

Framework_Full.dat (749.8 KB)

Nitrogen.mol (185 Bytes)

This is about 66 atoms per MPI rank. That is well beyond the strong scaling limit of LAMMPS for CPUs. On a GPU you need 10,000s of atoms per GPU for decent utilization.

When I turn off fix gcmc and other useless stuff like load balancing and tiled decomposition etc. in your example input deck and run on my local machine (a mini PC with an 8-core mobile CPU)
I get the following numbers:

log.1:Loop time of 15.3312 on 1 procs for 10000 steps with 8378 atoms
log.1-Performance: 56.356 ns/day, 0.426 hours/ns, 652.266 timesteps/s, 5.465 Matom-step/s
--
log.2:Loop time of 8.96525 on 2 procs for 10000 steps with 8378 atoms
log.2-Performance: 96.372 ns/day, 0.249 hours/ns, 1115.418 timesteps/s, 9.345 Matom-step/s
--
log.4:Loop time of 5.49531 on 4 procs for 10000 steps with 8378 atoms
log.4-Performance: 157.225 ns/day, 0.153 hours/ns, 1819.733 timesteps/s, 15.246 Matom-step/s
--
log.8:Loop time of 3.79417 on 8 procs for 10000 steps with 8378 atoms
log.8-Performance: 227.718 ns/day, 0.105 hours/ns, 2635.623 timesteps/s, 22.081 Matom-step/s

You can see that already here the parallel scaling is not ideal. Already with 8 processors it is down to about 50% parallel efficiency. This one is particularly challenging, since you exclude all pairwise interactions of the framework atoms, so your performance is effectively only the PPPM performance (which means it is completely pointless to use the GPU package for this).

That is still considered a tiny system that can run well on a single 32-core CPU.
A massive system would have millions, if not billions of atoms.

Because both are similarly far into the limit of strong parallel scaling.
I would expect that at this level, you would get better performance with fewer CPU cores and also I would try to use a hybrid OpenMP/MPI approach to reduce the communication bandwidth demand. With such small systems and large number of processors your individual subdomains will mostly consist of ghost atoms and thus there communication time is likely substantial. Also the 3d-FFTs of a PPPM long-range solver will have a lot of cost at this setup.

This is an incorrect interpretation. The major difference is that a simulation with a pair-wise additive cutoff potential allows a massively more efficient way to implement MC insertions or removals or displacements. LAMMPS has an option to compute interactions for single pairs with the Pair::single() method. Thus you only need to look at all interacting pairs and you can sum over the interactions and directly compute how the energy changes. With a non-pairwise additive potential, you need to do a full evaluation of the interactions, including the costly long-range solver. This needs the exclusions since you need to do a before and after computation and because of neighbor list rebuilds, you have to have both the before and after atoms/molecules in the geometry and then just selectively exclude them from the computation.
With the GPU package you have the additional complication that the GPU acceleration is predominantly available for the non-bonded pair wise computations.

There are compensated smooth coulomb variants like /wolf or /dsf. Also, you can afford to crank up the cutoff, especially when using a GPU (you need to increase the neighbor list page size for that, though).

As already mentioned, before even looking at GCMC, you need to get good performance without. I would be helpful to do this with a system that is at least partially filled with N2 molecules. There are too many details in your input that don’t make sense to me to list and explain them all. You need to talk to somebody (local!) that knows about parallel performance and benchmarking and can tutor you. I don’t have the time to give you a personal training on this. If you search through the archives, you may find some older posts here where I and others are discussing similar things.

Dear, Axel,

Could you share the script you modified to run this simulation using MPI? I’d really appreciate it, as I haven’t managed to achieve the performance you’ve achieved.

I’m developing a script to implement your approach of turning off the GCMC fix when running MD. So far, it seems to speed up the simulation incredibly when using a 1:1 CPU-to-GPU ratio. However, I’d like to explore the use of hybrid MPI and OpenMPI.

Best regards,

Anderson

here are the changes I made:

--- /home/akohlmey/Downloads/GCMC.inp	2026-05-29 10:49:08.989628892 -0400
+++ GCMC.inp	2026-05-29 10:48:45.321746484 -0400
@@ -11,11 +11,11 @@
 pair_style      lj/cut/coul/long 14.0
 pair_modify     mix arithmetic tail yes
 kspace_style    pppm 1.0e-4
-kspace_modify   mesh 24 24 24
+#kspace_modify   mesh 24 24 24
 bond_style      harmonic
 special_bonds   lj/coul 0.0 0.0 0.0
 
-comm_style      tiled
+#comm_style      tiled
 
 # ----------------- GEOMETRY & TYPES -----------------
 # Reserve memory for 2 gas types, plus memory for the fake bonds
@@ -75,7 +75,7 @@
 fix 		myrigid gas rigid/small molecule mol n2 langevin 77.0 77.0 0.1 482749
 
 
-thermo          1000
+thermo          100
 thermo_style    custom time step pe ke etotal temp vol density press enthalpy ecouple econserve f_myrigid c_T_gas vol press v_N_N2 v_q_mmol_g v_q_cc_g
 thermo_modify   temp T_gas
 
@@ -93,19 +93,19 @@
 variable        phi_current equal ${phi_target}
 
 # Create a unique log file for each pressure point
-log             log_N2_77K_${P_current}bar.txt
+#log             log_N2_77K_${P_current}bar.txt
 
 # ----------------- TARGET GCMC ----------------
 # Atom type kept at 0 per explicit instruction. Added fugacity_coeff keyword.
-fix             mygcmc gas gcmc 500 100 0 0 84729 77.0 0.0 0.5 mol n2 rigid myrigid pressure ${P_current} fugacity_coeff ${phi_current}
-fix             20 all balance 50000 1.05 shift xyz 10 1.05 weight time 1.0
+#fix             mygcmc gas gcmc 500 100 0 0 84729 77.0 0.0 0.5 mol n2 rigid myrigid pressure ${P_current} fugacity_coeff ${phi_current}
+#fix             20 all balance 50000 1.05 shift xyz 10 1.05 weight time 1.0
 
 # ----------------- SETUP DUMP -----------------
 dump            1 all custom 10000 adsorption_N2_77K_${P_current}bar.lammpstrj id type q x y z ix iy iz vx vy vz fx fy fz
 dump_modify     1 sort id
 
 # ================= PHASE 1: EQUILIBRATION =================
-if "${P_current} == 1.0069782782e-10" then "run 5000000" else "run 4000000"
+#if "${P_current} == 1.0069782782e-10" then "run 5000000" else "run 4000000"
 
 write_data      Equilibration_N2_${P_current}bar.dat
 write_restart   RE_Equilibration_N2_${P_current}bar
@@ -113,7 +113,7 @@
 # ================= PHASE 2: PRODUCTION & AVERAGING ========
 fix             prod_ave gas ave/time 500 200 100000 v_q_mmol_g v_q_cc_g c_T_gas ave one file production_N2_data_${P_current}bar.txt
 
-run             1000000
+run             10000
 
 # ================= EXTRACT & SAVE RESULTS =================
 variable        final_mmol equal f_prod_ave[1]
@@ -128,15 +128,15 @@
 
 # ================= CLEANUP FOR NEXT PRESSURE =================
 # Unfix local loop constraints while keeping the rigid dynamics active
-unfix           mygcmc
-unfix           prod_ave
-unfix		20
+#unfix           mygcmc
+#unfix           prod_ave
+#unfix		20
 undump          1
 
 # Advance BOTH indexes and restart the loop
-next            p_target phi_target
-jump            SELF isotherm_loop
+#next            p_target phi_target
+#jump            SELF isotherm_loop
 
 # ----------------- END OF SCRIPT --------------------
-log             log.lammps
+#log             log.lammps
 print           "I S O T H E R M   C O M P L E T E"

Dear Axel,

I truly appreciate your help with this matter and the performance breakdown you provided. Following your advice, I split the simulation into two separate scripts to isolate the GPU-accelerated MD from the CPU-bound GCMC. The pure MD script runs great (around 50 ns/day).

For the GCMC step (2000 exchanges), I scaled the framework execution down to 10 cores to avoid the strong-scaling penalties you mentioned. However, the 3D-FFTs for the PPPM solver are still taking an excessive amount of time during the full_energy evaluations, between 25–60 seconds for the GCMC step, and should be call around 600 times during the entered MD-GCMD.

You previously suggested trying smooth coulomb variants like /wolf or /dsf, or increasing the straight cutoff to avoid PPPM entirely. I benchmarked both options, but ran into severe limitations with this specific framework:

  1. Straight Cutoff (24 Å): Dropping PPPM and using a large straight cutoff along with lj/cut/coul/cut, my equilibrium nitrogen insertions exactly in half (100 molecules vs. the 200 captured by PPPM), indicating that the infinite long-range lattice tails are physically necessary to capture the correct loading.
  2. Smooth Variants (coul/wolf and coul/dsf): When I switched to Wolf and DSF to avoid the K-space mesh, the GCMC step slowed down drastically, exploding to 300 seconds per step. Because fix gcmc forces a full_energy evaluation for molecule templates, calculating the damped pair-by-pair interactions in real-space for every trial position choked the CPUs much worse than the PPPM grid assignments.

Since the physics require an accurate long-range description like PPPM, and the GCMC molecular exclusions lock me out of the GPU package for this style, do you have any recommendations on how to handle the PPPM grid communication or mesh settings more efficiently on the CPU side? I have experimented with OpenMP to reduce communication bandwidth, but the runtime remains heavily choked in the K-space steps.

I would really appreciate any further insights you might have.

Best regards,

Anderson

in.gcmc_step (2.2 KB)

Nitrogen.mol (185 Bytes)

Framework_Full.dat (749.8 KB)

There is not much else that is worth thinking about.

  • since your systems are relatively small, you could try using ewald instead of pppm, the support for OpenMP should be a bit better, but the worse scaling can easily eat up this benefit
  • you could look for a different MC code that may have better performance or GPU acceleration than LAMMPS. LAMMPS was not designed for this kind of task.

The real “killer” is that you cancel the most of the pairwise interactions. Since you worry about accuracy, you should be aware that applying neighbor list exclusions does not apply to the Kspace calculations. Those force and energy contributions are always computed and thus your setup is actually not very accurate because of the exclusions.