Gpu+mpirun error: Too many neighbors on GPU. Use neigh_modify one to increase limit

…and here is another suggestion for a workaround. When LAMMPS suggests increasing the size reserved for a single neighbor list, it is only giving the advice that fits the most common cause of this kind of error. But those errors are common for dense systems, while yours is sparse. That makes a much less common scenario possible: the size of the neighbor list bins is too small. This is set by default to half of the cutoff, but due to the sparsity of your system, this is not as efficient, nor is it required to avoid overflows and inefficiencies.

If I run your input with the flags -pk gpu 0 binsize 12.0 appended, the errors seem to go away (at least the immediate ones).
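As a rough sketch of what such a launch line could look like: the rank count, the `lmp` executable name, the `-sf gpu` suffix flag, and the input file name `in.sparse` are assumptions for illustration; only the `-pk gpu 0 binsize 12.0` part is taken from the post above. The snippet is a dry run that prints the command instead of starting LAMMPS.

```shell
# Dry run: assemble and print the launch line rather than executing it.
NP=10                               # number of MPI ranks (assumed)
INPUT=in.sparse                     # hypothetical input file name
PK_FLAGS="-pk gpu 0 binsize 12.0"   # package flags quoted in the post above
CMD="mpirun -np $NP lmp -sf gpu $PK_FLAGS -in $INPUT"
echo "$CMD"
```

Replacing `echo "$CMD"` with `$CMD` would actually start the run.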

Here is a quick summary of what has transpired during this discussion:

  • having a sparse system makes it more difficult to get good GPU acceleration, especially when you have only a single GPU and many CPU cores.
  • in such a scenario, it is usually much more efficient to use GPU acceleration only for the pair style (where it pays off the most) and to run pppm completely on the CPU and in parallel. you can append “-pk gpu 0 pair/only on” to select that mode.
  • you can use the balance command to optimize the subdomain division and also minimize the risk of having subdomains without atoms (we have added a bugfix to a pending pull request for this scenario)
  • you can improve load balancing for sparse systems by switching to “tiled” communication (instead of the default and more efficient “brick” scheme) and then using “balance 1.0 rcb” to create subdomains with recursive bisection. this method guarantees a similar number of atoms per subdomain. this is particularly important and beneficial when running on the CPU.
  • if increasing the neighbor list “one” parameter doesn’t make a difference, you can try to increase the “binsize” parameter for the GPU neighbor lists instead. it defaults to half the largest pair style cutoff.
  • sometimes switching between OpenCL and CUDA can make a difference
  • if you cannot work around a GPU neighbor list issue, you can try using the CPU generated neighbor lists by appending “neigh no” to the “-pk gpu” flag.
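The options from this list could be tried roughly as follows. This is only a sketch: the file names `in.balance_rcb` and `in.sparse`, the rank count, and the `lmp` executable name are hypothetical. The input fragment is meant to be pulled in with `include` after the simulation box has been defined (for example after read_data), and the launch lines are printed as a dry run. Note that the package gpu keyword for CPU-built neighbor lists is spelled `neigh`.

```shell
# Input-script fragment for the load-balancing bullet (hypothetical file name).
cat > in.balance_rcb <<'EOF'
comm_style tiled    # the rcb balancing style requires tiled communication
balance 1.0 rcb     # one-time recursive bisection into subdomains
EOF

# Dry run of two "-pk gpu" variants from the list above.
BASE="mpirun -np 10 lmp -sf gpu"                       # assumed launch prefix
PAIR_ONLY="$BASE -pk gpu 0 pair/only on -in in.sparse" # pair style on GPU, pppm on CPU
CPU_NEIGH="$BASE -pk gpu 0 neigh no -in in.sparse"     # neighbor lists built on CPU
echo "$PAIR_ONLY"
echo "$CPU_NEIGH"
```

The keywords can also be combined on a single `-pk gpu` flag, for example `-pk gpu 0 binsize 12.0 pair/only on`.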

Thank you for the summary; it is very useful for my simulations.
When I use “balance 1.0 shift xyz 10 1.0” after the read_data command, the “mpirun -np 10 lmp -sf gpu -pk gpu 2 -in run4.txt” works without errors.

But after 5 minutes, I got the error “Gpu+mpirun error: Too many neighbors on GPU. Use neigh_modify one to increa…”. I attached the 3 files here in case anyone wants to experiment. (https://drive.google.com/drive/folders/15TXqLY__15G7sp42N2vnsvYKkIKWB44j?usp=sharing)
Then I used “mpirun -np 10 lmp -sf gpu -pk gpu 2 binsize 12.0 -in run4.txt”, and the “too many neighbors on GPU” error went away. So I believe that appending “binsize 12.0”, or using the CPU generated neighbor lists by appending “neigh no” to the “-pk gpu” flag, will solve this problem.

I think I will compile with -DGPU_API=cuda for my 2 GPUs.

Thank you for all those super helpful techniques!

One more note. Using the balance command will readjust the subdomains only once. As your molecules and atoms are moving around, that optimum will change. For that purpose, there is also the fix balance command (see the LAMMPS documentation), which invokes the balance command periodically. You may want to do this even if you don’t have any issues with the neighbor list error messages, as it should improve parallel efficiency.
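A minimal sketch of such a periodic rebalancing line is shown here. The fix ID `rebalance`, the interval of 1000 steps, the threshold of 1.1, and the file name `in.fix_balance` are illustrative assumptions, not values from this thread; the shift settings mirror the one-time balance command quoted earlier.

```shell
# Write a one-line input fragment (hypothetical file name) to include after read_data.
cat > in.fix_balance <<'EOF'
# every 1000 steps, rebalance if the imbalance factor exceeds 1.1
fix rebalance all balance 1000 1.1 shift xyz 10 1.1
EOF
cat in.fix_balance
```

With tiled communication, the `shift xyz 10 1.1` part could be swapped for `rcb`.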

Another point to try out to optimize performance is to change the cutoff for coulomb interactions. That will not change the total accuracy, only the balance between work done on the GPU and on the CPU. Due to the particular nature of your input geometry, typical values may be less efficient and, in particular, you can afford a much longer cutoff since your workload does not increase as much. The workload for the PPPM algorithm, in contrast, is independent of the number of atoms and depends on the volume and the number of grid points. For large cutoffs you will have to increase the “neigh_modify one” and “neigh_modify page” settings, though. With “pair/only on” your GPU calculation will be run in parallel with the CPU calculation and thus you can give more work to the GPU, even if it would be inefficient on its own, since it comes “for free” for as long as it is done before you are done with the CPU work. So the goal would be to reduce the CPU work (a longer real-space cutoff lets PPPM reach the same accuracy with a coarser grid), which is done by increasing the GPU work through increasing the cutoff until the two balance.
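The input lines involved might look like the sketch below. The pair style `lj/cut/coul/long`, both cutoff values, the neighbor list sizes, and the file name `in.long_cutoff` are assumptions for illustration; the actual pair style and sensible cutoffs depend on the force field in use, and the best cutoff has to be found by timing a few short runs.

```shell
# Lines to adapt in the input script (written to a hypothetical file for reference).
cat > in.long_cutoff <<'EOF'
pair_style   lj/cut/coul/long 10.0 16.0   # assumed: LJ cutoff 10.0, longer coulomb cutoff 16.0
neigh_modify one 10000 page 100000        # page must be at least 10x the one setting
EOF
cat in.long_cutoff
```

Comparing the Pair and Kspace entries in the timing breakdown that LAMMPS prints at the end of each run is one way to judge whether GPU and CPU work are balanced.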

Lastly, I solvated the system in an electrolyte (H2O and NaCl). Since the system is not sparse anymore (density = 0.03 initially, 1.05 after NPT, in real units), I no longer got “Segmentation fault (11)” or “signal aborted (6)”.
Eventually the simulation finished. I really thank you for the summaries and details that helped me finish the simulation. The run command I use is mpirun -np 20 lmp -sf gpu -pk gpu 2 binsize 12.0 pair/only on -in run0.txt. I use 2 GPUs and 20 CPUs, with OMP_NUM_THREADS=1.

If I don’t use binsize 12.0 pair/only on in the command, I get various errors such as “cannot compute PPPM, missing atoms”, “Angle atoms 2576 2577 2578 missing on proc 19”, “Bond atoms 124 125 missing on proc 4”, etc. So my experience is that such errors are not necessarily a sign of bad dynamics; they can also be caused by the GPU/CPU allocation.

As a closing comment: we have implemented a workaround that automatically does the same as using the binsize keyword and that seems to be capable of avoiding any of the reported issues. It will be available in the next (stable) LAMMPS version. We believe that this is not the real solution and that a deeper understanding of what happens is required. However, that will take time, and LAMMPS seems to be working better with the workaround in place than without.
