Gpu+mpirun error: Too many neighbors on GPU. Use neigh_modify one to increase limit

zchen · September 9, 2021, 1:46pm

I use “mpirun -np 4 lmp -sf gpu -pk gpu 1 -in run0.txt” to simulate a long polymer chain (net charge -60) and Na+ ions in vacuum under NVT conditions. There are 1616 atoms in total.
In the middle of the run, I got the error: Too many neighbors on GPU. Use neigh_modify one to increase limit.
When I check the trajectory in VMD, before the error occurs the Na+ ions repeatedly fly away from the polymer and then come back.

What is the cause of this error and how can I fix it? The system runs fine if I don’t use the GPU, but that will be slow for systems with many atoms.

akohlmey · September 9, 2021, 2:04pm

What version of LAMMPS are you using and how did you compile GPU support?

Would you mind sharing your test input deck so the unexpected behavior can be independently verified?

zchen · September 15, 2021, 2:56pm

Hi Alex:
Thank you for looking into it. Sorry for my late reply; I haven’t logged into this forum for some days. I wanted to share the test files as an attachment, but new users cannot upload. How can I share the test input deck? I tried putting them on Google Drive for public access. [Google Drive: Sign-in]

It is the latest LAMMPS version. I compiled GPU support with OpenCL and default configuration.

=============================================
Some days ago, I simulated the same system (PSS:PEDOT:Na+) but with more atoms, and weird errors occurred. I can also share the input files of this system.

with 4 cores + gpu “mpirun -np 4 lmp -sf gpu -pk gpu 1 -in run2.txt”
error is:
“[neurion2:414314] *** Process received signal ***
[neurion2:414314] Signal: Segmentation fault (11)
[neurion2:414314] Signal code: Address not mapped (1)”
with 3 cores + gpu “mpirun -np 3 lmp -sf gpu -pk gpu 1 -in run2.txt”
error is:
“OpenCL error in file ‘/home/ruben/Software/lammps/lib/gpu/geryon/ocl_kernel.h’ in line 468 : -54.
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.”
I remember that with 3 or 4 cores, I also got “signal aborted (6)”.

The system runs correctly without the GPU, e.g. “mpirun -np 20 lmp -in run2.txt”. 1 core + GPU also runs correctly.
What is the reason for this LAMMPS behavior, and how can I fix it?

Thank you so much for taking a look.

Sincerely,
Zhongquan Chen

akohlmey · September 15, 2021, 3:16pm

This is a very imprecise description. The “latest” depends on when you downloaded it and whether you are referring to the latest stable version, the latest patch version or the development version (from the git repository). The best way is to take the version info from the ./lmp -h output. For example, the current version that I am using for testing is:

Git info (master / patch_31Aug2021-58-g0dd35bdb66)

which says it is taken from the branch “master”, is 58 commits after the 31 August 2021 patch release, and refers to the (git) commit starting with 0dd35bdb66.

akohlmey · September 15, 2021, 3:18pm

You have to set the “sharing” settings for that URL to “anybody with the link”.

akohlmey · September 15, 2021, 3:19pm

Please also provide the specs for your GPU and the output from ocl_get_devices.

zchen · September 15, 2021, 3:34pm

The version is:
Git info (master / patch_30Jul2021-60-gdad9942bb8)

zchen · September 15, 2021, 3:37pm

The files are in:
https://drive.google.com/drive/folders/1CRG0QZvlE_a4gekV4lkl2rQbq8fe4_Lw?usp=sharing

The 3 files for the larger system (12407 atoms) that gives weird errors as described above, are in this link.
https://drive.google.com/drive/folders/1uJLLWZxGHnJJTQndT-f_Vgu49PQ2_kkE?usp=sharing

akohlmey · September 15, 2021, 3:44pm

There you go. This is not “latest” at all. This is currently 420 commits behind the main development branch.

zchen · September 15, 2021, 3:50pm

My GPU is NVIDIA-SMI 470.57.02 Driver Version: 470.57.02 CUDA Version: 11.4.

The output from ocl_get_devices is not written to the log.lammps file. I need to run the simulation again to generate it.

akohlmey · September 15, 2021, 3:55pm

This does not describe the GPU, but the driver and the library querying the status.

ocl_get_devices is a separate command. You do not need to run the calculation again.

But what matters most is that you probably have mixed precision selected.
When I run the first input, I see a very large Lennard-Jones energy in your thermo output.
Perhaps you should first run a minimization on the CPU and write out the data file again before you use the GPU acceleration. With single or mixed precision, the risk of having the forces overflow is roughly 10^8 times higher than with double precision.
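
The suggestion above could be sketched as a short CPU-only input. This is a minimal sketch, not the poster's actual input: the file names are hypothetical, the tolerances are illustrative values, and the force field setup is assumed to be the same as in the production input.

```
# CPU-only pre-relaxation: launch without "-sf gpu -pk gpu 1".
# File names below are hypothetical placeholders.

# ... units, atom_style, read_data and all force field settings
#     exactly as in the production input ...

thermo          100
min_style       cg
minimize        1.0e-4 1.0e-6 1000 10000   # etol ftol maxiter maxeval

# write the relaxed coordinates; the GPU run then reads this file
write_data      relaxed.data
```

If the Lennard-Jones energy in the thermo output is still very large after the minimization, that points to overlapping atoms in the initial configuration rather than to a GPU problem.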

zchen · September 15, 2021, 4:07pm

I will try the minimization on the CPU, since LAMMPS runs fine with CPUs only. As for the mixed precision setting, I need to discuss it with a colleague who maintains the computing cluster and installed LAMMPS.

How do I run “lammps/lib/gpu/ocl_get_devices” to get the list of devices?

akohlmey · September 15, 2021, 4:10pm

Ok, I think I have figured out what is going wrong here.

The main issue seems to be that you have a very sparse system. I do not get the “too many neighbors” error and this should not happen because of the sparsity (this also means that the benefit from GPU acceleration is limited in comparison with a dense bulk system).

I can run with OpenCL on a single GPU
Device 0: NVIDIA GeForce GTX 1060 6GB, 10 CUs, 5.2/5.9 GB, 1.7 GHZ (Double Precision)
but get crashes when using multiple MPI tasks. This can be avoided by inserting the following command after the read_data command:
balance 1.0 shift xyz 10 1.0
This will shift the subdomain boundaries to have a more optimal particle distribution and - in this case - no more subdomains without atoms.
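
For orientation, a sketch of where that line would sit in an input file. The data file name is a hypothetical placeholder; only the balance line is taken from the post above.

```
read_data       system.data        # hypothetical file name

# ... force field settings ...

# balance thresh style args
#   thresh     = 1.0  : rebalance if the imbalance factor exceeds 1.0
#   shift xyz         : move subdomain boundaries in x, y and z
#   10                : at most 10 iterations per dimension
#   1.0               : stop once the imbalance factor is at or below 1.0
balance         1.0 shift xyz 10 1.0
```

The balance command is applied once, at the point where it appears in the input, so it has to come after the simulation box and atoms have been defined.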

zchen · September 15, 2021, 4:15pm

I see, it is the sparsity of the system. I will run my various systems with “balance 1.0 shift xyz 10 1.0” and see if it works. Then I will post the results/errors here.

Actually, I don’t want to have a very sparse system. Eventually I need to use fix npt to compress the system to a density close to 1 g/cm^3 and then solvate it in 50000 water molecules. I created this sparse system in Python because: 1. I don’t have an initial configuration, 2. I want to avoid too much overlap.

“Too many neighbors on GPU” is the error in the middle of the run. If I keep running the system, it will give this error after a day or so.

akohlmey · September 15, 2021, 4:19pm

There are more advanced balancing options available. Please see the documentation of the balance command. Using recursive bisectioning is probably even better in your case. You can also use fix balance to have this balancing re-applied occasionally.
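
A minimal sketch of both options, assuming a hypothetical data file name and illustrative values for the threshold (1.1) and rebalancing interval (1000 steps). The fix ID "rebal" is also just a placeholder. The rcb style requires tiled communication; this combination has not been tested here with the GPU package.

```
comm_style      tiled              # required by the rcb balancing style

read_data       system.data        # hypothetical file name

# ... force field settings ...

# one-time recursive coordinate bisectioning before the run
balance         1.1 rcb

# re-check every 1000 steps and rebalance if the imbalance factor exceeds 1.1
fix             rebal all balance 1000 1.1 rcb
```

Periodic rebalancing may matter for a system like this one, where the ions move far from the polymer and back, so a partition that is good at step 0 may leave subdomains empty later.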

akohlmey · September 15, 2021, 4:22pm

This is not something that I have the time to test for and debug. You would have to regularly write out restarts and then convert the latest restart before the crash to a data file and check for how long you need to run until the error happens. If that is soon, I can have a look. But first check out whether the balance command changes this. You may just have corrupted data on the GPU due to the lack of atoms. However, if the issue persists, then the most likely explanation is that there is a weakness in your parameters where atoms sometimes get too close and then cause large forces and ruin things from there on.
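
The restart workflow described above might look like the following sketch. The file names and the interval of 10000 steps are assumptions chosen for illustration, as is the step number of the restart that gets converted.

```
# --- in the production input: write a restart file every 10000 steps ---
# produces poly.restart.10000, poly.restart.20000, ...
restart         10000 poly.restart

# --- in a separate, short conversion input (run on the CPU) ---
# pick the last restart written before the crash (step number is hypothetical)
read_restart    poly.restart.20000
write_data      before_crash.data
```

Restarting the GPU run from before_crash.data then shows how many steps are needed to reproduce the error, which is the information requested above.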

zchen · September 15, 2021, 4:24pm

Yes, I will follow this logic and test. Then I will post an update here.

akohlmey · September 15, 2021, 4:24pm

If the force field parameters are designed for water being present and you simulate without, then you can run into serious issues. Since water will “stick” to charged particles rather tightly, it provides a “shield” that is not present in your setup and thus you can get otherwise “impossible” contacts.

zchen · September 16, 2021, 12:52pm

Device 0: “NVIDIA GeForce GTX 1080 Ti”
Type of device: GPU
Supported OpenCL Version: 3.0
Is a subdevice: No
Double precision support: Yes
Total amount of global memory: 10.9165 GB
Number of compute units/multiprocessors: 28
Total amount of constant memory: 65536 bytes
Total amount of local/shared memory per block: 49152 bytes
Maximum group size (# of threads per block) 1024
Maximum item sizes (# threads for each dim) 1024 x 1024 x 64
Clock rate: 1.62 GHz
ECC support: No
Device fission into equal partitions: No
Device fission by counts: No
Device fission by affinity: No
Maximum subdevices from fission: 1
Shared memory system: No
Subgroup support: No
Shuffle support: Yes

akohlmey · September 16, 2021, 1:39pm

You should try to compile with -DGPU_API=cuda for that GPU.
There are a few subtle differences between the two setups and they seem to make a difference here.