Possible memory problem with ReaxFF when the total number of atoms is increased

Kelvin · January 27, 2025, 11:57am

Hello all,

I am trying to run a ReaxFF MD simulation with LAMMPS. The simulation box is a simple orthogonal Al2O3 box, and the initial configuration is read from a separate data file.

The simulation runs successfully with 30,000 atoms in the box. However, when the total number of atoms is increased to 90,000, the simulation terminates after submission. The error appears to be related to memory allocation.

I have tried adjusting the safezone and mincap parameters, but I find it challenging to tune these settings effectively. Unfortunately, my attempts have not resolved the issue.

I am using a precompiled executable with version 19 Nov 2024. The command used to run LAMMPS is env OMP_NUM_THREADS=56 lmp -sf omp -in in.lammps.

I have attached the files I used for the simulation. Could you please provide guidance on how to solve this problem? Is there any system size limitation for the ReaxFF simulation with LAMMPS?

Any help would be greatly appreciated!

matsci_question.zip (1.2 MB)

akohlmey · January 27, 2025, 12:26pm

There are limitations, but they require much, MUCH larger systems.
You can easily verify that more than 90000 atoms are possible, by inserting a replicate 2 2 2 command into the input right after the read_data command in the examples/reaxff/FC folder.

While the 30000 atom simulation does not crash, its thermo output is highly suspicious with its unusually large fluctuations. That indicates that some of your simulation settings are problematic or that the potential parameters are not suitable for your geometry. Upon closer inspection, it seems that your “units” setting is wrong. You are using “units metal”, but all ReaxFF force field files that I know require “units real”.

stamoor · January 27, 2025, 6:43pm

You could also try the KOKKOS version, which doesn’t use the safezone, mincap, and minhbonds factors that can bloat the memory if you set them too high.

Kelvin · January 28, 2025, 6:41pm

Thank you very much for your guidance!

Following your suggestions, I tried the example examples/reaxff/FC with replicate 2 2 2. The system contains about 140k atoms and runs well for about 2000 steps.

I also corrected the mistake with the units, changing them to units real, as you suggested. The temperature fluctuation for the 30,000 atoms system now seems normal. Thanks very much for pointing out my mistake. I am very sorry I did not notice that before.

However, even after making the same correction for the system with 90,000 atoms, the simulation still terminates shortly after submission. I experimented with several combinations of safezone and mincap, but the following errors occurred:

safezone 20 mincap 100, error: Failed to allocate 119293735680 bytes for array list:three_bodies
safezone 20 mincap 50, error: Failed to allocate 119293735680 bytes for array list:three_bodies
safezone 30 mincap 500, error: Failed to allocate 178940603520 bytes for array list:three_bodies
safezone 40 mincap 100, safezone 40 mincap 500 and safezone 50 mincap 100, error: /var/spool/torque/mom_priv/jobs/119033.master.SC: line 20: 23748 Killed
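
To put the numbers in the errors above into perspective, here is a minimal Python sketch that converts the two reported allocation sizes to GiB and compares their ratio with the ratio of the safezone values. The byte counts are copied from the error messages; the helper name `to_gib` is only illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB

# (safezone, bytes requested) copied from the error messages above
failed_allocations = [
    (20, 119_293_735_680),
    (30, 178_940_603_520),
]

def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB

for safezone, n_bytes in failed_allocations:
    print(f"safezone {safezone}: {to_gib(n_bytes):.1f} GiB requested")

# Compare how the request grows with the safezone factor
(sz_a, bytes_a), (sz_b, bytes_b) = failed_allocations
print("safezone ratio:  ", sz_b / sz_a)
print("allocation ratio:", bytes_b / bytes_a)
```

Both ratios come out as 1.5, so for these two cases the size of the three_bodies request grows in proportion to safezone, and each request is above 100 GiB for this single array.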

I also tried using KOKKOS as the following reply suggested, but only with a single core (since the precompiled executable doesn’t support multithreading with KOKKOS enabled). The simulation can be initiated, but it runs very slowly, with little progress obtained so far.

It seems that as the number of atoms in the system increases, there is some problem with memory allocation. Could you please suggest any further solutions to solve this issue?

Thank you again for your help!

log.30000atoms (30.3 KB)
in.90000atoms (1.5 KB)
log.90000atoms.kokkos_1core (5.1 KB)

akohlmey · January 28, 2025, 6:51pm

Your values for safezone are ridiculously large. Try 1.2 or 1.5, and not more than 2.0.
The allocation failures are a direct consequence of your unreasonable choices.
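
As a rough illustration of why the smaller values matter, the sketch below scales the request reported earlier in the thread (119293735680 bytes at safezone 20) down to the suggested range. The linear scaling is an assumption based only on the two reported error messages; it is not LAMMPS's actual sizing rule, it ignores any internal minimums, and the function name is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB

# Reported in the thread: safezone 20 -> 119293735680 bytes for list:three_bodies
OBSERVED_SAFEZONE = 20
OBSERVED_BYTES = 119_293_735_680

def estimate_three_bodies_gib(safezone):
    """Linear extrapolation from one observed request (an assumption,
    not the formula LAMMPS uses internally)."""
    return OBSERVED_BYTES * safezone / OBSERVED_SAFEZONE / GIB

for safezone in (1.2, 1.5, 2.0):
    estimate = estimate_three_bodies_gib(safezone)
    print(f"safezone {safezone}: roughly {estimate:.1f} GiB")
```

Under that assumption the estimates land around 10 GiB or less instead of more than 100 GiB, which is only meant to show the order of magnitude involved.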

Also, this setting should only be required if your system changes a lot. In that case you can always break down the equilibration part of your simulation into multiple shorter runs.

The following two archives, which are the most recent pre-compiled LAMMPS binaries for the stable and the development version, both have KOKKOS with OpenMP enabled included.

https://github.com/lammps/lammps/releases/download/stable_29Aug2024_update1/lammps-linux-x86_64-29Aug2024_update1.tar.gz

https://github.com/lammps/lammps/releases/download/patch_19Nov2024/lammps-linux-x86_64-19Nov2024.tar.gz

Kelvin · January 28, 2025, 6:51pm

Thank you very much for your reply!

I tried using KOKKOS as you suggested. Since I am using the precompiled executable, which doesn’t support multithreading with KOKKOS enabled, I used only one core with the KOKKOS version by running:

env OMP_NUM_THREADS=1 lmp -kokkos on t 1 -sf kk -in in.lammps

Since it’s easier to start with the precompiled executable, I’m currently trying to find a solution within these constraints. If I eventually need to compile LAMMPS with the KOKKOS package enabled on this cluster, I may need additional assistance from the machine’s experienced admin.

Could you kindly provide further suggestions for solving this issue? Thanks very much in advance!

stamoor · January 28, 2025, 7:28pm

If you are running out of memory, you either need to use more nodes to get more memory, or something is wrong with your simulation.

I would suggest trying this benchmark input to see how many atoms you can run before you run out of memory: examples/reaxff/HNS in the LAMMPS GitHub repository (lammps/lammps, develop branch). For reference, I can run up to 1 million atoms on 16 GB of RAM on a single V100 GPU.

Also you should be using multiple MPI ranks instead of OpenMP threads if possible to speed up the simulation, including for the KOKKOS version.

Kelvin · January 29, 2025, 4:55am

Hello,

Thank you very much for pointing out the issue! That is really helpful.
I have adjusted the value of safezone as you suggested, and the simulation with 90,000 atoms now seems to run well. I am very sorry for the mistake. I should have read the doc page more carefully. It appears that typical settings are safezone 1.6 mincap 100.

I also tested the pre-built binaries for KOKKOS. It seems that lammps-linux-x86_64-29Aug2024_update1.tar.gz includes KOKKOS with OpenMP support, whereas lammps-linux-x86_64-19Nov2024.tar.gz does not. When using the latter, I encountered the following error:

LAMMPS (19 Nov 2024)
KOKKOS mode with Kokkos version 4.4.1 is enabled (src/KOKKOS/kokkos.cpp:72)
ERROR: Multiple CPU threads are requested but Kokkos has not been compiled using a threading-enabled backend (src/KOKKOS/kokkos.cpp:205)
Last command: (unknown)

When the -h flag is used to print the build information, the KOKKOS part shows:

KOKKOS package API: Serial
KOKKOS package precision: double
Kokkos library version: 4.4.1

Maybe KOKKOS with OpenMP needs to be included in the 19Nov2024 patch.

Kelvin · January 29, 2025, 5:02am

Hello,

Thank you very much for your helpful suggestions!

It seems that the issue was caused by the safezone value in my previous script being unreasonably large. I have changed it to 1.6, and now the simulation appears to be running properly.

I appreciate your advice very much! I will try the example you mentioned. It will be extremely helpful for me to better understand how to do a ReaxFF simulation with LAMMPS.

Additionally, I will try compiling LAMMPS with KOKKOS included at a later time.