Question about MPI error stack

Dear Sir,

I am trying to simulate friction at a hydroxylapatite-titanium interface using a reactive force field (ReaxFF).

The simulation runs normally on our workstation (single CPU).
friction.txt (5.1 KB)
system.lmp (197.5 KB)
reaxff.txt (69 Bytes)
TiO2_ffield.reax (21.3 KB)

To speed things up, we are trying to run the simulation in parallel. However, the parallel run stops with an MPI error stack.
lammps.out (57.0 KB)
lammps.err (715 Bytes)

I have no idea what causes this problem or how to solve it.
Could you give me some advice in this situation?

Thank you very much.
Best regards,

LAMMPS warns you about atoms being time integrated multiple times. That warning must be taken very seriously: if there were not a few use cases where this combination of commands is legitimate, LAMMPS would simply terminate with an error. You need to fix this.
Also, when you run into errors, it is almost always a good idea to upgrade to the latest version of LAMMPS to confirm that you are not seeing an issue that has already been resolved.
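I have not checked your fix definitions, but a generic illustration (not taken from your input) of how this warning typically arises is two time integration fixes acting on overlapping groups of atoms:

    # problem: the "mobile" group is a subset of "all", so the mobile
    # atoms are time integrated twice per step
    fix 1 all nve
    fix 2 mobile nvt temp 300.0 300.0 100.0

    # solution: make sure every atom is handled by exactly one time
    # integration fix, e.g. remove "fix 1" or restrict it to the atoms
    # that are not in the "mobile" group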


Thank you very much, Sir.
I have revised the input so that multiple time integration is avoided.
Unfortunately, the segmentation fault is still there (see the attached log files). As I understand it, it might be due to insufficient allocated memory. Is there any way I can manage memory more efficiently to solve this problem?
I would be grateful if you could give me some advice.

lammps.err (197 Bytes)
lammps.out (85.6 KB)

  1. I recommended updating LAMMPS to a more recent version. This run still uses an older version.
  2. There are neighbor list settings that are extremely wasteful: “neigh_modify page” and “neigh_modify one” are set to absurdly large values. There should be no need to modify those for your kind of system (see the sketch after this list).
  3. You have thermo_modify settings to ignore lost atoms and bonds. This is very bad, since it will hide problems: for your kind of calculation atoms should not be lost, and there are no (explicit) bonds anyway.
  4. The number of atoms per MPI rank is rather small. ReaxFF calculations use a heuristic memory management model that can struggle if the (local) system changes too much, and a small number of atoms per rank increases the risk of being hit by such an issue. This risk can be reduced by using fewer MPI ranks and by restarting between “runs” while the system is still equilibrating.
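To illustrate points 2 and 3, this is the kind of cleanup I mean (a sketch; the numbers in the commented-out line are placeholders, not the values from your input):

    # remove oversized neighbor list settings; the defaults are fine
    # for a system like this
    # neigh_modify page 1000000 one 50000

    # do not ignore lost atoms; stopping with an error (the default)
    # exposes problems instead of hiding them
    thermo_modify lost error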

Thank you very much for your patient explanation.
My simulation now runs to completion after reducing the number of MPI ranks.

By the way, I am trying to optimize the parallel speedup based on the following relation: number of CPU cores = number of MPI processes × number of OpenMP threads per process. I have tried to use all available CPU cores in a node of the supercomputer by keeping the number of MPI processes small and using OpenMP at the same time. However, it does not seem very effective. As mentioned in many discussions, this appears to be difficult because MPI and OpenMP distribute memory differently.
So is this a fundamental bottleneck of parallel computing, meaning that there is no way for me to use all of my computational resources for one simulation?
Or, if there is an effective setup for the hybrid MPI-OpenMP method, where should a beginner like me start?
And are there other methods I should consider? I am working on Windows, so I thought the GPU package was not available.

It would be great if you could give me some advice.
Best regards,

There are a number of factors to consider. You also need to get some background knowledge to fully understand all the details.

  1. You have to make certain that you know the correct number of physical CPU cores. Most modern CPUs, especially the more powerful ones, support simultaneous multi-threading (SMT), which Intel usually calls hyper-threading (HT). This can increase CPU utilization by making each core look like two: as long as the two threads do not need the same circuitry, they can operate in parallel. However, that is often only the case for about 10-20% of the time, sometimes less. For applications like LAMMPS that do a lot of math, it is usually on the lower side, and thus SMT/HT is usually turned off on HPC clusters. So if your CPU has SMT/HT enabled and you try to use all of those “cores”, you may get an incorrect expectation of what kind of speedup you can achieve. It is thus often better to stop at using half the available (logical) CPU cores.

  2. There are theoretical limits to how much speedup you can get from parallelization. Please look up Amdahl’s Law (see the formula after this list). It basically says that the maximum speedup is limited by the amount of time spent in non-parallel code (and every application has some). So if 10% of the time is spent in non-parallelizable code, then even with an infinite number of processors and otherwise perfect parallelization there can only be a 10x speedup at most. The amount of parallelizable work in an MD simulation crucially depends on the number of atoms and the number of neighbors of those atoms, since the dominant amount of time is spent looping over pairs of atoms and computing their interactions. Thus it is much easier to speed up a simulation with many atoms than one with few, and much easier to speed up calculations with “expensive” potentials that involve complex calculations than with “cheap” potentials, particularly those with few neighbors.
    ReaxFF falls into the “expensive” category, but you also have a very small system.

  3. There are practical limits to how well you can speed up a specific system in an MD code. Parallelization is never ideal, as assumed in Amdahl’s Law, because it requires additional overhead that contributes to the non-parallel time. In the case of LAMMPS this is particularly noticeable when comparing MPI parallelization with OpenMP parallelization. LAMMPS was designed for parallelization with MPI using domain decomposition, which partitions the simulated system geometrically into sub-boxes; OpenMP parallelization was added on top of that later. The particle distribution and neighbor list generation are well adapted to and efficient for the MPI parallelization, but for that very reason the OpenMP parallelization has only limited parallel efficiency, since the data structures and data access patterns are optimized for MPI. It would take too long to explain the details, but basically the overhead grows much more slowly with the number of processors for MPI than for OpenMP.

  4. Beyond the efficiency of the implementation itself, there is the question of the number and distribution of “work units”. In LAMMPS the most time consuming operations loop over atoms. Domain decomposition assumes that each sub-domain contains about the same number of atoms (a good assumption for many MD applications, which study bulk condensed systems). If the atoms are not evenly distributed, the amount of work differs between sub-domains, and the sub-domain with the most work determines how efficient the MPI parallelization is. This can be improved by choosing a more suitable domain decomposition with the processors command (e.g. for a slab in the z-direction with vacuum above and below it, it is more efficient to restrict the decomposition to a single layer of sub-domains in z; see the input sketch after this list). For more complex situations there are also the balance command and fix balance (in case the distribution changes significantly over time). Load balancing uses the number of atoms per sub-domain as a guide by default, but that is only a coarse approximation. For OpenMP the ideal work unit would be pairs of atoms, but due to limits of the implementation it can only distribute work over atoms, and this can lead to (minor) load imbalances, since different atoms may have different numbers of neighbors.
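For reference, the textbook form of Amdahl’s Law mentioned in point 2, with p the parallelizable fraction of the work and N the number of processors, is:

    S(N) = \frac{1}{(1 - p) + p/N}, \qquad S(\infty) = \frac{1}{1 - p}

so with 10% non-parallelizable code (p = 0.9) the speedup can never exceed 1/0.1 = 10x, no matter how many processors are added.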
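And as a sketch of the domain decomposition adjustment from point 4, assuming a slab normal to z with vacuum above and below it (the asterisks let LAMMPS choose the processor grid in x and y):

    # must appear before the simulation box is created:
    # use a single layer of sub-domains in z so that no MPI rank
    # owns mostly vacuum
    processors * * 1

    # optionally rebalance the sub-domain boundaries in x and y
    # based on the number of atoms per sub-domain
    balance 1.1 shift xy 20 1.1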

Bottom line:

  • make sure you know how many physical CPU cores you have and, as a first approximation, do not use the “fake” cores from SMT
  • then determine whether your system is homogeneous or not and adjust the domain decomposition via the processors command if you have vacuum regions
  • find the optimal number of MPI processes, i.e. increase it (logarithmically) until adding more MPI processes no longer gives a significant speedup
  • now add OpenMP on top of that and see how much additional speedup you can get. To benefit from OpenMP you must enable OPENMP package styles on the command line or in the input (see the example command line below). LAMMPS always reports how many OpenMP threads are available, but those need not actually be used. The %CPU utilization output can be helpful here: it should be close to 100% for a single OpenMP thread and close to 200% for two OpenMP threads.
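A command line for such a test could look like the following (4 MPI ranks with 2 OpenMP threads each, i.e. 8 cores; the MPI launcher, the name of the LAMMPS executable, and the input file name are placeholders for whatever your installation uses):

    # 4 MPI ranks x 2 OpenMP threads = 8 physical CPU cores
    # -sf omp selects the OPENMP package styles, -pk omp 2 sets 2 threads per rank
    env OMP_NUM_THREADS=2 mpirun -np 4 lmp -sf omp -pk omp 2 -in in.friction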

The GPU package is included in the Windows binaries (mixed precision, via OpenCL), but it has no ReaxFF support. GPU acceleration for ReaxFF requires KOKKOS, which currently cannot be compiled for Windows, at least not with the cross compilers used to build the precompiled LAMMPS binaries.
