HPC and ReaxFF

Dear Aidan
Special thanks for sharing your experience.
Each node has 28 GB of memory (112 GB total). I monitored the memory usage while the simulation was running, the one time I managed to run it for a few steps, and each node used about 22 GB. My experience with 800,000 atoms in this simulation (at the relaxation step) suggests that the memory usage remains almost constant.
Is that correct?
Thanks
Mohammad

as with most such questions, the only suitable answer is: it depends!

the LAMMPS developers work very hard to ensure that LAMMPS has no memory
leaks (well, none that grow incrementally per time step), as those would
make simulations with millions of steps a big problem. however, there are
parts of LAMMPS where memory is dynamically allocated according to need,
and that amount may change over time if the overall structure of the
system changes. so if you have a system that is shrinking, or one that is
segregating into clusters, you will see memory use increase due to the
growth of the neighbor lists and possibly an increase in the number of
ghost atoms. but the opposite can be true as well.
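
as a rough illustration (this is just a back-of-the-envelope sketch with
made-up numbers, not LAMMPS code), something like the following shows how
the per-rank footprint scales with the neighbors per atom and the ghost
fraction, which are exactly the quantities that grow when a system
clusters:

// rough per-rank memory estimate (illustration only, not LAMMPS code;
// all numbers below are assumptions you would replace with your own)
#include <cstdio>

int main() {
  const double atoms_per_rank = 800000.0 / 8.0; // assumed: 800k atoms on 8 MPI ranks
  const double neigh_per_atom = 100.0;          // assumed: grows if the system clusters
  const double ghost_fraction = 0.3;            // assumed: grows with surface area per rank

  // neighbor lists store roughly one 4-byte index per neighbor entry
  double neigh_mb = atoms_per_rank * neigh_per_atom * 4.0 / 1.0e6;
  // ghost atoms duplicate per-atom data (positions, type, charge, ...)
  double ghost_mb = atoms_per_rank * ghost_fraction * (3 * 8.0 + 2 * 4.0) / 1.0e6;

  std::printf("neighbor lists: ~%.0f MB per rank\n", neigh_mb);
  std::printf("ghost atoms:    ~%.0f MB per rank\n", ghost_mb);
  return 0;
}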

would you mind posting the performance summary (that is, the part from the
"Loop time" line to "Total wall time") from the log file of a large
simulation that doesn't crash? i am curious to see what kind of
performance you get on your machine. windows-based clusters are extremely
rare.

here is an example of the kind of output i am asking for:

Loop time of 1.77568 on 4 procs for 10000 steps with 200 atoms

Performance: 2432866.370 tau/day, 5631.635 timesteps/s
99.7% CPU use with 4 MPI tasks x no OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total

Dear Aidan
Special thanks for sharing your experience.
Each node has 28 GB of memory (112 GB total).

based on the error messages you have reported, your problem is not
caused by total memory consumption, but by the memory requested for a
single allocation of a (large) list used internally in the USER-REAXC code.
with the current code, these individual memory allocations are restricted
to 2 GB. even if you still have sufficient RAM, a single request that
exceeds this limit will kill the calculation.
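
to illustrate the failure mode (this is only a sketch of the generic
32-bit overflow problem, not the actual USER-REAXC code, and the list size
is an assumed number): when the byte count for one list is computed in a
signed 32-bit integer, anything past about 2.1 GB wraps around and the
allocation request becomes nonsense:

// overflow sketch (not the USER-REAXC implementation): the byte count for
// one large list is computed in a signed 32-bit int and wraps past 2 GB
#include <cstdio>
#include <cstdlib>

int main() {
  int nentries = 300000000;                       // assumed size of one internal list
  int bad = nentries * 8;                         // 2.4e9 does not fit in 32 bits
                                                  // (formally undefined; in practice it wraps)
  long long good = (long long) nentries * 8;      // promote to 64 bits before multiplying

  std::printf("32-bit size: %d bytes\n", bad);    // negative / garbage
  std::printf("64-bit size: %lld bytes\n", good); // 2400000000

  void *p = std::malloc((size_t) good);           // succeeds if enough RAM is available
  if (!p) std::printf("allocation failed\n");
  std::free(p);
  return 0;
}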

since i do not have a real windows machine (hell, the windows binaries are
compiled on Linux with a cross compiler and never touch a windows machine
until they get installed), this is difficult to track down. there are
some obvious changes that can be made to the USER-REAXC code to mitigate
that kind of problem, but without being able to reproduce it, this will be
extremely tedious to resolve. the changes themselves are straightforward,
but it is difficult to assess where they need to be made. if we miss a
single spot, then nothing is fixed.
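
the kind of change i mean looks roughly like this (an illustrative
before/after sketch with made-up names, not a literal patch to the
USER-REAXC sources): do the size arithmetic in a 64-bit type before
handing the request to the allocator.

#include <cstdlib>
#include <cstddef>

struct bond_entry { double r, f; int i, j; };   // hypothetical list element

// before: 'int nbytes = nentries * (int) sizeof(bond_entry);' silently wraps past 2 GB
// after: compute the request in size_t (64 bits on any 64-bit OS, including Windows)
bond_entry *allocate_list(long long nentries) {
  size_t nbytes = (size_t) nentries * sizeof(bond_entry);
  return (bond_entry *) std::malloc(nbytes);
}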

axel.

i've run a test on a local 64-bit linux machine with ~500 GB of RAM and
was able to run a simulation with over 3 million atoms on a single
processor (mind you, a single time step takes half an hour this way).

that significantly narrows down the number of possible locations where
the 32-bit overflow could happen, as it has to involve a data type whose
size differs between linux and windows inside the USER-REAXC source code.
i've identified three such locations and applied a possible correction.
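
for reference, that data type difference is easy to demonstrate: 64-bit
linux uses the LP64 model ('long' is 8 bytes), while 64-bit windows uses
LLP64 ('long' is 4 bytes), so a size computed in a 'long' can overflow
only on windows. a minimal check (my own example, not code taken from
USER-REAXC, with an assumed list size):

#include <cstdio>

int main() {
  std::printf("sizeof(long)      = %zu\n", sizeof(long));      // 8 on 64-bit linux, 4 on 64-bit windows
  std::printf("sizeof(long long) = %zu\n", sizeof(long long)); // 8 on both

  long long nentries = 400000000LL;               // assumed size of a large internal list
  long      via_long      = (long) nentries * 8;  // ok on linux, wraps on windows
  long long via_long_long = nentries * 8;         // portable 64-bit result

  std::printf("bytes via long:      %ld\n", via_long);
  std::printf("bytes via long long: %lld\n", via_long_long);
  return 0;
}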

new windows installer packages with this change included are available
for download. please give them a try.

axel.