Imbalanced CPU nodes

Dear LAMMPS developers and users,

My solid-state non-equilibrium thermal conductivity simulations have been very consistent in memory usage: they use 2.3 GB and are very well balanced across the different nodes. Recently, my jobs have started failing with “std::bad_alloc”. After tracking the memory usage, I found that two nodes use more memory than the rest and that the amount increases sharply over time. Would anyone know why this may be? I had something similar happen before and found out that it was because I was invoking compute centro/atom too frequently. I am not sure what has caused a similar problem to occur again.

The images below are screenshots of the virtual memory usage on the different nodes after 1.3 hrs, 1.4 hrs, and 1.8 hrs. I’ve also attached the output file.

I appreciate any input I may receive.

Thank you!

[Screenshots: per-node virtual memory usage after 1.3 hrs, 1.4 hrs, and 1.8 hrs]

ofile.1474957.red-admin.redfin.sharcnet (42.2 KB)

> Dear LAMMPS developers and users,
>
> My solid-state non-equilibrium thermal conductivity simulations have been very consistent in memory usage: they use 2.3 GB and are very well balanced across the different nodes. Recently, my jobs have started failing with “std::bad_alloc”. After tracking the memory usage, I found that two nodes use more memory than the rest and that the amount increases sharply over time. Would anyone know why this may be? I had something similar happen before and found out that it was because I was invoking compute centro/atom too frequently. I am not sure what has caused a similar problem to occur again.

the error message suggests that you are running out of "address
space", i.e. a memory allocation through the "new" operator failed.
there are many possible reasons for that. the two most likely are:
1) you are using a feature of LAMMPS that slowly grows its memory use
until you run out;
2) you are using a feature of LAMMPS that has a memory leak.

the first thing you can try to resolve this is to check out the very
latest LAMMPS patch, version 23June2017, and check whether the issue
persists.
if yes, then you need to narrow down which of the two issues it is.
for that, you should first reduce your system size to be *much*
smaller, so one can quickly run it on a single processor within a few
minutes. it need not crash, but you can monitor its memory usage. it
also does not have to be physically meaningful; it just needs to run
all the various commands in a similar fashion.
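
as a minimal sketch, one way to record the memory footprint of such a test run on Linux (the binary and input file names below are placeholders for whatever your reduced test case uses):

  # start the reduced test case in the background and record its resident
  # and peak memory (from /proc) every 10 seconds until it exits;
  # "lmp_serial" and "in.small" are placeholder names
  ./lmp_serial -in in.small -log log.small &
  LMP_PID=$!
  while kill -0 "$LMP_PID" 2>/dev/null; do
      grep -E 'VmPeak|VmRSS' /proc/"$LMP_PID"/status >> mem_trace.txt
      sleep 10
  done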

if you have such an input, you can either try to run it yourself using
the memcheck tool of the valgrind software, or you can post the
complete input deck here (or on github as an issue) and wait to see if
one of the LAMMPS developers has time to look into it and possibly
confirm whether it is a bug or a feature. for that, however, it is
crucial that your input is really small and runs really fast. none of
the developers has time to wait a long time for a simple debug run
to complete (running under valgrind makes LAMMPS over an order of
magnitude slower).
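
as a sketch, such a valgrind run could look like this (standard memcheck options; binary and input names are again placeholders):

  # run the small serial test case under valgrind's memcheck tool; expect
  # it to be much slower than a normal run. the leak summary written to
  # valgrind.out is only complete once LAMMPS has finished and exited.
  valgrind --tool=memcheck --leak-check=full --show-leak-kinds=definite \
           --log-file=valgrind.out \
           ./lmp_serial -in in.small -log log.small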

axel.

Dear Professor Kohlmeyer,

Thank you for your thorough response. I’ve contacted our system administrators and asked about installing the patch you suggested.

In the meantime, for the test runs, how would I distinguish a memory leak from a feature of LAMMPS that slowly grows its memory use until it runs out?

Thank you so much,

Tara

> Dear Professor Kohlmeyer,
>
> Thank you for your thorough response. I've contacted our system
> administrators and asked about installing the patch you suggested.

you don't want just that patch; you should use the latest (development)
version of LAMMPS. over the last couple of years, the LAMMPS
developers have used a variety of tools to systematically audit the
LAMMPS source code for a variety of programming issues, and that
includes memory leaks. your version from February 2016 has several
known memory leaks that have been fixed since. thus, before looking
into this, we need to know whether what you are seeing is caused
by one of those already fixed leaks.
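
as a sketch of getting the current development tree (the patch tag shown in the comment is an assumption based on the usual LAMMPS tag naming):

  # fetch the current LAMMPS development sources from the official GitHub
  # repository; this replaces the February 2016 tree
  git clone https://github.com/lammps/lammps.git
  cd lammps
  git checkout master     # or a recent patch tag, e.g. patch_23Jun2017 (name assumed)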

> In the meantime, for the test runs, how would I distinguish a memory leak
> from a feature of LAMMPS that slowly grows its memory use until it runs
> out?

you need to do what i suggested, i.e. devise a version of your
calculation that is much smaller in how much memory it needs, that
starts from a data file (not a restart), and that runs very fast and
over a much smaller number of time steps, yet does all the operations
that your current input does. for that you would not need to run on a
large machine; you could just compile a serial version of LAMMPS
yourself directly on your desktop machine.
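
a rough sketch of such a build with the traditional make system of that LAMMPS era (the package shown is only an example; enable whatever packages your input actually needs):

  # build a serial (MPI-stubs) LAMMPS executable; this produces src/lmp_serial
  cd lammps/src
  make yes-manybody        # example only: enable the packages your input requires
  make serial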

with only 2650 atoms, you don't have a large amount of force
computation and memory required for that, so it looks to me like your
main memory consumption might be in the averaging fixes.

with a fast/small input, an experienced programmer can then use tools
like valgrind or compiler instrumentation to detect memory leaks.
for that it is usually necessary that the calculation finishes, and
then the total tally of memory allocations and deallocations is
inspected.

but nobody likes to look for bugs that were already found and fixed,
so we first need confirmation that the unexpected memory growth
still exists in the latest LAMMPS version.

axel.

Dear Professor Kohlmeyer,

I sincerely appreciate your response. I have learned a lot from it. I’ll follow your recommendations and post my findings as soon as I have them.

Thank you again!

Tara