Memory problems with USER-OMP threading

Hello all,

I hit a problem and wanted to ask for some advice. I do not have an extensive background in OpenMP, so I would like to find out whether the following problem is unique to my setup or whether there is a way to resolve it.

I am running some benchmarking simulations on several systems for comparison. One system is causing trouble; it has queues with a range of quad-core CPUs. My test ran minimization, NPT, and NVE simulations for 1000 steps. I used both 3200-atom and 130,000-atom structures with the AIREBO potential. I wanted to gauge the effect of increasing the number of cores (4, 16, 32, …) and increasing the OpenMP threading (1, 2, 4) in hybrid MPI/OpenMP runs. Perhaps I shouldn't run 4 OMP threads per quad-core CPU, but it seemed like a logical test, and it ran particularly fast in some instances.

I found, however, that once the number of cores exceeded 32, a realloc error would be thrown while the atoms were being organized at the beginning of the run when using 4 OpenMP threads. When I increased the number of cores to 128, I got the same error with 2 OpenMP threads, so I was stuck with only 1 thread. I am aware that increasing the number of cores can create greater memory strain because memory is allocated for all of the processes, but I had expected memory usage to decrease with increasing OMP threading. Am I mistaken?

# Example of the important part of the error for 4-omp-threads and 32-mpi-processes (128 CORES)
ERROR on proc 25: Failed to reallocate 1395552 bytes for array atom:f (memory.cpp:66)

I was also surprised that this error can actually resolve itself if I repeatedly resubmit the run (<5 times) until the simulation starts and then continues without failure, while being careful to avoid infinite loops ;). I found this out because I first used the Python timeit module and found that a few of the iterations actually ran the simulation without error. Is that expected? This makes me wonder if memory management on this system is volatile (or some libraries are poorly built), since I can accidentally jam a simulation through when normally it would throw an error.

The problem is not the input files, because they work well on all the other systems. Is this something that happens on some systems in particular? Perhaps there are ways to build with better memory management, or maybe I am stuck. This is using the Intel icc compiler.

If anyone has experience with this sort of problem, I would be interested in hearing any thoughts on resolving it; or perhaps it is simply a common problem with memory allocation when using too many MPI processes.

Best Regards,

Derek Thomas

> Hello all,

hello derek,

> I hit a problem and wanted to ask for some advice. I do not have an extensive
> background in OpenMP, so I would like to find out whether the following
> problem is unique to my setup or whether there is a way to resolve it.

let us see.

since OpenMP support in LAMMPS is latched onto code that has been
designed around being very efficiently parallelizable with MPI, it is
a bit difficult to give good advice on how to use it well. the biggest
problem in providing good written help is that, to me (as the
programmer of that code), many things are blatantly obvious, but it is
difficult to fathom which of those need to be explained in detail and
which explanations are more confusing than helpful. you could help
yourself, your fellow LAMMPS OpenMP users and me in improving that
written description, or even come up with a little tutorial
describing the necessary steps in a language that is accessible to
people without extensive technical knowledge.
if you are interested (i certainly am), we can take the discussion
off-list and later provide the resulting document to steve for
inclusion in the LAMMPS documentation.

> I am running some benchmarking simulations on several systems for comparison.

ok.

> One system is causing trouble; it has queues with a range of quad-core CPUs.
> My test ran minimization, NPT, and NVE simulations for 1000 steps. I used
> both 3200-atom and 130,000-atom structures with the AIREBO potential. I
> wanted to gauge the effect of increasing the number of cores (4, 16, 32, ...)
> and increasing the OpenMP threading (1, 2, 4) in hybrid MPI/OpenMP runs.
> Perhaps I shouldn't run 4 OMP threads per quad-core CPU, but it seemed like a
> logical test, and it ran particularly fast in some instances.

4 threads should be fine. the current OpenMP support is written to be
very effective at a small to moderate number of threads and picks up
overhead with an increasing number of threads. the cutoff is around
6-10 threads; beyond that, a different threading strategy, more
similar to what is done for GPUs, would be more efficient. it depends a
lot on the specific hardware.

> I found, however, that once the number of cores exceeded 32, a realloc error
> would be thrown while the atoms were being organized at the beginning of the
> run when using 4 OpenMP threads. When I increased the number of cores to 128,
> I got the same error with 2 OpenMP threads, so I was stuck with only 1
> thread. I am aware that increasing the number of cores can create greater
> memory strain because memory is allocated for all of the processes, but I had
> expected memory usage to decrease with increasing OMP threading. Am I
> mistaken?

yes. the memory info that LAMMPS prints is the *per MPI process*
memory, not the total memory. with OpenMP the memory use will *always*
have to go up, since individual threads have to work on local,
per-thread memory, if only on the stack. that stack memory will not be
shown in the LAMMPS memory output, which only displays large
allocations, i.e. it is a lower limit on how much memory is allocated
(which can be different from how much is used, too). with more threads
this memory use will go up, since some large storage areas, e.g. the
storage for forces, are multiplied by the number of threads: each
thread works on its own copy of the force array, and only at the end
of the force computation is all force data reduced into the storage
of the first thread, which coincides with the storage used for serial
runs (this was needed to keep the newton's 3rd law optimization and
have only minimal overhead for avoiding the race condition when
concurrent threads update the forces).
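
to make this concrete, here is a stripped-down sketch of that idea
(made-up names, not the actual USER-OMP code):

// sketch of per-thread force buffers with a final reduction into the
// slice of thread 0, which is the same storage a serial run would use
// (illustration only; compile with e.g. g++ -fopenmp)
#include <cstddef>
#include <omp.h>
#include <vector>

void compute_forces(int nall, std::vector<double> &f /* nthreads*nall*3 */) {
  #pragma omp parallel
  {
    const int tid = omp_get_thread_num();
    double *fthr = &f[(std::size_t) tid * nall * 3]; // this thread's private copy
    (void) fthr; // the pair loop that would accumulate into fthr is elided here

    // ... pair loop: accumulate forces for atom i (and j, newton's 3rd law)
    //     into fthr without any locking ...

    #pragma omp barrier      // all threads must be done accumulating
    #pragma omp for          // reduce the per-thread copies into thread 0's copy
    for (int i = 0; i < 3 * nall; ++i)
      for (int t = 1; t < omp_get_num_threads(); ++t)
        f[i] += f[(std::size_t) t * nall * 3 + i];
  }
}

int main() {
  const int nall = 1000;     // owned + ghost atoms on this MPI rank (made up)
  std::vector<double> f(3UL * nall * omp_get_max_threads(), 0.0);
  compute_forces(nall, f);
  return 0;
}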

> # Example of the important part of the error for 4-omp-threads and
> 32-mpi-processes (128 CORES)
> ERROR on proc 25: Failed to reallocate 1395552 bytes for array atom:f
> (memory.cpp:66)

> I was also surprised that this error can actually resolve itself if I
> repeatedly resubmit the run (<5 times) until the simulation starts and then
> continues without failure, while being careful to avoid infinite loops ;). I
> found this out because I first used the Python timeit module and found that a
> few of the iterations actually ran the simulation without error. Is that
> expected? This

no. this is not expected; it hints at a cluster where the stack
limits are not properly set. machines using RHEL tend to have very low
default settings, and with PBS/Torque as the resource manager it can
happen that larger limits are not properly propagated.

> makes me wonder if memory management on this system is volatile (or some
> libraries are poorly built), since I can accidentally jam a simulation through
> when normally it would throw an error.

this is very unlikely.

> The problem is not the input files, because they work well on all the other
> systems. Is this something that happens on some systems in particular?
> Perhaps there are ways to build with better memory management, or maybe
> I am stuck. This is using the Intel icc compiler.

the intel compilers (you'd rather be using the c++ frontend icpc, btw)
are known to be more greedy in terms of stack requirements, and - as i
mentioned before - the way multiple threads work in general puts a
higher demand on stack space (for local variables), so what may
work without threading could be a problem with threads.

> If anyone has experience with this sort of problem, I would be interested in
> hearing any thoughts on resolving it; or perhaps it is simply a common problem
> with memory allocation when using too many MPI processes.

no.

one more thing to consider is processor/memory affinity. with today's
NUMA systems, there is a significant performance benefit to keeping
processes on the same CPU cores. with MPI only, this is simple, since
MPI is a "share nothing" scheme, where you can just tie each MPI
process to a single core and be done. with OpenMP, this gets a bit
complicated, because you want all threads to be located at least
on the same socket, preferably on cores that share as many caches as
possible, and you don't want threads belonging to the same MPI process
to be spread across multiple sockets, as that would reduce the available
memory bandwidth and make caching less efficient.
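
if you want to check what binding you actually get, a tiny test program
like this (assuming linux and OpenMP; not part of lammps) prints which
core each thread lands on:

// print the core each OpenMP thread is currently running on (Linux),
// e.g. to verify that the threads of one MPI rank stay on one socket
// (compile with g++ -fopenmp)
#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu()

int main() {
  #pragma omp parallel
  {
    #pragma omp critical   // keep the output lines from interleaving
    printf("thread %d of %d runs on core %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}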

hope this helps, and let me know if you have any additional questions.

axel.

> Hello Axel,

> Thanks a lot for your answers. I'll send you a separate email off-list to
> discuss a possible writeup for better understanding the USER-OMP package.

>> yes. the memory info that LAMMPS prints is the *per MPI process*
>> memory, not the total memory. with OpenMP the memory use will *always*
>> have to go up, since individual threads have to work on local,
>> per-thread memory, if only on the stack. that stack memory will not be
>> shown in the LAMMPS memory output, which only displays large
>> allocations, i.e. it is a lower limit on how much memory is allocated
>> (which can be different from how much is used, too). with more threads
>> this memory use will go up, since some large storage areas, e.g. the
>> storage for forces, are multiplied by the number of threads: each
>> thread works on its own copy of the force array, and only at the end
>> of the force computation is all force data reduced into the storage
>> of the first thread, which coincides with the storage used for serial
>> runs (this was needed to keep the newton's 3rd law optimization and
>> have only minimal overhead for avoiding the race condition when
>> concurrent threads update the forces).

> I see. So if the memory shown in the error is not actually the total memory
> being used, wouldn't it be good to output that as part of the error, if it
> can be calculated? Since an error (not a warning) is grounds to shut down
> LAMMPS, it might be good to show a calculation of the entire memory usage,
> including the per-thread stacks. I am not sure if that is possible, but it
> would be good especially for the error given at `memory.cpp:66` and other
> memory-related errors.

this is not easy to do, especially not on linux machines with standard
settings. there are two reasons for that:
1) linux uses an "optimistic" malloc that manages larger chunks of
memory through mmap() with copy-on-write. that means memory isn't
physically allocated unless it is actually used. this memory
overcommitment allows you to run much bigger problems or have more
users run applications, because there are a lot of applications that
allocate memory for features "just in case" but don't use them. for
typical desktop use that is very convenient, increases productivity
and saves money (for RAM). this feature can be turned off, and the
odds are that is what happens on the machine you are running on.
2) you have multiple types of allocatable memory: stack, heap, and
files/devices that are mapped into the address space. furthermore,
none of those is necessarily linked to physical memory (unless it is
"pinned"), but they all count towards the address space. so it is not
straightforward to tell whether you are running out of address space,
out of virtual memory, or into a user limit.
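
you can see point 1) for yourself with a small test program like the one
below (a sketch assuming a linux box with default overcommit settings,
not related to lammps): VmSize jumps as soon as the buffer is allocated,
but VmRSS only grows once the pages are actually touched.

// demonstrate "optimistic" allocation on Linux: VmSize grows on allocation,
// VmRSS (resident, physically backed memory) only grows once pages are touched
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>

static void print_vm(const char *label) {
  std::ifstream status("/proc/self/status");
  std::string line;
  printf("--- %s ---\n", label);
  while (std::getline(status, line))
    if (line.compare(0, 6, "VmSize") == 0 || line.compare(0, 5, "VmRSS") == 0)
      printf("%s\n", line.c_str());
}

int main() {
  const size_t nbytes = 1UL << 30;   // 1 GiB
  print_vm("before allocation");
  char *buf = new char[nbytes];      // address space reserved, not yet backed
  print_vm("after allocation");
  memset(buf, 1, nbytes);            // touching the pages makes them resident
  print_vm("after touching the pages");
  delete[] buf;
  return 0;
}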

anything that one could add here would be machine specific. the best
you can do is to learn how to tell which limit you are running into,
since that is very machine and setup specific.
one thing that you can look at is the output of the command "ulimit -a",
which tells you the current limits. you can launch it from inside
lammps using: shell "ulimit -a"
another thing is to query the status of your current process
using this command: shell "grep ^Vm /proc/$PPID/status"
the latter will give you output like this:
VmPeak: 201624 kB
VmSize: 201624 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 7764 kB
VmRSS: 7764 kB
VmData: 5176 kB
VmStk: 140 kB
VmExe: 10904 kB
VmLib: 10944 kB
VmPTE: 356 kB
VmSwap: 0 kB

with the meaning:
VmPeak: largest amount of address space used by the lammps process so far
VmSize: current size of the address space in use by the lammps process
VmLck:  locked (fixed physical location) memory
VmPin:  pinned (not allowed to be swapped out) memory
VmRSS:  resident set size (actively used physical memory)
VmData: allocated data (heap) size
VmStk:  allocated stack space
VmExe:  address space used for executable code
VmLib:  address space used for shared libraries
VmSwap: amount of memory swapped out

in combination with the ulimit output and the error message from the
failed realloc call, you should be able to track down which limit you
are hitting. in some cases, you can raise the limits: there is often a
"soft" limit configured (to keep people from unknowingly using a lot of
resources), and that can be raised until you reach the "hard" limit.
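
for completeness, this is roughly how a process can query those limits
and raise its soft limit up to the hard limit by itself (a plain POSIX
sketch, independent of lammps; the resource manager may still impose
its own limits on the job):

// query the soft/hard limits for stack size and address space and raise
// the soft limit to the hard limit where permitted (Linux/POSIX sketch;
// RLIM_INFINITY will show up as a very large number)
#include <cstdio>
#include <sys/resource.h>

int main() {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) == 0) {       // stack size limit
    printf("RLIMIT_STACK: soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur, (unsigned long long) rl.rlim_max);
    rl.rlim_cur = rl.rlim_max;                   // raise soft limit to hard limit
    setrlimit(RLIMIT_STACK, &rl);
  }
  if (getrlimit(RLIMIT_AS, &rl) == 0) {          // total address space limit
    printf("RLIMIT_AS: soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur, (unsigned long long) rl.rlim_max);
    rl.rlim_cur = rl.rlim_max;
    setrlimit(RLIMIT_AS, &rl);
  }
  return 0;
}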

>> no. this is not expected; it hints at a cluster where the stack
>> limits are not properly set. machines using RHEL tend to have very low
>> default settings, and with PBS/Torque as the resource manager it can
>> happen that larger limits are not properly propagated.

> I'll add another data point: 128 MPI processes with 2 OMP threads ran
> without problems, as opposed to 64*2 crashing. This seems rather strange to
> me.

no. with 64 processes, assuming that you are running the same system,
you need more address space per process.
this can make a huge difference if you run in a 32-bit environment,
where the kernel can manage tens of gigabytes of RAM through page
table extensions, but each process has a limit of 2GB minus epsilon
for its address space.

> Thanks for your thoughts. I may be straying very close to the stack limits,

from re-reading your original post, it doesn't look like a stack limit
but rather an address space limit, and that is almost certainly imposed
through the configuration of the system you are running on. it all
depends on what exactly you are doing in your lammps input script;
to get a more accurate appraisal of your situation, you'll have to
provide it.

> but it worries me. When the program does actually start, it does not seem to
> crash, so as far as I can see I have not overflowed the stack. I'll see
> if I can come up with a simple example to reproduce this independently of
> LAMMPS. Then I'll ask my sysadmin about it.

at this point you have to understand a little bit about the memory
management strategy inside LAMMPS. some of it has its roots in the
idiosyncrasies of the operating systems of rather old-school
supercomputers from cray:
LAMMPS uses a domain decomposition, with the atom data being
distributed across the MPI processes. each process "owns" a domain and
the atoms in it, plus some additional atoms that it needs to compute
the interactions of the owned atoms with their environment. those atoms
are called "ghosts". there is a common storage area for those (it is
always more efficient to allocate memory in one large contiguous
chunk), and its size is denoted by the variable atom->nmax. while the
simulation is going on, atoms move between domains and the storage
needs of each domain change. however, for efficiency reasons
(allocating, copying and freeing storage takes time, and in the old
days it took huge amounts of time on some machines), LAMMPS will only
ever increase that storage, never reduce it. this way, after a while no
more allocations are needed. classical MD usually needs rather little
memory, so this is not a big deal (not like quantum chemistry).
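
in (simplified and made-up) code, that grow-only pattern looks roughly
like this; it is just an illustration of the idea, not the actual
lammps source:

// sketch of a grow-only per-atom array, in the spirit of atom->nmax handling
#include <cstdio>
#include <cstdlib>

struct PerAtomStorage {
  double *f = nullptr;   // e.g. forces, 3 doubles per atom
  int nmax = 0;          // allocated capacity in atoms (never shrinks)

  // make sure there is room for n atoms; grow in chunks, never shrink
  void grow(int n) {
    if (n <= nmax) return;                 // still fits, do nothing
    const int DELTA = 1024;                // grow in chunks to amortize realloc cost
    nmax = ((n / DELTA) + 1) * DELTA;
    f = (double *) realloc(f, (size_t) 3 * nmax * sizeof(double));
    if (!f) {                              // this is where a realloc failure surfaces
      fprintf(stderr, "Failed to reallocate %zu bytes\n",
              (size_t) 3 * nmax * sizeof(double));
      exit(1);
    }
  }
};

int main() {
  PerAtomStorage atom;
  atom.grow(3200);   // owned + ghost atom count changes as atoms migrate
  atom.grow(3500);   // only reallocates when the capacity is exceeded
  printf("capacity: %d atoms\n", atom.nmax);
  free(atom.f);
  return 0;
}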

in short: it looks like you are running on a machine with limitations
(physical or via configuration), and you seem to be running an input
whose storage requirements change over time, which can happen due to
changes in the domain decomposition and how atoms are distributed
across processors.

>>> makes me wonder if memory management on this system is volatile (or some
>>> libraries are poorly built), since I can accidentally jam a simulation through
>>> when normally it would throw an error.

>> this is very unlikely.

> I thought so.

>> the intel compilers (you'd rather be using the c++ frontend icpc, btw)
>> are known to be more greedy in terms of stack requirements, and - as i
>> mentioned before - the way multiple threads work in general puts a
>> higher demand on stack space (for local variables), so what may
>> work without threading could be a problem with threads.

> Interesting, maybe there is a way to compile with less greedy memory

you can try switching to gcc or not using OpenMP, but both may lower
the performance.

> allocation. Also, I'm not sure if icpc has much of an advantage over icc. I
> still seem to need to add the linker flag `-lstdc++`, which seems strange to

you don't. icpc includes all implied libraries if it is used for linking.

> me. That's the reason I went with icc instead.

>> one more thing to consider is processor/memory affinity. with today's
>> NUMA systems, there is a significant performance benefit to keeping
>> processes on the same CPU cores. with MPI only, this is simple, since
>> MPI is a "share nothing" scheme, where you can just tie each MPI
>> process to a single core and be done. with OpenMP, this gets a bit
>> complicated, because you want all threads to be located at least
>> on the same socket, preferably on cores that share as many caches as
>> possible, and you don't want threads belonging to the same MPI process
>> to be spread across multiple sockets, as that would reduce the available
>> memory bandwidth and make caching less efficient.

> Thanks for the idea, I'll look into what this does. The system in question
> already attempts to optimize hybrid OMP-MPI calculations using a
> proprietary program aptly named `hybrid`.

that reminds me. i've been told about a similar tool written by some
people i met a few months ago, and i'll have to look it up and try it
out. it might be a good piece of advice to add to a description of how
to run OpenMP efficiently in LAMMPS.

cheers,
    axel.