Hello all,
hello derek,
I hit a problem and wanted to ask some advice. I do not have an extensive
background in OpenMP, so I figured I may be able to see if the following
problem is unique or if there may be a way to resolve it.
let us see.
since OpenMP support in LAMMPS is latched onto a code that has been
designed around being very efficiently parallelizable with MPI, it is
a bit difficult to give good advice on how to use it well. the biggest
problem in providing good written help is that to me (as the
programmer of that code) many things are blatantly obvious, and it is
difficult to fathom which of those need to be explained in detail and
which explanations are more confusing than helpful. you could help
yourself, your fellow LAMMPS OpenMP users and me in improving that
written description or even coming up with a little tutorial
describing the necessary steps in a language that is accessible to
people without extensive technical knowledge.
if you are interested (i certainly am) we can take the discussion
off-list and then later provide the resulting document to steve for
inclusion into the LAMMPS documentation.
I am running some benchmarking simulations on several systems for comparison
ok.
One system is causing trouble; it has queues with a range of quad-core cpus.
My test ran minimization, NPT, and NVE simulations for 1000 steps. I used
both 3200 atom and 130,000 atom structures with the AIREBO potential. I
wanted to gauge the effect of increasing the number of cpus (4, 16, 32, ...)
and increasing OpenMP threading (1, 2, 4) hybridization. Perhaps I shouldn't
run 4 OMP threads on quad-core cpus, but it seems like a logical test,
which ran particularly fast in some instances.
4 threads should be fine. the current OpenMP support is written to be
very effective at a small to moderate number of threads and picks up
overhead with an increasing number of threads. the cutoff is around
6-10 threads. beyond that, a different threading strategy, more
similar to what is done for GPUs, would be more efficient. it depends a
lot on the specific hardware.
I found however, that once the number of cores exceeded 32, I would have a
realloc error thrown while organizing atoms at the beginning when using 4
OpenMP threads. When I increased the number of cores to 128, I would have
the same error with 2 OpenMP threads, so I was stuck with only having 1
thread. I am aware that increasing the number of cores can create greater
memory strain because of allocating the memory to all of the processes, but
I had expected memory usage to decrease with increasing OMP threading. Am I
mistaken?
yes. the memory info that LAMMPS prints is the *per MPI process*
memory, not total memory. with OpenMP the memory use will *always*
have to go up, since individual threads will have to work on local,
per thread memory, if only on the stack. that stack memory will not be
shown in the LAMMPS memory output, which only displays large
allocations, i.e. is a lower limit of how much memory is allocated
(which can be different from how much is used, too). with more threads
this memory use will go up, since some large storage areas, e.g. the
storage for forces, will be multiplied by the number of threads: each
thread is working on its own copy of the force array, and only at
the end of the force computation is there a reduction of all
force data into the storage for the first thread, which coincides with
the storage for serial runs. this was needed to keep the newton's 3rd
law optimization and have only minimal overhead for avoiding the race
condition that would occur if concurrent threads updated the forces.
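
to make the per-thread force storage described above a bit more concrete, here is a minimal sketch in python. it is not LAMMPS code: the toy 1-d pair force, all function names, and the assumption of 3 double precision values per atom are made up for illustration, and the "threads" are simulated with a plain loop. it only shows the data layout: one private force array per thread, newton's 3rd law applied inside each private copy, and a final reduction into the first thread's array.

```python
# Illustrative sketch only, not LAMMPS code. All names are hypothetical.

def pair_force(xi, xj):
    """Toy 1-d pair force on atom i due to atom j (assumed form)."""
    return xj - xi


def forces_serial(x, pairs):
    """Reference: a single force array, each pair visited once."""
    f = [0.0] * len(x)
    for i, j in pairs:
        fij = pair_force(x[i], x[j])
        f[i] += fij   # force on i
        f[j] -= fij   # newton's 3rd law: equal and opposite force on j
    return f


def forces_threaded(x, pairs, nthreads):
    """Same result, but every (simulated) thread owns a private force array."""
    # One full-size copy per thread: this is where the memory use grows.
    f_thr = [[0.0] * len(x) for _ in range(nthreads)]
    for tid in range(nthreads):            # each pass stands in for one thread
        for i, j in pairs[tid::nthreads]:  # this thread's share of the pairs
            fij = pair_force(x[i], x[j])
            f_thr[tid][i] += fij           # private copy, so no race condition
            f_thr[tid][j] -= fij
    # Reduction: add all other copies into the first thread's array.
    for tid in range(1, nthreads):
        for k in range(len(x)):
            f_thr[0][k] += f_thr[tid][k]
    return f_thr[0]


def force_storage_bytes(natoms, nthreads, ndim=3, bytes_per_value=8):
    """Rough size of the force storage alone (assumes doubles, 3 components)."""
    return natoms * ndim * bytes_per_value * nthreads
```

with these assumptions, `force_storage_bytes(1000, 4)` is four times `force_storage_bytes(1000, 1)`, which is the scaling with thread count mentioned above. the real per-process numbers also depend on ghost atoms and other arrays, which this sketch ignores.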
# Example of the important part of the error for 4-omp-threads and 32-mpi-processes (128 CORES)
ERROR on proc 25: Failed to reallocate 1395552 bytes for array atom:f (memory.cpp:66)
I was also surprised that this error can actually resolve itself by
repeatedly attempting (<5 times) until the simulation begins and continues
without failure, while being careful to avoid infinite loops ;). I found
this out because I first used the python timeit module and found that a few
iterations actually ran the simulation without error. Is that expected? This
no. this is not expected. this hints at a cluster where the stack
limits are not properly set. machines using RHEL tend to have very low
default settings, and when using PBS/Torque as resource manager, it can
happen that larger limits are not properly propagated.
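
one way to check this is to print the stack limit from inside a batch job and compare it with what you get in a login shell. the following is a small sketch using the `resource` module from the python standard library (unix only); the helper names are made up for illustration, and running it through the job script is an assumption about your setup.

```python
import resource


def describe_limit(value):
    """Format a limit given in bytes; RLIM_INFINITY means no limit is set."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // 1024} KiB"


def stack_limits():
    """Return the (soft, hard) stack size limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack limit: soft={soft} hard={hard}")
```

if the value seen inside the job is much smaller than in the login shell, the limits are probably not being propagated by the resource manager. note that this only reports the limit of the process itself; the stack size of additional OpenMP threads is controlled separately (the OpenMP standard defines the `OMP_STACKSIZE` environment variable for that).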
makes me wonder if memory management on this system is volatile (or some
libs are poorly built) since I can accidentally jam a simulation through
when normally it would throw an error.
this is very unlikely.
The problem is not the input files because they work well on all other
systems. Is this something that happens on some systems in particular?
Perhaps there are some ways to build with better memory management, or maybe
I am stuck. This is using the Intel icc compiler.
the intel compilers (you'd rather be using the c++ frontend icpc, btw)
are known to be more greedy in terms of stack requirement, and - as i
mentioned before - the way multiple threads work in general puts a
higher demand on stack space (for local variables), so what may
work without threading could be a problem with threads.
If anyone has experience with this sort of problem I would be interested in
hearing any thoughts on resolving it, or it may simply be a common problem
with the memory allocation from having too many MPI processes.
no.
one more thing to consider is processor/memory affinity. with today's
NUMA systems, there is a significant performance benefit to retain
processes on the same CPU cores. with MPI only, this is simple, since
MPI is a "share nothing" scheme, where you can just tie each MPI
process to a single core and be done. with OpenMP, this gets a bit
complicated, because you want to have all threads be located at least
on the same socket, better within cores that share as many caches as
possible, and you don't want threads belonging to the same MPI process
to be spread across multiple sockets, as that would reduce the available
memory bandwidth and make caching less efficient.
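
as a small illustration of the placement rule above, the following python sketch computes which cores each MPI process could use so that all of its threads stay on one socket. the function name and the socket/core numbering are hypothetical (real core numbering varies between machines, and the sketch assumes neighboring core ids on a socket are the ones sharing caches). it only plans the layout; the actual binding has to be done by the MPI launcher or the OpenMP runtime.

```python
def plan_placement(sockets, threads_per_rank):
    """Assign a group of cores on a single socket to each MPI rank.

    sockets: list of lists of core ids, one inner list per socket
             (hypothetical numbering, adjust to the actual machine).
    Returns a list with one core list per MPI rank.
    """
    plan = []
    for cores in sockets:
        # Only hand out complete groups; leftover cores on a socket stay
        # unused rather than pairing them with cores of another socket.
        last_start = len(cores) - threads_per_rank
        for start in range(0, last_start + 1, threads_per_rank):
            plan.append(cores[start:start + threads_per_rank])
    return plan
```

for a hypothetical node with two quad-core sockets, `plan_placement([[0, 1, 2, 3], [4, 5, 6, 7]], 4)` gives one MPI process per socket, while 2 threads per process gives four processes with two neighboring cores each.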
hope this helps. let me know if you have any additional questions.
axel.