[lammps-users] memory issues with LAMMPS and MPI

Hi all -

I am getting more frequent complaints about LAMMPS throwing the
following error when it starts up on various machines:

A realloc() for a modest amount of memory (~1 MB) fails.
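
(Just to illustrate where a message like that comes from: a checked
allocation wrapper along these lines. This is only a sketch, not the
actual LAMMPS memory code; the function name and error text are made
up for illustration.)

  #include <stdio.h>
  #include <stdlib.h>

  /* hypothetical checked realloc: grow a buffer and bail out with a
     diagnostic if the C library cannot satisfy the request */
  void *checked_realloc(void *ptr, size_t nbytes, const char *name)
  {
    void *newptr = realloc(ptr, nbytes);
    if (newptr == NULL && nbytes > 0) {
      fprintf(stderr, "Failed to reallocate %zu bytes for array %s\n",
              nbytes, name);
      exit(1);
    }
    return newptr;
  }

  int main(void)
  {
    double *buf = NULL;
    /* ~1 MB request, the size scale mentioned above */
    buf = (double *) checked_realloc(buf, 131072 * sizeof(double), "buf");
    free(buf);
    return 0;
  }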

A typical symptom is that the same problem runs on fewer
procs, but fails on a larger number of procs, say 128 procs.

Since LAMMPS uses less memory per proc on more procs,
this is an indication to me that it's an MPI problem, maybe
something to do with how realloc() interacts with MPI.
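
(To put rough numbers on that scaling argument: with a spatial
decomposition each proc owns roughly N/P atoms, so the per-atom part
of the footprint shrinks as P grows. A back-of-the-envelope sketch;
the problem size and per-atom byte count below are assumptions, not
LAMMPS figures.)

  #include <stdio.h>

  /* rough per-proc memory estimate for the per-atom data alone */
  int main(void)
  {
    const double natoms = 1.0e6;          /* assumed problem size */
    const double bytes_per_atom = 1024.0; /* assumed per-atom storage */
    const int procs[] = {16, 32, 64, 128};

    for (int i = 0; i < 4; i++) {
      double mb = natoms / procs[i] * bytes_per_atom / (1024.0 * 1024.0);
      printf("%4d procs: ~%.0f MB of per-atom data per proc\n",
             procs[i], mb);
    }
    return 0;
  }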

Most recently, I am hearing about this problem with MVAPICH. But I
think it has also happened with OpenMPI.

Any ideas? Could this be an MPI configuration issue? E.g. when you
run on lots of procs, does MPI use a lot of memory to set up buffers
for (possibly) communicating with all those procs? If so, are there
settings in either MVAPICH or OpenMPI to get around this?

Steve

hi steve,

> I am getting more frequent complaints about LAMMPS throwing the
> following error when it starts up on various machines:
>
> A realloc() for a modest amount of memory (~1 MB) fails.
>
> A typical symptom is that the same problem runs on fewer
> procs, but fails on a larger number of procs, say 128 procs.

i would not be surprised if another common denominator is
that this is happening on machines with infiniband, and particularly
on nodes that have two quad-core cpus.

> Since LAMMPS uses less memory per proc on more procs,
> this is an indication to me that it's an MPI problem, maybe
> something to do with how realloc() interacts with MPI.
>
> Most recently, I am hearing about this problem with MVAPICH. But I
> think it has also happened with OpenMPI.

i've seen this with both of them.

> Any ideas? Could this be an MPI configuration issue? E.g. when you
> run on lots of procs, does MPI use a lot of memory to set up buffers
> for (possibly) communicating with all those procs? If so, are there
> settings in either MVAPICH or OpenMPI to get around this?

from what i could find out so far, it seems to be related to the use
of a remote direct memory access protocol (RDMA). i first saw this
happening on the NCSA abe machine. it looks as if, to be able to
communicate via RDMA with all nodes, each MPI task has to allocate
one local, non-swappable(!!) block of memory for every MPI task it is
talking to. this is small for small jobs, but explodes for really
large jobs, so in effect you are running out of memory because the IB
layer is claiming all of it.
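
(a quick estimate of why this blows up; the per-peer buffer size below
is an assumption, actual numbers depend on the IB stack and its
settings:)

  #include <stdio.h>

  /* pinned (non-swappable) RDMA buffer memory per node, if every MPI
     task keeps one buffer per peer it talks to */
  int main(void)
  {
    const double peer_buf_kb = 256.0;  /* assumed per-peer buffer size */
    const int tasks_per_node = 8;      /* e.g. a dual quad-core node */
    const int total_tasks[] = {64, 256, 1024, 4096};

    for (int i = 0; i < 4; i++) {
      double gb = tasks_per_node * (double)(total_tasks[i] - 1)
                  * peer_buf_kb / (1024.0 * 1024.0);
      printf("%5d tasks total: ~%.1f GB pinned per node\n",
             total_tasks[i], gb);
    }
    return 0;
  }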

there is not much you can do but add more memory or use fewer tasks
per node. for dual quad-core nodes at high node counts i found that
using half the cores is more efficient (you get double the cache
memory per MPI task, which reduces the memory bandwidth load, and
there is less communication). with OpenMPI you can make it even more
efficient by using processor affinity (e.g. by setting
mpi_paffinity_alone = 1 in your ~/.openmpi/mca-params.conf file or
via the command line).

also with OpenMPI you can reduce the load on RDMA buffers by setting
mpi_leave_pinned = 1 (you can check the effectiveness of this by
setting mpi_rdma_print_stats = 1). i found that LAMMPS is not as
demanding in this regard as other applications i'm supporting for the
people in our group (for some of those, already 32 MPI tasks on an
old myrinet cluster with single-core dual-opteron nodes can make you
run out of RDMA buffers).
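
(for reference, this is how those settings could go into the parameter
file, or on the command line; just a sketch, adjust to your setup:)

  # ~/.openmpi/mca-params.conf
  mpi_paffinity_alone = 1
  mpi_leave_pinned = 1
  # uncomment to print RDMA buffer usage statistics:
  # mpi_rdma_print_stats = 1

or equivalently:

  mpirun --mca mpi_paffinity_alone 1 --mca mpi_leave_pinned 1 <usual lammps command line>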

these openmpi settings need openmpi v1.2.3 or later, IIRC.

hope that helps,
    axel.