[lammps-users] Failed to Reallocate %d bytes for array %s

I am trying to help a user resolve a strange problem
with LAMMPS. I can run his model on up to 64 cores, but
on more cores it always fails with some version of the
message in the subject line. This is identical to the
problem posted by Haofei Zhou on 11 May 2010, but that
thread never reached a resolution. Both the LAMMPS
manual and the postings suggest this should never
happen. As far as I can tell, nothing involving virtual
processors is being invoked.

We are running LAMMPS-20Feb10 with the Intel 11.1
compiler and MVAPICH2-1.4 on dual quad-core Xeon nodes
with 8GB of RAM each and SDR InfiniBand.

On 4 nodes (32 cores), the output file shows:

... PBS stuff removed ...
LAMMPS (20 Feb 2010)
Reading data file ...
  orthogonal box = (-686.234 -686.234 -686.234) to (686.234 686.234 686.234)
  2 by 4 by 4 processor grid
  85184 atoms
Setting up run ...
Memory usage per processor = 512.909 Mbytes
Step Temp KinEng PotEng TotEng
       0 450 114261.45 34127167 34241428
..................
... clipped ...
..................
Exit Status:
Job ID: 478777.qb2
Username: jalupo
Group: loniadmin
Job Name: LAMMPS_Test
Session Id: 18976
Resource Limits:
ncpus=1,nodes=4:ppn=8,walltime=00:05:00
Resources Used:
cput=00:00:00,mem=13092kb,vmem=199388kb,walltime=00:05:06
Queue Used: workq
Account String: TG-STA080000N
Node: qb311
Process id: 19763

On 8 nodes (64 cores), the output file shows:

... PBS stuff clipped ...
LAMMPS (20 Feb 2010)
Reading data file ...
  orthogonal box = (-686.234 -686.234 -686.234) to (686.234 686.234 686.234)
  4 by 4 by 4 processor grid
  85184 atoms
Setting up run ...
Memory usage per processor = 264.611 Mbytes
Step Temp KinEng PotEng TotEng
       0 450 114261.45 34127167 34241428
..................
... clipped ...
..................
Exit Status:
Job ID: 475595.qb2
Username: jalupo
Group: loniadmin
Job Name: LAMMPS_Test
Session Id: 4406
Resource Limits:
ncpus=1,nodes=8:ppn=8,walltime=00:05:00
Resources Used:
cput=00:00:00,mem=22428kb,vmem=275860kb,walltime=00:05:09
Queue Used: workq
Account String: TG-STA080000N
Node: qb413
Process id: 5258

The memory usage per processor reported by LAMMPS is
going down as one might expect, but the mem and vmem
usage reported by PBS have gone up.

On 16 nodes (128 cores), the output shows:

... PBS stuff clipped ...
LAMMPS (20 Feb 2010)
Reading data file ...
  orthogonal box = (-686.234 -686.234 -686.234) to
(686.234 686.234 686.234)
  4 by 4 by 8 processor grid
  85184 atoms
Setting up run ...
ERROR on proc 22: Failed to reallocate 1125856 bytes for array atom:v
ERROR on proc 65: Failed to reallocate 1125856 bytes for array atom:v
ERROR on proc 57: Failed to reallocate 1125856 bytes for array atom:v
Exit code -5 signaled from qb048
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 65
MPI process (rank: 65) terminated unexpectedly on qb170
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 57
MPI process (rank: 57) terminated unexpectedly on qb183
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 22
MPI process (rank: 22) terminated unexpectedly on qb343
... PBS stuff clipped ...

When runs on more than 64 cores (8 nodes) fail, no
usage information is reported. I've checked 9 nodes
(72 cores), 12 nodes (96 cores), and 16 nodes (128
cores).

Jim

jim,

are you enabling shared receive queues (SRQ) on the infiniband?
if not, please try it. many infiniband installations by default
pin a memory buffer on each MPI task for every other MPI task,
and that O(N**2) memory consumption can starve jobs of memory
as the number of nodes increases.
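
as a rough illustration (the per-peer buffer size here is an
assumed ballpark number, not a measured one): with 128 ranks and,
say, ~0.3 MB of pinned buffer per peer connection, each node loses

  8 ranks/node x 127 peers x ~0.3 MB = ~300 MB per node

to pinned MPI buffers before LAMMPS allocates a single atom array,
and that figure keeps growing as you add nodes.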

cheers,
    axel.

To add to Axel's comment, whenever I've seen this kind of error:

ERROR on proc 22: Failed to reallocate 1125856 bytes

and the memory size is not a negative number (which would indicate
some humongous invalid allocation request), the problem is with MPI,
not LAMMPS. It is somehow locking down memory and not
allowing the running process to allocate even though there
is available space. I saw this more often years ago, and assumed
most MPI implementations (e.g. OpenMPI was a culprit) had fixed it.
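
One generic thing worth checking on the compute nodes, as a quick
sanity check independent of LAMMPS, is the locked-memory limit,
since InfiniBand stacks pin memory:

  # run on a compute node, or inside the job script:
  ulimit -l   # max locked memory; IB stacks generally want "unlimited"

If this prints a small number, the MPI library can run out of
pinnable memory long before physical memory is exhausted.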

Steve

I will add that your box size is huge (~1400^3) but your
atom count is modest (~85K atoms), so if your cutoff
is short, it is possible that the neighbor list is trying
to bin that huge (mostly empty?) space and running out of
memory, which would be more likely on small numbers
of procs. When it runs successfully on 64 procs,
how much memory does LAMMPS say it is using
per processor?
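
To put a rough number on that, assuming a neighbor cutoff of 12
distance units (the input script wasn't posted): LAMMPS bins the
domain at about half the neighbor cutoff, so with a box edge of
1372.468 and a bin size of 6,

  1372.468 / 6 = ~229 bins per side, and 229^3 = ~12 million bins

At even a few bytes of bookkeeping per bin, that is tens of
Mbytes per process for nearly empty bins. Each process only bins
its own subdomain, so the cost per process shrinks as the proc
count grows.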

Steve

a couple of additional remarks:

i assume this is on queen bee, right?
then you might want to run on only half the cores,
or better yet, upgrade to the multi-threaded
pair styles in my LAMMPS-ICMS branch. we tested
on abe, and for large node counts, using half the cores
(but still all the cache!!) or switching to MPI+OpenMP
led to massive speedups of 2x to 4x at the same node
counts compared to best-effort all-MPI runs, and
significantly better parallel efficiency for MPI alone
when reducing the MPI tasks per node. check out:
http://sites.google.com/site/akohlmey/software/lammps-icms
and particularly the pdfs of the poster and talk at the bottom.
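
as a sketch of what such a launch could look like (the -npernode
flag and the binary name are assumptions; adjust for your mpi
and queue system):

  # 8 nodes, 4 MPI tasks per node (half the cores), 2 OpenMP threads each
  export OMP_NUM_THREADS=2
  mpirun -np 32 -npernode 4 ./lmp_icms -in in.test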

cheers,
    axel.

Thanks for the replies.

I'll have to have one of our systems folks look into
the IB configuration.

We can try recompiling with different MPIs. Besides
MVAPICH2-1.4, we have OpenMPI-1.3.4 and MVAPICH-1.1
available to start with.

On 32 cores, the memory usage per processor was
512.909 Mbytes. On 64 cores, it was 264.611 Mbytes.

Memory per node is 8GB.

Jim

Yes, this is on Queen Bee.

Looks like you've given us many options to try.

I appreciate the help!

Jim

Your numbers below mean that when you run on 16 procs you will
need about 1GB per MPI proc (one proc per core). If your system
is 8 cores/node and you only have 8GB per node, then you will be
in trouble. And the memory size LAMMPS prints out is actually an
underestimate, as it doesn't count everything (e.g. extra memory
in 2d arrays for pointers).
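
Spelling out the extrapolation from the numbers you posted:
memory per proc roughly doubles when the proc count halves, so

  32 procs -> 512.9 Mbytes/proc, hence 16 procs -> ~2 x 512.9 = ~1GB/proc

and 8 procs/node x ~1GB/proc = ~8GB, i.e. the entire node.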

Steve

with OpenMPI you can use:

--mca btl_openib_use_srq 1

as an option to mpirun to enable SRQ. on our machines
i have this set as the default in
openmpi/etc/openmpi-mca-params.conf. i found that this
speeds up most applications and makes OpenMPI
competitive with, or even faster than, MVAPICH.
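
for completeness, the two equivalent ways to set it (the
executable name is a placeholder):

  # per job, on the mpirun command line:
  mpirun -np 128 --mca btl_openib_use_srq 1 ./lmp_openmpi -in in.test

  # or system-wide, as one line in openmpi/etc/openmpi-mca-params.conf:
  btl_openib_use_srq = 1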

cheers,
    axel.

Steve,

I don't quite follow your logic. The memory per
processor dropped by almost half going from 32 cores
to 64 cores, and going to 128 cores (16 nodes) should
see another drop. At 64 cores, 264.611 Mbytes per
processor implies only about 2.1GB used per node out
of the 8GB available. That suggests even less memory
per node at 128 cores, even allowing for overhead and
loose estimates.

Jim

I thought your earlier message said you were getting
this error on 16 or 32 cores. If you are actually running
on 16 nodes (128 cores) then you are correct, the memory
per core should decrease. So maybe it is the MPI memory
lock-down issue. I would still figure out why
the box is so big with so few atoms.

Steve

I was able to run on up to 14 nodes (112 cores) using
MVAPICH2, LAMMPS-ICMS, and setting an environment
variable: "MV2_USE_SRQ=1", but it continued to fail on
15 or more nodes. I switched over to OpenMPI-1.3.4,
used mpirun option "-mca btl_openib_use_srq 1", and
LAMMPS-ICMS ran fine on 16 nodes (128 cores).
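
For reference, the two working setups were along these lines (the
launcher details and binary name are my placeholders, not the
exact job scripts):

  # MVAPICH2-1.4: enable shared receive queues via the environment
  export MV2_USE_SRQ=1
  mpiexec -np 112 ./lmp_icms -in in.test

  # OpenMPI-1.3.4: enable SRQ on the mpirun line
  mpirun -np 128 -mca btl_openib_use_srq 1 ./lmp_icms -in in.test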

I've passed the issue of the overly large box on to
the user.

Thanks for all the help!

Jim