I am trying to help a user resolve a strange problem
with LAMMPS. I can run his model on up to 64 cores, but
on more cores it always fails with some version of the
subject message. This is identical to the problem
posted by Haofei Zhou on 11 May 2010, but that thread
does not lead to a resolution. Both the LAMMPS manual
and the postings suggest this should never happen. As
far as I can tell, nothing involving virtual processors
is being invoked.
We are running LAMMPS-20Feb10, built with the Intel 11.1
compiler and MVAPICH2-1.4, on dual quad-core Xeon nodes
with 8GB RAM and SDR InfiniBand.
On 4 nodes (32 cores), the output file shows:
... PBS stuff removed ...
LAMMPS (20 Feb 2010)
Reading data file ...
orthogonal box = (-686.234 -686.234 -686.234) to (686.234 686.234 686.234)
2 by 4 by 4 processor grid
85184 atoms
Setting up run ...
Memory usage per processor = 512.909 Mbytes
Step Temp KinEng PotEng TotEng
0 450 114261.45 34127167 34241428
..................
... clipped ...
..................
Exit Status:
Job ID: 478777.qb2
Username: jalupo
Group: loniadmin
Job Name: LAMMPS_Test
Session Id: 18976
Resource Limits:
ncpus=1,nodes=4:ppn=8,walltime=00:05:00
Resources Used:
cput=00:00:00,mem=13092kb,vmem=199388kb,walltime=00:05:06
Queue Used: workq
Account String: TG-STA080000N
Node: qb311
Process id: 19763
On 8 nodes (64 cores), the output file shows:
... PBS stuff clipped ...
LAMMPS (20 Feb 2010)
Reading data file ...
orthogonal box = (-686.234 -686.234 -686.234) to (686.234 686.234 686.234)
4 by 4 by 4 processor grid
85184 atoms
Setting up run ...
Memory usage per processor = 264.611 Mbytes
Step Temp KinEng PotEng TotEng
0 450 114261.45 34127167 34241428
..................
... clipped ...
..................
Exit Status:
Job ID: 475595.qb2
Username: jalupo
Group: loniadmin
Job Name: LAMMPS_Test
Session Id: 4406
Resource Limits:
ncpus=1,nodes=8:ppn=8,walltime=00:05:00
Resources Used:
cput=00:00:00,mem=22428kb,vmem=275860kb,walltime=00:05:09
Queue Used: workq
Account String: TG-STA080000N
Node: qb413
Process id: 5258
The memory usage per processor reported by LAMMPS goes
down as one might expect, but the mem and vmem figures
reported by PBS have gone up.
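As a rough cross-check of those numbers, here is a small
sketch that turns the per-processor figure into a per-node
total. It assumes that all 8 ranks on a node use about the
reported amount, that the 8GB is per node, and that it can
be taken as 8192 Mbytes; none of this is confirmed by the
output above. The helper name is only illustrative.

```python
# Figures copied from the LAMMPS output above; 8 ranks per node is from ppn=8.
RANKS_PER_NODE = 8
NODE_RAM_MB = 8 * 1024  # assumption: 8GB per node, taken as 8192 Mbytes

# cores -> "Memory usage per processor" in Mbytes, as reported by LAMMPS
reported = {32: 512.909, 64: 264.611}


def per_node_mbytes(per_proc_mb, ranks_per_node=RANKS_PER_NODE):
    """Estimated LAMMPS memory per node if every rank uses per_proc_mb."""
    return per_proc_mb * ranks_per_node


for cores, mb in reported.items():
    total = per_node_mbytes(mb)
    share = 100.0 * total / NODE_RAM_MB
    print(f"{cores} cores: {total:.1f} Mbytes per node ({share:.0f}% of RAM)")
```

Under those assumptions the 32-core run needs roughly half
of a node's RAM and the 64-core run roughly a quarter, so
the LAMMPS estimate is moving in the expected direction.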
On 16 nodes (128 cores), the output shows:
... PBS stuff clipped ...
LAMMPS (20 Feb 2010)
Reading data file ...
orthogonal box = (-686.234 -686.234 -686.234) to (686.234 686.234 686.234)
4 by 4 by 8 processor grid
85184 atoms
Setting up run ...
ERROR on proc 22: Failed to reallocate 1125856 bytes for array atom:v
ERROR on proc 65: Failed to reallocate 1125856 bytes for array atom:v
ERROR on proc 57: Failed to reallocate 1125856 bytes for array atom:v
Exit code -5 signaled from qb048
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 65MPI process (rank: 65) terminated unexpectedly on qb170
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 57MPI process (rank: 57) terminated unexpectedly on qb183
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 22MPI process (rank: 22) terminated unexpectedly on qb343
... PBS stuff clipped ...
When runs on more than 64 cores fail, no usage
information is reported. I've checked on 9 (72 cores),
12 (96 cores), and 16 (128 cores) nodes.
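For reference, a quick sketch of what each processor grid
reported above implies per rank. It assumes the atoms are
spread evenly over the box, which the output does not
confirm, so the atom counts are averages only. The function
name is made up for this post.

```python
# Values copied from the LAMMPS output above.
N_ATOMS = 85184
BOX_SIDE = 2 * 686.234  # box runs from -686.234 to 686.234 in each direction

# cores -> processor grid reported by LAMMPS
grids = {32: (2, 4, 4), 64: (4, 4, 4), 128: (4, 4, 8)}


def summarize(grid):
    """Return (ranks, average atoms per rank, sub-domain edge lengths)."""
    px, py, pz = grid
    ranks = px * py * pz
    # Assumption: uniform atom distribution, so this is only an average.
    atoms_per_rank = N_ATOMS / ranks
    edges = tuple(BOX_SIDE / p for p in grid)
    return ranks, atoms_per_rank, edges


for cores, grid in grids.items():
    ranks, atoms, edges = summarize(grid)
    dims = " x ".join(f"{e:.1f}" for e in edges)
    print(f"{cores} cores: {atoms:.1f} atoms per rank, sub-domain {dims}")
```

On average each rank owns fewer atoms as the core count
grows, so the atom count alone does not obviously explain
why the reallocation of atom:v fails only on the larger
runs.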
Jim