Dear all,
I have encountered a problem when increasing the number of cores
in my simulation. I ran the same input script on 64, 512, 2048,
and 8192 cores. The runs on 64, 512, and 2048 cores worked fine,
but on 8192 cores the simulation never got past the setup phase
before it was killed by the scheduling system.
do you have the message from the scheduler indicating the reason?
The system I am running is a bulk system of water with pbcs. The number of
atoms is approximately 5 million. The box dimensions are 800 x 80 x 800 Angstroms.
Could the problem be related to the shape of the box?
unlikely.
Has anyone encountered similar problems before?
yes. not sure if this is still an issue, but a while back, many MPI
libraries for InfiniBand networks used pinned memory (i.e. memory with
a fixed mapping between address space and physical address, which
makes it non-swappable) in O(N**2) fashion (with N being the total
number of MPI tasks), because it gave better performance in several
benchmarks. at over 512 cores, this amount of inaccessible RAM would
get close to 1GB per core and thus force applications to swap
excessively or crash after exceeding the preconfigured per-process
memory limits. the solution was to set an option or environment
variable to use a "shared receive queue" (SRQ) instead, which has O(N)
memory consumption.
Does anyone have an idea of where I should start to search for the reason
of the problem?
the first place to look is the logs of the batch system, and the second
option is to ask the system administrator of the cluster for assistance.
unless you accidentally hit a case where LAMMPS tries to allocate
memory based on an uninitialized variable (which can usually be
identified with valgrind), there is very little in LAMMPS that could
cause problems.
the second, less likely case would be if you have domains that
contain no atoms. a while back, we did some careful testing of a lot
of features to make them compatible with this, but some changes to
the contrary may have crept in since. quite a few features in LAMMPS
are developed for dense systems and not tested on sparse simulations,
causing the occasional surprise. given the increasing popularity of
LAMMPS, though, these things are less likely to go unnoticed for
inputs that don't use any "exotic" features.
HTH,
axel.