Problem with a loop in the input script

Dear LAMMPS users,

I am trying to run the following calculation (see input file below). Basically, I need to remove one atom from my simulation and optimize the structure of my system. I repeat this calculation 6038 times using a loop variable; each time I remove a different atom, whose ID is read from a file called id.dat.
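
A sketch of the kind of loop I am using (file names and minimization settings here are placeholders, not the actual attached input):

```
# sketch only: one pass deletes one atom and minimizes
variable i loop 6038
label loopstart
clear
# extract the ${i}-th atom ID from id.dat via the shell:
shell head -n ${i} id.dat > head.dat
shell tail -n 1 head.dat > tail.dat
variable number file tail.dat
read_data data.system            # placeholder data file
# ... pair_style / kspace_style / etc. go here ...
group doomed id ${number}
delete_atoms group doomed
minimize 1.0e-6 1.0e-8 1000 10000
next i
jump SELF loopstart
```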

This calculation runs fine for a certain number of steps (somewhere in the range of 50-2000) and then stops producing any output. The calculation is still “running”, but it simply does not update the output files. I have tried LAMMPS on different computing clusters and I always get the same problem. If I reduce the number of steps in my loop, the calculation runs fine.

Is there a limit to the number of steps I can have in a loop? Has anyone experienced this problem already?

Thank you very much!

Dario

Does the looping just stop suddenly, at some non-reproducible iteration count? Or does it slow down and grind to a halt?

Can you check whether you are leaking memory by running “top” in another window while the looping is going on?

I can run the attached script (on 4 procs in a few seconds)
for 10K iterations, using the attached tmp.ids file as well.

In your script I don’t get the head/tail shell logic. You can have a file-style variable simply read successive lines from a file, as in my script, so I’m not sure why you are doing that.
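
The idiom I mean is something like this (a sketch, not the attached script itself):

```
variable id file tmp.ids      # assigns the first line of tmp.ids
label loop
print "working on atom ID ${id}"
# ... do the work for this ID ...
next id                       # advance to the next line; the loop
                              # ends when the file is exhausted
jump SELF loop
```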

Also note that variables are not “cleared” by the clear command (see the doc page), so I’m not sure what will happen when you have this line in your loop:

variable number file tail.dat

for a file that is changing every iteration. It probably does nothing to the already-created variable, i.e. it does not change its file pointer, which seems bad if you are changing the file via the shell commands.
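
In other words (a sketch of the two patterns, with placeholder file names):

```
# problematic pattern: inside the loop, after "clear", this line is
# likely ignored, because the variable "number" already exists --
# its old file pointer is kept:
variable number file tail.dat

# safer pattern: define the file variable once, before the loop,
# and advance it explicitly each iteration:
variable number file id.dat    # once, at the top of the script
# ... inside the loop ...
next number                    # reads the next line of id.dat
```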

Steve

in.loop (624 Bytes)

tmp.ids (24.9 KB)

Dear Steve,

Thanks for the reply!

The looping stops suddenly at a certain (usually non-reproducible) step. The step finishes and the next one simply doesn’t start for some reason. It is a sudden stop, not a slow down.

I have not been able to check for memory leaks thoroughly, as I am running it on a big cluster and I cannot access the nodes where my calculation is running. I have monitored it for some runs on a local cluster here at MIT and it seems fine; I use 4-8 GB out of the 32 available. The runs I have monitored, however, are “good” runs, as they haven’t hung so far.

You are right that my head/tail logic was complicated and awkward. I simply did not realize that you can “next” a file variable in LAMMPS. I have changed the script following the one you sent me (attached). I managed to get one run to complete with this script, though some others have hung. So there is an improvement, but the problem is not completely solved…

One final comment: I am running some pretty large calculations. I am simulating ~200,000 ions interacting with a Buckingham potential (plus a Coulomb term). I run these calculations on many processors (~100-500). I don’t know if the size of the system introduces extra complications compared to your simpler LJ case.

Any other suggestion?

Thanks again,

D

STO.in (1.6 KB)

I am just mentioning this in case it is related.

I have been experiencing a similar problem with my simulations, but under a different circumstance. I tend to just restart my simulations, assuming they run long enough, because it doesn’t always occur. I had assumed it was a personal issue, since I hadn’t seen anyone else mention anything regarding it, and it didn’t show up until I did a new compile of LAMMPS (lmp_shock_9Aug13).

As for the details of my problem, I use the “every” keyword to reset the kspace grid while using NPT, e.g.:

fix NPT command

run 100000 every 1000 “”

I haven’t done much to troubleshoot, but I noticed that whenever it does get stuck, it is always at the 1000-step mark, when it recalculates.
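
For context, a fuller sketch of the pattern (the fix parameters here are placeholders; my actual line differs):

```
fix 1 all npt temp 300.0 300.0 0.1 iso 1.0 1.0 1.0
# break the long run into 1000-step chunks; the empty string means
# no command is executed between chunks, but setup (including the
# kspace grid) is redone at the start of each chunk:
run 100000 every 1000 ""
```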

Michael

guys,

would you mind trying out the attached patch (relative to 12 Feb git version)?

there has been a file handle leak in the input processing that should be plugged by these changes. the fix has been in LAMMPS-ICMS for a bit, but didn't yet make it into the upstream version.

thanks,
      axel.

lammps-input-fd-leak-fix.diff.gz (893 Bytes)

Dear All,

Indeed, it seems that the problem affects only the latest release of LAMMPS. I am now running these calculations with the previous one (Sep 2013) and those have completed alright!

Thanks!

Dario

can you please produce and post an as-simple-as-possible test case, e.g. by reducing the problem size and removing any parts of the input that are unrelated? it doesn't have to be a physically meaningful calculation; it only needs to reproduce the issue as quickly as possible, so it can be used for debugging.

thanks,
      axel.


I have not been able to check for memory leaks thoroughly, as I am running it on a big cluster and I cannot access the nodes where my calculation is running.

Ask your admins to install PADB (http://padb.pittman.org.uk/); then you can monitor your job as it is running, even through a batch system.
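
For example (exact option names may vary between PADB versions; check the PADB documentation):

```
# show stack traces for all processes of a running MPI job,
# merged into a call tree:
padb --all --stack-trace --tree

# or gather a full report for one resource-manager job id:
padb --full-report=<jobid>
```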

i would consider this a measure of last resort. it only makes sense if the size of the problem is part of the problem, and it rarely is. steve and i regularly track down all kinds of issues easily when people provide us with a simple, stripped-down input deck. in some cases we can do it by ourselves, but having some help there not only helps us to help others, it is also a simple, general procedure for verifying a complex workflow: it can be easily and quickly run and debugged on a multi-core workstation and then transferred into a production calculation.

one additional issue to warn about is the excessive use of the "shell" command. large clusters often use infiniband as the interconnect, and there are some restrictions on applications that use the "system" c-library function. it is better avoided as much as possible.

axel.

Dear Axel and all,

I managed to log in to the nodes where the job was running when it hung. Even when the job was hung, it was using only 15% of the available RAM, so I don’t think this is a memory leak problem.

Axel, I have tried what you suggested and produced an as-simple-as-possible test case for you guys to run. The problem is that the simple case runs just fine! However, I think I have narrowed down the cause of the problem. When I use a simple test case (see input file below), it runs fine. However, if I run the same calculation starting from a restart file, then it hangs! I also attach the input script for the latter run. So the problem is somehow related to the read_restart option. Also, remember that I have this problem ONLY with the latest release of LAMMPS.

If you want to reproduce the problem, you can use the first input file to generate a restart file and the second to run the “problematic” calculation.
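
In other words, something like this (a sketch; file names are placeholders for the attached inputs):

```
# first input: run the simple case and write a restart file
# ... setup and run ...
write_restart STO.restart

# second input: the "problematic" run starts from that file
read_restart STO.restart
# ... same settings and minimization as before ...
```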

Hope this helps,

Dario

Simple run input file