[lammps-users] puzzling errors

Dear Lammps Community

I am testing some new jobs (all similar in nature) and I got the following errors, which I don't understand at all. I would appreciate any input from your side.

  1. C++ runtime abort: terminate() called by the exception handling mechanism

In the above case, just before the run failed, the pressure and temperature seem to reach wild values (pressures jumping from a normal 100-200 range to around 4000, and temperature from a normal 300 to 600), although such values appear at only 2-3 points.

  2. In 2 of my runs, more than 8 GB of output was generated, most of it consisting of the following:

Dihedral problem: %d %d %d %d %d %d

Conformation of the 4 listed dihedral atoms is extreme; you may want to check your simulation geometry.

Now, such a job failed in one case and didn't fail in another. I am sure why it failed in the first case, since the disk quota was exceeded. But in the second case, while it kept adding to this huge file, the log file with trajectory data looked fine. Pressure values, with some exceptions, looked quite OK. Dihedral energies were normal and remained constant. Also, probing further along the lines of the above suggestion in the LAMMPS manual, I see that my initial dihedrals (with some exceptions near 90) are close to 0, 60 or 180, and are thus pretty OK. My energy parameters seem OK. So I don't know where this is coming from.
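To double-check a flagged dihedral, the four atom coordinates (taken, say, from a dump snapshot) can be run through the standard dihedral-angle formula. A minimal Python sketch, where the function name and sample coordinates are my own, not anything from LAMMPS:

```python
import math

def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle (degrees) defined by four points, via the
    standard atan2 formulation; 0 = cis, +/-180 = trans."""
    def sub(a, b):
        return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
    def cross(a, b):
        return (a[1]*b[2]-a[2]*b[1],
                a[2]*b[0]-a[0]*b[2],
                a[0]*b[1]-a[1]*b[0])
    def dot(a, b):
        return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
    # bond vectors along the four-atom chain
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    # normals to the two planes spanned by (b1,b2) and (b2,b3)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    m = cross(n1, n2)
    norm = math.sqrt(dot(b2, b2))
    # signed angle between the two bond planes
    return math.degrees(math.atan2(dot(m, b2) / norm, dot(n1, n2)))

# trans (anti) backbone -> close to 180 degrees
print(dihedral_deg((0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0)))
# cis backbone -> close to 0 degrees
print(dihedral_deg((0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)))
```

Values near 0, 60 or 180 are the normal conformations mentioned above; anything wildly outside those, or a near-linear 1-2-3 or 2-3-4 angle, is what triggers the "Dihedral problem" warning.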

Now I am not sure if 1 and 2 are really related.

  3. In this case, the job ended prematurely with a "bond atoms missing" error. Although wild oscillations in pressure, temperature and energy appeared around timestep 1.6 million, it continued until timestep 1.77 million with a stable trajectory profile in every quantity. Now this is something I should probe on my own, but I am mentioning it here for the sake of completeness and as a complement to the above.

Please let me know if anybody has any inputs on anything above.

Thanks in advance

Chetan

All of this (except file sizes and disk quotas) is consistent with
your system blowing up - i.e. losing atoms, dihedrals becoming
bogus. So you need to figure out why that is happening. Printing
out thermo every timestep right before the blow-up starts is a good idea,
as you will likely see temperature or pressure do bad things.

Steve

Hi Steve

This is in reply to this message as well as your reply to my earlier email
about the memory error (LAMMPS trying to allocate more than 2 GB per proc
for a system of 5000 atoms).

I have now completed 2 similar runs, each with 2000-3000 more atoms, very
successfully and without any error.

So my first question is: is there any possibility of a scientific error, or
should I focus on system issues?

Interestingly, for the job with the lowest number of atoms (5000), when I
increased the memory beyond 2 GB, the job ran for longer than with less
memory but still failed later. It failed with the same C++ runtime abort
error, which might again be due to the same memory issues.

Now, while it failed, the thermodynamic data was flawless until the last 20
timesteps (each of 0.5 fs). It's only in the final timesteps before failure
that the pressure and energy became wild.

thermo_style custom step cpu press temp pe ke etotal enthalpy evdwl ecoul
epair ebond eangle edihed emol elong vol

Column     step 1635220     step 1635240
cpu        3907.42          3907.642
press      13.607673        -40108.276
temp       340.59735        3853.8928
pe         1560.6874        34084.738
ke         6084.4292        68845.92
etotal     7645.1166        102930.66
enthalpy   7660.2367        55602.399
evdwl      3033.6264        2953.4679
ecoul      10077.156        10297.697
epair      -6396.558        -6248.3665
ebond      2958.9411        34279.467
eangle     2404.7583        3356.1391
edihed     2593.546         2697.4979
emol       7957.2453        40333.104
elong      -19507.34        -19499.531
vol        76189.377        80911.572

My understanding is that if there were a scientific error in the model, the
failure should develop gradually before the job fails, not this suddenly,
within 20 timesteps. Or are 20 timesteps sufficient time for a failure to
develop?

Does this tie back to my first question, hinting at system issues rather
than issues in the model?

Thanks

Chetan

I don't know what you are asking or why your model requires 2 GB for 5000 atoms,
even on 1 proc. But your thermo output shows the temperature going from
340K to 3850K in 20 timesteps. So that looks like a problem.

Steve

Hi Steve

Thanks
Since it's working well for higher numbers of atoms, is the job failure for
5000 atoms due to running it on as many as 64 procs?

I have solvated polymer chains with a full atomistic description and
potential, Ewald sums, etc., and so I am running even a small system of
5000 atoms on such a high number of procs. I am told by somebody that the
ideal for LAMMPS is 1 proc per 1000 atoms, although I can't find that in
the manual. Is it true?

Thanks
Chetan

chetan,

I have solvated polymer chains with full atomistic description and
potential, ewald sums etc and so I am running it on such a high number of
procs even a small system of 5000 atoms. I am told by somebody that ideal
is 1 proc per 1000 atoms for lammps, although I cant find it in manual.
Is it true?

why do you ask this, when it is _so_ easy to test it yourself.

just run a series with increasing number of nodes and find out
what is the optimal number for your case.

due to lammps supporting a very large number of very different
potentials, there is no single optimal number. it all depends on
the combination of potentials and the hardware you run on.

i have seen jobs scale out with as few as 32 atoms per
processor (that was on a BG/L) and with more than 10000 atoms
per processor (that was due to kspace not scaling that well).
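The series Axel describes is easy to script. A small Python helper that just builds the command lines for a strong-scaling test; the binary name `lmp` and the file names are placeholders for your setup, while `-in` and `-log` are standard LAMMPS command-line options:

```python
def scaling_commands(input_file, proc_counts, binary="lmp"):
    """Build mpirun command lines for a strong-scaling series:
    the same input run on an increasing number of processors,
    each writing its own log file for later timing comparison."""
    return [f"mpirun -np {n} {binary} -in {input_file} -log log.{n}"
            for n in proc_counts]

# a doubling series up to the 64 procs used for the 5000-atom job
for cmd in scaling_commands("in.polymer", [1, 2, 4, 8, 16, 32, 64]):
    print(cmd)
```

Comparing the "Loop time" lines in the resulting log files shows where adding processors stops paying off for this particular combination of potentials and hardware.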

cheers,
   axel.

There is no fundamental reason you cannot run 5000 atoms on 64 procs.
If something is crashing then it could be a bug in LAMMPS, or more likely,
a bug in your input script or model. Whether the simulation will
run efficiently in a parallel sense with 5000/64 = fewer than 100 atoms
per proc is another question which, as Axel indicates, is hard to
predict a priori.

Steve