LAMMPS getting "stuck" - GPU + mpi + addforce

Dear lammps-users

I have a strange problem - my lammps runs get stuck - almost randomly

The setup:

LAMMPS compiled with yes-gpu on (1) openmpi / GTX 480 (2) intel-MPI / Fermi

Running short (50000 - 100000 steps) NVT simulations for 4140 atoms

with harmonic bonds, harmonic angles, opls dihedrals

non-bond interactions are

pair_style lj/cut/coul/cut/gpu 9.0

neighbour lists are unmodified; the "control" file is at the end of the mail

"inserting" a particle

fix dep1 part deposit 1 3 1000000 12345 region reg1 near 1.25 &
    attempt 5000 vx 0.001 0.001 vy 0.001 0.001 vz 0.001 0.001 units box

and adding a force to it

fix force1 part addforce 0.0 ${fy} 0.0

fy is a variable assigned at run time from the command line, with a
float value between 0 and 20.
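For reference, such a variable is typically supplied via LAMMPS's -var command-line switch. A small Python driver that sweeps fy might look like the sketch below (the input-file and executable names are placeholders, not taken from the original mail):

```python
import subprocess

def lammps_cmd(fy, infile="in.control", exe="lmp"):
    """Build a command line that passes fy into the input script,
    where it is referenced as ${fy}. File/executable names are placeholders."""
    return [exe, "-var", "fy", f"{fy:.3f}", "-in", infile]

# e.g. sweep fy from 0 to 20 in steps of 0.5
commands = [lammps_cmd(i / 2) for i in range(41)]
# each entry could be launched with subprocess.run(cmd, check=True)
```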

I am also using

dump edata part custom 1 part.edata.y.${fy} id c_uke c_upe c_dispart[4] vx vy vz

The simulation gets "stuck"
last observed
fy= 16.709

ERROR on proc 0: Bond atoms 2567 2680 missing on proc 0 at step 178

at other runs,
fy=14.031
several of

WARNING: Dihedral problem: 1 25552 4132 3983 3984 3986

and a

ERROR on proc 1: Bond atoms 3984 3983 missing on proc 1 at step 25553

and several of

Cuda driver error 4 in call at file 'geryon/nvd_timer.h' in line 44.

then stuck.

another similar:
fy=14.612

At other times, under identical conditions, hundreds of such runs
proceed without any issue.

I don't know what you mean by "stuck". You listed
various error messages printed out. When LAMMPS hits
an error where it prints such a message, it exits.
With a warning, it keeps going, since it can recover.

As to why that would happen in some runs and not others,
that's an issue for your simulation. If you are doing
suspect things with your insertions and relaxation in a
randomized manner, then sometimes you may get lucky
and sometimes you won't.

Steve

Dear lammps-users

I have a strange problem - my lammps runs get stuck - almost randomly

[...]

at other runs,
fy=14.031
several of
>>>>WARNING: Dihedral problem: 1 25552 4132 3983 3984 3986
and a
>>>>ERROR on proc 1: Bond atoms 3984 3983 missing on proc 1 at step 25553
and several of
>>>>Cuda driver error 4 in call at file 'geryon/nvd_timer.h' in line 44.

then stuck.

if this happens more or less randomly, my first suspicion would
be that the GPU is not working correctly. that may be due to overheating
or a faulty GPU or faulty memory. to make sure that this is not an issue,
i would recommend running the CUDA GPU memtest for a while.

http://sourceforge.net/projects/cudagpumemtest/

[...]

I _expect_ the bond atoms to go missing and the dihedral problem to
crop up, and I _want_ LAMMPS to exit when this happens: it just means
the particle from "fix dep1" has picked up too much velocity from "fix
force1", or has hit some of the other atoms with violence. But
sometimes LAMMPS exits, and sometimes it does not.

if you force floating-point based software like lammps into
doing something that may create overflows, then you are
on your own. it is not feasible to have checks for NaN or
other invalid floating point operations everywhere; those
would slow down the code massively. some CPU/compiler
combinations can generate "hardware traps" for that: the
DEC Alpha processor was able to do that, and the GCC
compilers have an -ftrapv flag (though that one traps signed
integer overflow, not floating-point overflow). of course,
you'll have to add a signal handler for that.
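To illustrate the trap idea at the script level (a sketch of the general technique, not anything built into LAMMPS): NumPy can promote silent floating-point overflow into an exception, which is the software analogue of enabling a hardware trap.

```python
import numpy as np

# Promote overflow/invalid results from silent warnings to exceptions,
# the software analogue of enabling hardware floating-point traps.
np.seterr(over="raise", invalid="raise")

def scale_velocities(v, factor):
    """Scale a velocity array; raises FloatingPointError on overflow."""
    return v * factor

v = np.array([1e308])
try:
    scale_velocities(v, 10.0)   # 1e309 exceeds the float64 range
    trapped = False
except FloatingPointError:
    trapped = True
```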

Simulations without the additional particle run fine - with nve/nvt/npt.
Simulations _with_ the additional particle but without the addforce
run fine (tested in ~5,000 individual runs) - with nve/nvt.
***

Question 1: Is there a way to explicitly ask lammps to quit at the
first sign of trouble - at _any_ warning or error?

not really. see my explanations above.

Question 2: Is there a way to stop execution if particle deposition is
unsuccessful?

you have the source code; you can try to insert corresponding code.
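Short of patching the source, one can at least verify after the fact that every attempted insertion succeeded, by comparing the atom count reported in the thermo output against the deposit schedule. A hedged sketch: it assumes the first insertion lands at step `every` and that (step, natoms) pairs have already been parsed from the output; both helper names are mine.

```python
def expected_atoms(step, base, every, maxcount):
    """Atoms expected at `step` if every insertion succeeded: fix deposit
    adds one particle every `every` steps, up to `maxcount` particles."""
    return base + min(step // every, maxcount)

def check_deposits(observed, base, every, maxcount):
    """observed: list of (step, natoms) pairs, e.g. parsed from thermo output.
    Return the first step at which an insertion is missing, or None."""
    for step, natoms in observed:
        if natoms < expected_atoms(step, base, every, maxcount):
            return step
    return None
```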

axel.