PRD - lost atoms

Hi Steve,

I’m forwarding this; I forgot to include the lammps mailing list.

Would it help if I got an earlier crash?

Thank you,

Romain

PRD.tar.gz (538 KB)

Yes, the quicker it fails, the better for debugging.

Steve

Hi Steve,

I pushed the temperature up and got a crash after 398000 steps (about 40 minutes on 32 CPUs).

The crash again happens right after an event. On the node where the event was not detected, the simulation restarts with the wrong number of atoms. The other node did not complain at this point.

On a single node, the simulation keeps going, and even a second event is detected later.

Attached is all I have for this crashed run. If you have any questions or if I can do anything else to help, please let me know.

Thank you,

Romain

prd2.tar.gz (187 KB)

UPDATE: I performed the same test on a different machine, and it didn’t crash. I ran on 2x8 CPUs (still 2 nodes), and the run finished normally at the time limit, with 2 events detected.

I will try a large-scale run to make sure it’s OK. It might be a platform problem, although there is nothing special about that machine AFAIK.

I’ll troubleshoot with the admins, but if anybody has a clue as to why this might happen, I’d be happy to hear it.

From what I see, some inter-node communication specific to par-rep causes some atoms to be lost after an event is detected. Regular MD is fine.

Romain

I did some tests and now know what is going wrong in your simulation, but I am less clear whether it is a problem with your model, or something we need to allow for in the code.

The issue is that by the time an event occurs, the two replicas (in your runs) are very different, in the sense that individual atoms (with a specific ID) have moved a long distance.

The problem is when the replica that had the event communicates its new coords to the other replica. Each atom in the other replica has its coords overwritten by the corresponding atom in the event replica. But now atoms can be owned by the wrong processors, in a spatial sense. The step where atoms are “lost” is when communication is done within the replica to migrate the atoms to the correct owning procs. If an atom needs to migrate further than one processor away, it is lost, just as it would be when neighbor lists are re-built in a normal one-replica LAMMPS simulation.
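
For intuition, here is a rough sketch in Python of why a one-hop exchange drops an atom whose new coords land more than one sub-domain away. This is not the actual LAMMPS code; the function names and numbers are invented for the example.

    # Illustrative one-hop atom exchange along one dimension (z).
    # Not LAMMPS source; owning_proc, proc_width, etc. are invented names.

    def owning_proc(z, zlo, proc_width, nprocs):
        """Spatial owner of coordinate z on a 1-d grid of processors."""
        return int((z - zlo) // proc_width) % nprocs

    def exchange_one_hop(old_z, new_z, zlo, proc_width, nprocs):
        """After the event replica overwrites old_z with new_z, the atom is
        only handed to an adjacent proc; a farther owner means it is lost."""
        src = owning_proc(old_z, zlo, proc_width, nprocs)
        dst = owning_proc(new_z, zlo, proc_width, nprocs)
        hops = min(abs(dst - src), nprocs - abs(dst - src))  # periodic grid
        return "kept" if hops <= 1 else "lost"

    # 4 procs in z, box length 40.0 -> sub-domain width 10.0
    print(exchange_one_hop(old_z=5.0, new_z=12.0, zlo=0.0,
                           proc_width=10.0, nprocs=4))   # kept (1 hop)
    print(exchange_one_hop(old_z=5.0, new_z=25.0, zlo=0.0,
                           proc_width=10.0, nprocs=4))   # lost (2 hops)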

That is more likely to happen if you run small systems on lots of procs (per replica), since that means the sub-domains on each proc are smaller. It’s also more likely if you run for a very long time between events, which your model seems to do, partly b/c you are dephasing for a very long time (100K timesteps?).

I ran your system at a high temp (10K) to get it to fail quickly (~1000 or so steps), and an atom with the same ID (in the two replicas) had moved ~1/2 a box length in z. Since there were 4 procs in z, that caused the hop-more-than-one-proc-away problem.
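
As a quick back-of-the-envelope check of those numbers (the box length and units below are arbitrary):

    # Jump of ~half the box in z vs. the sub-domain width with 4 procs in z.
    Lz = 1.0                   # box length in z (arbitrary units)
    nprocs_z = 4
    subdomain = Lz / nprocs_z  # each proc owns 1/4 of the box in z
    jump = 0.5 * Lz            # atom ended up ~1/2 a box length away in z
    print(jump / subdomain)    # 2.0 sub-domains, i.e. more than one proc away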

The Q is why your 2 replicas are becoming so different, without detecting an “event” sooner, as if the two systems are drifting in opposite directions. I thought maybe it was because you are using fix langevin w/out zeroing the momentum, which can cause drift over long times. But I turned that on, and it didn’t help. The initial assigned velocities should also have no COM motion. Can you verify, e.g. via viz or dump files, that the 2 systems have drifted far apart?
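
If it helps, here is one way to check from dump files. It’s only a sketch: it assumes each replica writes a dump with “id type x y z” columns, the file names are placeholders, and wrapped coords can hide drift across periodic boundaries (dumping unwrapped coords xu/yu/zu would be cleaner for measuring drift).

    # Compare the last snapshots of two replica dump files and report how far
    # apart atoms with the same ID have ended up.  File names are placeholders.
    import numpy as np

    def read_last_snapshot(fname):
        """Return {atom_id: xyz} for the final snapshot of a LAMMPS dump file."""
        with open(fname) as f:
            lines = f.readlines()
        snaps, natoms, i = [], 0, 0
        while i < len(lines):
            if lines[i].startswith("ITEM: NUMBER OF ATOMS"):
                natoms = int(lines[i + 1])
                i += 2
            elif lines[i].startswith("ITEM: ATOMS"):
                snap = {}
                for line in lines[i + 1 : i + 1 + natoms]:
                    cols = line.split()
                    snap[int(cols[0])] = np.array([float(c) for c in cols[2:5]])
                snaps.append(snap)
                i += 1 + natoms
            else:
                i += 1
        return snaps[-1]

    a = read_last_snapshot("dump.replica0.lammpstrj")   # placeholder name
    b = read_last_snapshot("dump.replica1.lammpstrj")   # placeholder name

    sep = {aid: np.linalg.norm(a[aid] - b[aid]) for aid in a if aid in b}
    com_a = np.mean(list(a.values()), axis=0)
    com_b = np.mean(list(b.values()), axis=0)
    print("max per-atom separation:", max(sep.values()))
    print("COM separation:         ", np.linalg.norm(com_a - com_b))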

We could invoke a more robust communication after an event is communicated to the other replicas, to ensure no atoms are lost, but it’s more expensive. And I’d like to understand better why this is happening in your model.

Steve