Bugs in maintenance of shear history across restart files?

I think there may be two bugs in LAMMPS affecting granular simulations in which shear history is stored and restart files are also used.

The first is that the shear history may not be written correctly to the restart file. The information to be written to the restart file is gathered in a buffer; for shear information, this is done by the pack_restart function in FixShearHistory. When called, pack_restart adds the data stored in the partner, npartner and shearpartner arrays to the buffer for writing. The shearpartner array, which contains the shear displacements, is updated when the pre_exchange function of FixShearHistory is called: the displacements for the current timestep (in the shear array) are copied to shearpartner. The problem is that pre_exchange is run only when the neighbour lists are rebuilt, and the neighbour lists are not necessarily rebuilt immediately before the restart file is written. The shear information that is written could therefore be out of date. I think this can be fixed by changing the following code in WriteRestart::command…

if (domain->triclinic) domain->x2lamda(atom->nlocal);
domain->pbc();
domain->reset_box();
comm->setup();
comm->exchange();
comm->borders();
if (domain->triclinic) domain->lamda2x(atom->nlocal+atom->nghost);

write(file);
delete [] file;

…to read as follows, i.e., forcing a neighbour list rebuild:

if (modify->n_pre_exchange) modify->pre_exchange();
if (domain->triclinic) domain->x2lamda(atom->nlocal);
domain->pbc();
domain->reset_box();
comm->setup();
if (neighbor->style) neighbor->setup_bins();
comm->exchange();
if (atom->sortfreq > 0) atom->sort();
comm->borders();
if (domain->triclinic) domain->lamda2x(atom->nlocal+atom->nghost);
if (modify->n_pre_neighbor) modify->pre_neighbor();
neighbor->build();

write(file);
delete [] file;

Perhaps not all of these lines are necessary, and it would be more efficient to invoke these additions only when shear history is stored; an 'if' condition to check for this could be included, along the lines of the sketch below.
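For illustration, a minimal sketch of what such a check might look like. It assumes the shear-history fix is registered under the style string "SHEAR_HISTORY", as the granular pair styles do, so please treat it as a sketch rather than a tested change:

// sketch only: detect whether a shear-history fix exists before forcing
// the extra pre_exchange/reneighbouring work in WriteRestart::command
int shearhistory = 0;
for (int ifix = 0; ifix < modify->nfix; ifix++)
  if (strcmp(modify->fix[ifix]->style,"SHEAR_HISTORY") == 0) shearhistory = 1;
if (shearhistory) {
  // ... execute the additional lines above (pre_exchange, setup_bins, sort,
  //     pre_neighbor, build) only in this case ...
}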

A second, and more serious, bug occurs when shear history is being restored from a restart file. The unpack_restart function in FixShearHistory is invoked to set npartner, shearpartner and partner correctly. The init function of the Neighbor class is run and the NeighList constructor is called for each list to be built. The Verlet::setup function is called next, before the neighbour lists are actually built (e.g. by Neighbor::build calling Neighbor::granular_bin_no_newton). Verlet::setup runs the line modify->setup_pre_exchange(), which causes FixShearHistory::pre_exchange to be run. At this stage list->inum is zero; it is only set when the neighbour lists are built. Within pre_exchange, all entries in the npartner array are set to zero. The loop that follows is intended to repopulate npartner, but because inum is zero, no iterations of the loop are run, so npartner remains filled with zeros. Finally, on the neighbour list rebuild (e.g. Neighbor::granular_bin_no_newton), all of the touchptr and shearptr entries are zeroed because of the zeroed npartner entries, and the shear history is lost.

As a temporary repair, I made a small change in FixShearHistory::pre_exchange so that it reads as follows:

if (list->inum > 0)
  for (i = 0; i < nlocal; i++) npartner[i] = 0;

This seems to work based on some tests I have done, and ensures that the shear information is preserved. I’m not fully certain of the implications though, e.g., what happens if a restart file is written and read on systems with differing numbers of processors.
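In case it helps anyone reading this, here is a much-simplified sketch (abbreviated, not the actual LAMMPS source) of the structure of pre_exchange that the guard relies on. When inum is zero the repopulation loop below does nothing, so the only damage comes from the unconditional zeroing, which the guard removes:

// simplified sketch of FixShearHistory::pre_exchange (not verbatim)
int nlocal = atom->nlocal;
// guarded zeroing: only clear npartner when a built neighbour list exists
// from which it can be repopulated
if (list->inum > 0)
  for (int i = 0; i < nlocal; i++) npartner[i] = 0;
// repopulation: copy touch flags and shear displacements for touching pairs
// from the neighbour list back into partner/npartner/shearpartner;
// this loop runs zero times when list->inum == 0, leaving the values read
// from the restart file intact
for (int ii = 0; ii < list->inum; ii++) {
  // ... per-atom copy for atom list->ilist[ii] ...
}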

I would be interested to hear any thoughts and comments.

Regards,

Kevin Hanley

Postdoctoral researcher

Dept. of Civil and Environmental Engineering

Imperial College London, SW7 2AZ

Thanks for the careful descriptions - I'll look into it.
For the first issue with writing restart files, is this
only when using the write_restart command,
and not the restart command (which writes restart
files periodically during a run)? The latter
does reneighbor before writing each restart file.

Steve

I've seen this issue only with the write_restart command. The restart command seems to be unaffected, presumably because of the neighbour list rebuilding before writing, as you said.

Kevin

I just posted both of these fixes as a 16Jun patch.
I think they were both correct assessments of the
problem and solution. I had known there was some
issue with doing identical restarts from files that
had granular shear history info in them, but hadn't
been able to track down the issue with inum=0 and
the first pre_exchange() wiping things out.
Nicely done!

I ran a test of the bench/chute problem with a restart
after 500 steps. Printing thermo output every timestep,
including pressure, the match between a continued
run and a 2nd restarted run is now much closer. But
it is still not exact even on the first few steps. So
that's still puzzling to me. Maybe there is some other
quantity in granular models that isn't preserved exactly
enough across a restart? Any ideas?

Also, Peter - I am hoping this fix, which turned out to
be quite simple if you read below, will also fix the issues
with granular restarts that you sent me months ago.
Can you check if this patch does the trick for your problems
as well?

Thanks,
Steve

Steve - glad to see this is still on the radar. I tested the Jun17-12 version and found that the patch fixes a subset of the continuation problems, but didn't fix the restarts. In addition, upon using minimization (fire algorithm), both run continuation and restarts failed.

Having said that, I had patched a version of LAMMPS late last fall that solved these problems, along with QA tests for run continuation and restarts with history-dependent granular potentials. We have been happily using that codebase in our granular work ever since.

The QA tests worked for multiple granular potentials, with dynamics and statics, both in serial and in parallel. I will dig out my old email along with the tarballs containing the patches, simple QA tests that you can run, and an explanation of the changes I had made at the time.

Kevin - perhaps you can take a look as well to compare the changes you made to those I proposed to Steve and decide what might work best.

Thanks,

Peter

Peter - thank you for sending on those files. I'll have a look at them this afternoon and send you an e-mail.

Steve - I know that Dr Christoph Kloss already contacted you about the addition to fix_shear_history.cpp being commented out and hence inactive.

I'm not sure that adding the call to FixShearHistory::pre_exchange in WriteRestart::command is sufficient. I've run some tests this morning, and it seems that the shear history is still not correct unless the neighbour lists are rebuilt before the restart file is written.

I may have one explanation. In the WriteRestart::command function, Comm::setup, exchange and borders are run to transfer atoms between processors. If the neighbour lists are not rebuilt, list->inum and list->ilist remain unchanged, despite the number of atoms on each processor potentially changing. This is sorted out when the neighbour lists are rebuilt by Verlet::setup. However, FixShearHistory::pre_exchange is run before this, and it makes use of both inum and ilist. Since these were not updated after the atoms were transferred (and inum > 0 if the run is continued after the restart file is written by write_restart), the shear history may be subtly changed. There may also be other effects caused by not rebuilding the neighbour list.
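To spell out the ordering I am worried about (the comments are my own reading of the code, so treat this as a sketch rather than a statement of fact):

// within WriteRestart::command
comm->exchange();   // atoms can migrate between processors here
comm->borders();    // local and ghost atom indexing can change
// if FixShearHistory::pre_exchange runs anywhere between here and the next
// rebuild, it still walks the old list via list->inum and list->ilist, whose
// entries refer to the atom ordering from before the exchange, so the copied
// partner/shearpartner data could be subtly wrong
// ...
neighbor->build();  // only at the rebuild do inum/ilist become consistent again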

On the subject of restart files, I've noticed that a continued run and a run containing a restart will differ slightly if fix_deform is used with the remap x option (probably true for any fix that repositions particles). This is because the velocity-Verlet loop is changed for timestep '0' on a restart. As you well know, the normal integration order is initial_integrate (the first half of v, and x), force computation in the pair style, and final_integrate (the second half of v). At the end of the timestep, the particle positions are remapped to the deformed box. On a restart, Verlet::setup first calculates the forces, as these are needed for initial_integrate. Since the normal force is calculated from positions, and these were changed by the fix_deform remapping but not by initial_integrate, everything thereafter will be out of sync compared with a run without restart files.
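For reference, here is a simplified outline of the order of operations in one timestep of Verlet::run (paraphrased, with communication and the reneighbouring decision omitted), which shows that the remap and the restart write both come after the force computation:

// simplified outline of one timestep in Verlet::run (not verbatim)
modify->initial_integrate(vflag);   // first half of v update, full x update
// (forward communication or neighbour list rebuild happens here)
force_clear();
force->pair->compute(eflag,vflag);  // granular forces and shear increments
if (modify->n_post_force) modify->post_force(vflag);
modify->final_integrate();          // second half of v update
if (modify->n_end_of_step) modify->end_of_step();  // fix_deform remaps x here
if (ntimestep == output->next) output->write(ntimestep);  // restart written after the remap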

In fact, I bodged the integration loop a bit to avoid this issue when testing continuity of shear history across a restart file. If you want identical restarts, I think this could be achieved by writing the restart file on the last timestep either before or after the force computation (it doesn't matter which because of shearupdate), but certainly before the final_integrate step. On a restart, Verlet::setup calculates the forces, which will be the same as before. It is then necessary to complete the final timestep by running post_force, final_integrate and the end_of_step functions (including the position remapping within domain caused by fix_deform). By doing this, it is possible to have perfect continuity across a restart file for these particular fixes which move particles. As you might expect, the differences are small, though.

Regards,

Kevin

Darn - I was hoping Kevin's bug fix would fix everything.

I'm remembering more about this history now. My
recollection is that Peter's "fix" was more invasive
and brute force - i.e. it did some things that were
computationally costly to fix the problem. And
it was complicated enough I didn't understand it.
So I didn't add it to the code, and ran out of
time/brainpower to look into it more (sorry, Peter).

I'm still hopeful there is only a minor bug with
how the continue/restart of granular history
is operating, similar to the bug Kevin found.

So a precise patch that fixes the bug, instead
of reworking how it is done, would be sufficient.

Kevin, if you can look into this more, that
would be great. Peter if there is a simple
problem (few atoms, few timesteps) that
illustrates the problem(s) that still remain
after Kevin's fix, that would be very helpful.
Maybe that is one of the problems in your
earlier tarball?

Thanks to you both,
Steve

Just saw Kevin's subsequent post -
yes, the inefficiency I was referring to
was calling pre_exchange all the time.

Kevin, can you explain more about why
inum and ilist could be bad if atoms are
transferred - I'm not seeing it.

Also, do the remaining problems show
up only in parallel, or also on a 1-proc
continue/restart run? If also in serial,
then there are no atom transfers in that
case.

Steve

Yes, I think there might still be a small disparity in shear history even for a one-processor serial run. I'll look into this further when I get an opportunity later this week and let you know if I find anything. As regards the inum/ilist issue I mentioned, it is probably a misunderstanding on my part. I'll check when I look again at the shear history.

Kevin

Steve & Kevin -

As I recall, my fix was more invasive because the real root issue was that the shear history information had to be resynchronized in the post-force phase of the integration step. This is a consequence of how the neighbor list interacts with the history-dependent potential. The problems arise, I think, because the shear neighbor data can update *during* the force evaluation: changes in particle positions can put particles into (or out of) contact, which mandates updating the contact list details in order to stay correctly synchronized.
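To illustrate the mechanism, here is a schematic of the contact-state update inside a history-dependent granular pair style's compute(). It is simplified, not verbatim LAMMPS code, and not my actual patch:

for (jj = 0; jj < jnum; jj++) {
  // ... compute separation rsq for the pair (i,j) ...
  if (rsq >= radsum*radsum) {
    // contact broken during this force evaluation
    touch[jj] = 0;
    shear = &allshear[3*jj];
    shear[0] = shear[1] = shear[2] = 0.0;
  } else {
    // contact formed or maintained; tangential history accumulates
    touch[jj] = 1;
    shear = &allshear[3*jj];
    // ... shear[0..2] updated with this step's tangential displacement ...
  }
}
// any per-atom copy of this information (partner/shearpartner) made before
// compute() ran is now stale; resynchronizing it after the force call keeps
// the copies consistent with the neighbour-list state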

I'm surprised at the computational overhead concerns - my original tests showed almost no difference in runtimes for the patched vs. original codebase. I tried running a few of the testcases I have (32000 particles on 12 cores) this morning on our cluster and saw less than a 1% difference in runtimes. I'll try a few more tests when I get a chance to see whether the additional post-force updates become timesinks at larger system sizes and core counts - our clusters are totally overloaded at the moment, though, and I'm fighting for resources.

Peter

p.s. from my tarball - try any of the restart examples - they all failed in the patched 17Jun12 version - you need only check in the output file whether the printouts of the energy and stresses at the global timesteps are exactly identical where the restarts occur. The input files are called in.<potential>-restart ...