Colvars: Error: in the state file , the “metadynamics” block has a different replicaID (1 instead of 2)

i have access to a cluster with a job time limit of 24 hours, so i’d like to figure out how to restart my multiple-walker metadynamics simulation daily with the COLVARS package.

my production simulation has >1M atoms on 20 nodes of 80 cores, partitioned by node (i.e. lmp -p 20x80), so i created a minimal example with the LJ benchmark:
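for context, the daily driver i have in mind looks roughly like this (a sketch only; the filenames match the minimal example below, and the commented-out mpirun line is a placeholder for the real launcher):

```shell
# pick the input script depending on whether a restart file already exists:
# first day runs the initial input, every following day runs the restart input
if ls colvars-lj-*.restart >/dev/null 2>&1; then
  INPUT=colvars-lj-restart.in
else
  INPUT=colvars-lj.in
fi
echo "selected input: $INPUT"
# mpirun -np 8 lmp_niagara -p 4x2 -in "$INPUT"
```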

colvars-lj.in

# mpirun -np 8 lmp_macos -in colvars-lj.in -p 4x2
# mpirun -np 8 lmp_niagara -in colvars-lj.in -p 4x2

units           lj
atom_style      atomic

# ERROR: Fix colvars requires an atom map, see atom_modify (src/COLVARS/fix_colvars.cpp:389)
# https://matsci.org/t/problems-with-using-fix-colvars/20910/2
atom_modify     map yes

lattice         fcc 0.8442
region          box block 0 20 0 20 0 20
create_box      1 box
create_atoms    1 box
mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

fix             1 all nve
fix             2 all colvars colvars-lj.colvars output colvars-lj seed 12345

run             100
write_restart   colvars-lj-%.restart

colvars-lj-restart.in

# mpirun -np 8 lmp_macos -in colvars-lj-restart.in -p 4x2
# mpirun -np 8 lmp_niagara -in colvars-lj-restart.in -p 4x2

# ERROR: Fix colvars requires an atom map, see atom_modify (src/COLVARS/fix_colvars.cpp:389)
# https://matsci.org/t/problems-with-using-fix-colvars/20910/2
atom_modify     map yes

read_restart   colvars-lj-%.restart

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

fix             1 all nve
fix             2 all colvars colvars-lj.colvars output colvars-lj seed 12345

run             100

colvars-lj.colvars

colvar {
  name foo
  rmsd {
    atoms { atomNumbersRange 1-3 }
    refPositions { (0,0,0) (0,0,0) (0,0,0) }
  }
  lowerBoundary 0
  upperBoundary 30
  width 0.25
}

metadynamics {
  colvars foo
  hillWeight 0.1
  hillWidth 2.0
  newHillFrequency 10
  multipleReplicas on
  replicasRegistry /tmp
  replicaUpdateFrequency 100
}

$ less log.lammps.2

colvars: Error: in the state file , the “metadynamics” block has a different replicaID (1 instead of 2).
ERROR on proc 0: Fatal error in the collective variables module.
(src/COLVARS/colvarproxy_lammps.cpp:294)

should i be using a fix_modify Colvars load "<oldjob>.colvars.state" for each replica? Based on the COLVARS LAMMPS manual, this should not be needed:

“Note that the Colvars state is already loaded automatically as part of the LAMMPS restart file, when this is read via the LAMMPS read_restart command; the “load” method allows to load a different state file after the fact.” [COLVARS LAMMPS manual 3.6.1]

Is there an issue restarting with partitions? Any other suggestions please…

This happens with both the macOS and Linux versions, built from recent LAMMPS git repos (7 Feb 2024 and 17 Apr 2024).

Hi @alphataubio,

Section 6.4.7 of the manual mentions that the replicaID argument defaults to MPI’s numeric indexing. Since you do not seem to be setting it yourself, this is what happens when the colvars fix is initialized.

So it is likely that there is a conflict between the replicaID saved in the restart files and the value that is automatically set using the MPI indexing from the fix command, since the latter has no reason to be the same across different executions. A solution might be to set it by hand in each of your replicas, or to load the state as you suggested, maybe before reading the LAMMPS restart files.
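For example, setting it by hand would mean giving each walker its own Colvars config file (or a templated copy per partition) with an explicit replicaID in the metadynamics block — a sketch based on your config, where the "walker-1" label is an arbitrary choice:

```
metadynamics {
  colvars foo
  hillWeight 0.1
  hillWidth 2.0
  newHillFrequency 10
  multipleReplicas on
  replicaID walker-1          # set explicitly, instead of relying on MPI indexing
  replicasRegistry /tmp
  replicaUpdateFrequency 100
}
```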

because the error happens when fix colvars is created, it’s not possible to use a fix_modify either before or after it to set a replicaID in each partition. chicken-and-egg problem…

with further debugging, i see error message(s) like

colvars: Saving collective variables state to “colvars-lj.colvars.state”.
colvars: Error: in renaming file “colvars-lj.pmf” to “colvars-lj.pmf.BAK”.

so it looks like the different partitions may be clobbering the colvars state file, by all trying to write to the same restart file, which then contains only one replicaID.

i’m still investigating. @giacomo.fiorin any suggestions how to resolve this bug?

Hi @alphataubio thanks for tagging me, I didn’t see this message right away.

Indeed I think the problem here is that the replicas have an identical output prefix, and so they will overwrite each other’s output files. For multi-replica simulations LAMMPS offers the partition command or world-style variables.

Since you have 4 replicas, you could probably have something like:

variable RID world 1 2 3 4
fix             2 all colvars colvars-lj.colvars output colvars-lj-${RID} seed ${RID}

Would that work?

FYI the Colvars doc page that you linked refers to the Colvars master branch, which is not yet integrated in the LAMMPS distribution. Are you using that or a standard LAMMPS release?

Thanks
Giacomo

yes i am using partitions with the -p command-line option. after your suggestion there are now 4 files colvars-lj-1.colvars.state, …, colvars-lj-4.colvars.state with the correct replicaID instead of just one colvars-lj.colvars.state. However i’m surprised to still be getting the same error:

colvars-lj.zip (39.7 KB)

$ grep Error log.lammps.*
log.lammps.0:colvars:   Error: in the state file , the "metadynamics" block has a different replicaID (2 instead of 0).
log.lammps.1:colvars:   Error: in the state file , the "metadynamics" block has a different replicaID (2 instead of 1).
log.lammps.3:colvars:   Error: in the state file , the "metadynamics" block has a different replicaID (2 instead of 3).

maybe it is related to the comment “// TODO call write_output_files()” in
lammps/src/COLVARS/fix_colvars.cpp?

void FixColvars::write_restart(FILE *fp)
{
  if (me == 0) {
    std::string rest_text;
    proxy->serialize_status(rest_text);
    // TODO call write_output_files()
    const char *cvm_state = rest_text.c_str();
    int len = strlen(cvm_state) + 1; // need to include terminating null byte.
    fwrite(&len,sizeof(int),1,fp);
    fwrite(cvm_state,1,len,fp);
  }
}

by happy coincidence you made me realize there’s a much bigger bug in my code: since i was using fix colvars ... seed 12345, in my production runs my 20 multiple walkers were effectively all sampling the same paths, starting from the same seed. very nice to have a 20X speedup! I suggest this seed issue be documented in the multiple-walkers metadynamics section 6.4.7 of the colvars manual.

i always use a recent git clone of lammps develop branch:

Large-scale Atomic/Molecular Massively Parallel Simulator - 17 Apr 2024
Git info (develop / patch_17Apr2024-8-g628531dadb)

I don’t see all of your output files, but you used replicasRegistry /tmp, i.e. you specified the path of a directory instead of a file. I have not tested the code’s behavior under that condition, but you will certainly want to specify the path of a file.

The output files that you have included in the Zip file look okay to me. Do you still get the error above if you continue that run, or if you start fresh (now that you have specified different outputs for each replica)?

No, that refers to writing output files other than state/restart files, which normally would be written at the end of a successful run. The comment refers to whether these should be written prior to the conclusion of a run as well.

A note could be added, but please do keep in mind that the internal seed for Colvars only affects features that use random numbers, and with your configuration you are not using any.

Instead, to diversify the initial conditions of the replicas, I would use different seeds in the LAMMPS velocity command, where atomic velocities are generated. (Currently, you use the same seed, 87287.)
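Building on the world-style variable suggested earlier, that could look something like this (a sketch; the seed values here are arbitrary):

```
variable VSEED world 87287 12345 54321 99999
velocity all create 1.44 ${VSEED} loop geom
```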

Giacomo

yes i restarted from scratch as follows, with

replicasRegistry /home/~~~~~/lammps/colvars-lj/colvars-lj.replicas

instead of replicasRegistry /tmp:

$ rm -f *BAK *txt log.lammps* screen.* *hills *restart *state *pmf *old *~ *traj *replicas
$ mpirun -np 8 lmp_niagara -p 4x2 -nonbuf -in colvars-lj.in
$ mpirun -np 8 lmp_niagara -p 4x2 -nonbuf -in colvars-lj-restart.in
$ cd ..; zip colvars-lj.zip colvars-lj/*

but i’m still getting the same error:

colvars: Error: in the state file , the “metadynamics” block has a different replicaID (2 instead of 0).
ERROR on proc 0: Fatal error in the collective variables module.
(src/COLVARS/colvarproxy_lammps.cpp:290)
Last command: fix 2 all colvars colvars-lj.colvars output colvars-lj-{RID} seed {RID}

here’s the latest zip file of my directory:

colvars-lj.zip (42.9 KB)

Okay, let me build that LAMMPS snapshot and test it locally.