Colvars: Error: in the state file , the “metadynamics” block has a different replicaID (1 instead of 2)

i have access to a cluster with a job time limit of 24 hours, so i’d like to figure out how to restart my multiple-walker metadynamics simulation daily with the COLVARS package.

my production simulation has >1M atoms on 20 nodes of 80 cores, partitioned by node (i.e. lmp -p 20x80), so i created a minimal example with the LJ benchmark:

colvars-lj.in

# mpirun -np 8 lmp_macos -in colvars-lj.in -p 4x2
# mpirun -np 8 lmp_niagara -in colvars-lj.in -p 4x2

units           lj
atom_style      atomic

# ERROR: Fix colvars requires an atom map, see atom_modify (src/COLVARS/fix_colvars.cpp:389)
# https://matsci.org/t/problems-with-using-fix-colvars/20910/2
atom_modify     map yes

lattice         fcc 0.8442
region          box block 0 20 0 20 0 20
create_box      1 box
create_atoms    1 box
mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

fix             1 all nve
fix             2 all colvars colvars-lj.colvars output colvars-lj seed 12345

run             100
write_restart   colvars-lj-%.restart

colvars-lj-restart.in

# mpirun -np 8 lmp_macos -in colvars-lj-restart.in -p 4x2
# mpirun -np 8 lmp_niagara -in colvars-lj-restart.in -p 4x2

# ERROR: Fix colvars requires an atom map, see atom_modify (src/COLVARS/fix_colvars.cpp:389)
# https://matsci.org/t/problems-with-using-fix-colvars/20910/2
atom_modify     map yes

read_restart   colvars-lj-%.restart

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

fix             1 all nve
fix             2 all colvars colvars-lj.colvars output colvars-lj seed 12345

run             100

colvars-lj.colvars

colvar {
  name foo
  rmsd {
    atoms { atomNumbersRange 1-3 }
    refPositions { (0,0,0) (0,0,0) (0,0,0) }
  }
  lowerBoundary 0
  upperBoundary 30
  width 0.25
}

metadynamics {
  colvars foo
  hillWeight 0.1
  hillWidth 2.0
  newHillFrequency 10
  multipleReplicas on
  replicasRegistry /tmp
  replicaUpdateFrequency 100
}

$ less log.lammps.2

colvars: Error: in the state file , the “metadynamics” block has a different replicaID (1 instead of 2).
ERROR on proc 0: Fatal error in the collective variables module.
(src/COLVARS/colvarproxy_lammps.cpp:294)

should i be using fix_modify Colvars load "<oldjob>.colvars.state" for each replica? Based on the COLVARS LAMMPS manual, this is not needed:

"Note that the Colvars state is already loaded automatically as part of the LAMMPS restart file, when this is read via the LAMMPS read_restart command; the "load" method allows to load a different state file after the fact." [COLVARS LAMMPS manual 3.6.1]

Is there an issue restarting with partitions? Any other suggestions please…

This is happening with both the macos and linux versions, built from recent lammps git repos (7 Feb 2024 and 17 Apr 2024).

Hi @alphataubio,

section 6.4.7 of the manual mentions that the replicaID argument is set by default using MPI’s numeric indexing. Since you do not seem to be setting it yourself, this is what happens when the colvars fix is initialized.

So it is likely that there is a conflict between the replicaID saved in the restart files and the value that is automatically set from the MPI indexing by the fix command, since the latter has no reason to be the same across different executions. A solution might be to set it by hand in each of your replicas, or to load the state as you suggested, maybe before reading the LAMMPS restart files.
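For example (an untested sketch): if I read the Colvars reference correctly, the metadynamics block accepts an explicit replicaID keyword, so you could give each walker its own config file, e.g. a hypothetical colvars-lj-2.colvars for walker 2:

```
metadynamics {
  colvars foo
  hillWeight 0.1
  hillWidth 2.0
  newHillFrequency 10
  multipleReplicas on
  # set explicitly instead of relying on MPI partition indexing
  replicaID 2
  replicasRegistry /tmp
  replicaUpdateFrequency 100
}
```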

because the error happens at fix colvars, it’s not possible to use fix_modify either before or after to set a replicaID in each partition. chicken-and-egg problem…

with further debugging, i see error message(s) like

colvars: Saving collective variables state to “colvars-lj.colvars.state”.
colvars: Error: in renaming file “colvars-lj.pmf” to “colvars-lj.pmf.BAK”.

so it looks like maybe the different partitions are clobbering the colvars state, by all trying to write to the same restart file, which then ends up with only one replicaID.

i’m still investigating. @giacomo.fiorin, any suggestions on how to resolve this bug?

Hi @alphataubio thanks for tagging me, I didn’t see this message right away.

Indeed I think the problem here is that the replicas have an identical output prefix, and so they will overwrite each other’s output files. For multi-replica simulations LAMMPS offers the partition command or world-style variables.

Since you have 4 replicas, you could probably have something like:

variable RID world 1 2 3 4
fix             2 all colvars colvars-lj.colvars output colvars-lj-${RID} seed ${RID}

Would that work?

FYI the Colvars doc page that you linked refers to the Colvars master branch, which is not yet integrated in the LAMMPS distribution. Are you using that or a standard LAMMPS release?

Thanks
Giacomo

yes i am using partitions with the -p command-line option. after your suggestion there are now 4 files colvars-lj-1.colvars.state, …, colvars-lj-4.colvars.state with the correct replicaID, instead of just one colvars-lj.colvars.state. however i’m surprised to still be getting the same error:

colvars-lj.zip (39.7 KB)

$ grep Error log.lammps.*
log.lammps.0:colvars:   Error: in the state file , the "metadynamics" block has a different replicaID (2 instead of 0).
log.lammps.1:colvars:   Error: in the state file , the "metadynamics" block has a different replicaID (2 instead of 1).
log.lammps.3:colvars:   Error: in the state file , the "metadynamics" block has a different replicaID (2 instead of 3).

maybe it is related to the comment “// TODO call write_output_files()” in
lammps/src/COLVARS/fix_colvars.cpp?

void FixColvars::write_restart(FILE *fp)
{
  if (me == 0) {
    std::string rest_text;
    proxy->serialize_status(rest_text);
    // TODO call write_output_files()
    const char *cvm_state = rest_text.c_str();
    int len = strlen(cvm_state) + 1; // need to include terminating null byte.
    fwrite(&len,sizeof(int),1,fp);
    fwrite(cvm_state,1,len,fp);
  }
}

by happy coincidence you made me realize there’s a much bigger bug in my setup: since i was using fix colvars ... seed 12345, in my production runs my 20 multiple walkers were effectively all sampling the same paths, starting from the same seed. very nice to have a 20X speedup! i suggest this seed issue should be documented in the multiple-walkers metadynamics section 6.4.7 of the colvars manual.

i always use a recent git clone of lammps develop branch:

Large-scale Atomic/Molecular Massively Parallel Simulator - 17 Apr 2024
Git info (develop / patch_17Apr2024-8-g628531dadb)

I don’t see all of your output files, but you used replicasRegistry /tmp, i.e. you specified the path of a directory instead of a file. I have not tested the code’s behavior under that condition, but for sure you will want to specify the path of a file.

The output files that you have included in the Zip file look okay to me. Can you still get the error above if you continue that run, or start fresh (now that you have specified different outputs for each replica)?

No, that refers to writing output files other than state/restart files, which normally would be written at the end of a successful run. The comment refers to whether these should be written prior to the conclusion of a run as well.

A note could be added, but please do keep in mind that the internal seed for Colvars only affects features that use random numbers, and with your configuration you are not using any.

Instead, to diversify the initial conditions of the replicas, I would use different seeds in the LAMMPS velocity command, where the atomic velocities are generated. (Currently, you use the same seed of 87287.)
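Something like the following (an untested sketch, reusing the world-style RID variable) should give each walker distinct initial velocities:

```
variable        RID world 1 2 3 4
# immediate-style $(...) evaluation gives each partition a distinct integer seed
velocity        all create 1.44 $(87287+v_RID) loop geom
```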

Giacomo

yes i restarted from scratch as follows with

replicasRegistry /home/~~~~~/lammps/colvars-lj/colvars-lj.replicas

instead of replicasRegistry /tmp.

$ rm -f *BAK *txt log.lammps* screen.* *hills *restart *state *pmf *old *~ *traj *replicas
$ mpirun -np 8 lmp_niagara -p 4x2 -nonbuf -in colvars-lj.in
$ mpirun -np 8 lmp_niagara -p 4x2 -nonbuf -in colvars-lj-restart.in
$ cd ..; zip colvars-lj.zip colvars-lj/*

but i’m still getting the same error.

colvars: Error: in the state file , the “metadynamics” block has a different replicaID (2 instead of 0).
ERROR on proc 0: Fatal error in the collective variables module.
(src/COLVARS/colvarproxy_lammps.cpp:290)
Last command: fix 2 all colvars colvars-lj.colvars output colvars-lj-${RID} seed ${RID}

here’s the latest zip file of my directory:

colvars-lj.zip (42.9 KB)

Okay, let me build that LAMMPS snapshot and test it locally.

@giacomo.fiorin thanks for trying to help, let me pitch in.

after many hours of debugging, i have a good idea about what the bug is but not a solution yet.

if i rerun colvars-lj-restart.in over and over again, i always get the same pattern of errors:

% grep Error screen.*
screen.0:colvars: Error: in the state file , the “metadynamics” block has a different replicaID (3 instead of 0).
screen.1:colvars: Error: in the state file , the “metadynamics” block has a different replicaID (3 instead of 1).
screen.2:colvars: Error: in the state file , the “metadynamics” block has a different replicaID (3 instead of 2).

if however i rerun colvars-lj.in once, then colvars-lj-restart.in multiple times, i get a new pattern of errors:

% grep Error screen.*
screen.0:colvars: Error: in the state file , the “metadynamics” block has a different replicaID (1 instead of 0).
screen.2:colvars: Error: in the state file , the “metadynamics” block has a different replicaID (1 instead of 2).
screen.3:colvars: Error: in the state file , the “metadynamics” block has a different replicaID (1 instead of 3).

therefore this indicates the bug is on the write_restart side, not read_restart, because the colvars state files for all replicas are correct:

% grep replicaID colvars-lj*colvars.state
colvars-lj-0.colvars.state: replicaID 0
colvars-lj-0.colvars.state: replicaID 0

colvars-lj-1.colvars.state: replicaID 1
colvars-lj-1.colvars.state: replicaID 1

colvars-lj-2.colvars.state: replicaID 2
colvars-lj-2.colvars.state: replicaID 2

colvars-lj-3.colvars.state: replicaID 3
colvars-lj-3.colvars.state: replicaID 3

so the replica state files are being written correctly. HOWEVER, colvars-lj-base.restart is binary, but part of it is still readable:

% less colvars-lj-base.restart
[… binary bytes …] configuration {
step 600
dt 5.000000e-03
version 2023-05-01
}
colvar {
name foo
x 6.23788303943019e-01
}
metadynamics {
configuration {
step 600
name metadynamics1
replicaID 1
}
hills_energy
grid_parameters {

see that “replicaID 1”? it matches the pattern of errors on restart: (1 instead of 0), (1 instead of 2), (1 instead of 3).

CONCLUSION: there’s a bug somewhere in how multiple replicas write to the lammps base restart file, maybe clobbering each other.

since he wrote the lammps side of the proxy, does @akohlmey have a suggestion for what i should investigate further?

There is no bug. You are using the write_restart and read_restart commands incorrectly.

You are using
write_restart colvars-lj-%.restart
But you should be using
write_restart colvars-lj-${RID}.restart

On restart you then need to move the
variable RID world 1 2 3 4
line before reading the restart, and that command needs to be
read_restart colvars-lj-${RID}.restart

To explain: embedding the ‘%’ character creates a restart where each MPI rank (of a partition) writes a separate file, and rank 0 also writes a .base.restart file. But all partitions will write to the same file names, and thus you are corrupting the restart files.

When using the world-style variable (I prefer “universe” style, where you can define more values than you need, so when you are debugging with fewer replicas you won’t get an error), each partition will write its own (single) restart file, then read it back and restore the state of the colvars fix with the correct replicaID from the binary restart file.
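Putting the two corrections together, the relevant lines of your example scripts would become something like:

```
# colvars-lj.in: one restart file per partition
variable        RID world 1 2 3 4
fix             2 all colvars colvars-lj.colvars output colvars-lj-${RID} seed ${RID}
run             100
write_restart   colvars-lj-${RID}.restart

# colvars-lj-restart.in: define RID before reading the restart
variable        RID world 1 2 3 4
read_restart    colvars-lj-${RID}.restart
fix             2 all colvars colvars-lj.colvars output colvars-lj-${RID} seed ${RID}
run             100
```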


i knew there was some clobbering going on, but it was my scripts, not the lammps code. my apologies.

my preference would be the thermo keyword part, so there’s no need to define a world or universe variable, and it’s scalable without editing the input script for the many different node setups (16x40, 32x48, 32x32, 48x64, 20x80) on the 5 clusters i have access to. i also want to measure pmf convergence rates for different configurations on a given cluster (e.g. 20x80 vs 10x160 vs 40x40 vs 80x20 …)

i tried two different ways, but i can’t access the partition information in the script. it has to be available somehow, because lammps knows how to write the screen.[0,1,2,…] and log files?

ERROR: Variable evaluation before simulation box is defined (src/variable.cpp:2309)
Last command: read_restart colvars-lj-$(part).restart

ERROR on proc 0: Substitution for illegal variable part (src/input.cpp:666)
Last command: read_restart colvars-lj-${part}.restart

for now i’ll go with:

variable replicaID universe 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 &
  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 &
  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

read_restart   colvars-lj-${replicaID}.restart

neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no

fix 1 all nve
fix 2 all colvars colvars-lj.colvars & 
    output colvars-lj-${replicaID} seed $(v_replicaID+1)

the $(v_replicaID+1) instead of ${replicaID} is needed for the seed, otherwise i get a zero-seed error:

ERROR on proc 0: Invalid seed for Park random # generator (src/random_park.cpp:34)
Last command: fix 2 all colvars colvars-lj.colvars output colvars-lj-${replicaID} seed ${replicaID}

This error message is self-explanatory.

This is bogus, since you have not defined a variable “part”.

LAMMPS doesn’t use variables to set the partition ID of output files, thus it is not subject to the restrictions on variable evaluation before the simulation box has been created. Lots of things in LAMMPS must be done after the box is defined, and very few things can be done before.


@alphataubio Did the simulation run correctly (i.e. writing state files for each replica)?