Colvars stops dumping after ~2^31 steps

I am using LAMMPS 23Jun2022 and COLVARS from 2022-05-09.

I have a simple Colvars input file for my simulation: it does ABF biasing of the distance between the centers of mass of two groups of atoms. I don’t think its details matter here, but I can post the file if you think its contents might somehow be relevant.

The LAMMPS input file is:

...
fix                     abf all colvars abf_BDT.colvars tstat fix_langevin seed 226 output abf

dump            dumpProd        all xyz 2000000 prod.xyz
thermo          2000000
thermo_modify   flush yes

restart         1000000 prod_1.restart prod_2.restart

run                     2000000000
run                     2000000000

Then the prod.log looks like:

...
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
2144000000   1.0696377      1.6489294      0.7622016
colvars: Synchronizing (emptying the buffer of) trajectory file "abf.colvars.traj".
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
colvars: Synchronizing (emptying the buffer of) trajectory file "abf.colvars.traj".
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
2146000000   1.0241619      1.6144459      1.8996601
colvars: Synchronizing (emptying the buffer of) trajectory file "abf.colvars.traj".
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
2148000000   0.94262713     1.2080264      0.7647839
2150000000   1.0375809      1.3780924      0.82843873
2152000000   0.96214974     1.6101656      1.4263244
2154000000   1.0146399      1.3333802      1.1378838
2156000000   1.0618743      1.6131358      0.9176836
2158000000   1.0499825      1.3201005      0.86687496
2160000000   1.0026682      1.573913       1.094465
...

and

$ tail abf.colvars.traj 
  2147474000    4.05303602826298e+00  
  2147475000    5.64231016584322e+00  
  2147476000    1.18513042521774e+01  
  2147477000    1.68327539999665e+01  
  2147478000    1.69308585151818e+01  
  2147479000    1.50852408863132e+01  
  2147480000    1.59986091171063e+01  
  2147481000    1.42899089811339e+01  
  2147482000    6.74848594216875e+00  
  2147483000    2.68667889444293e+00  

So it looks like Colvars just “died” quietly. But I don’t know how LAMMPS works internally, and it’s hard for me to imagine part of a running program dying unless it runs in a separate thread.

My Colvars trajectory output frequency is 1000 steps, and 2147483000 < 2^31 = 2147483648 < 2147484000, so 2147483000 is the last multiple of 1000 below 2^31. The failure (most likely) happened exactly when the step counter crossed 2^31.

After finding this, I originally thought Colvars just used a regular int for the step, which would be unusual, but OK, fixable. However, they declare a dedicated type at colvarmodule.h:99 (typedef long long step_number;) and they seem to use it in all the right places. So I am not sure what the issue might be. Perhaps something non-obvious is a plain int and gets assigned the timestep, which then breaks the whole module. But I am not sure how I would look for it without compiling everything in debug mode, running for 2 billion steps (or setting the starting timestep to something like 2^31 - 10, which might be enough), and looking at what happens after it crosses 2^31.
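
To make the kind of bug I am imagining concrete: if any variable between the step counter and the output-frequency test is a plain 32-bit int, the value wraps negative at 2^31 and a modulo check stops matching. A toy sketch of that hypothetical failure mode (not actual Colvars code):

```cpp
#include <cstdio>

typedef long long step_number; // the type Colvars declares in colvarmodule.h

// Hypothetical narrowing bug: the 64-bit step is copied into a 32-bit int
// somewhere before the "is it time to write output?" test.
static bool time_to_write(step_number step, int freq)
{
    int s = static_cast<int>(step); // wraps negative once step >= 2^31
    return s > 0 && s % freq == 0;
}

int main()
{
    std::printf("%d\n", time_to_write(2147483000LL, 1000)); // 1: last good write
    std::printf("%d\n", time_to_write(2147484000LL, 1000)); // 0: silently skipped
}
```

And since 2^32 is not a multiple of 1000, steps that are multiples of 1000 wrap to values that never are, which would explain why the output stops completely rather than becoming sporadic.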

So I am now looking for a way to run with Colvars for more than 2^31 steps. I could reset the timestep back to 0 after the first run 2000000000, but this would make the log files harder to parse, so I’d prefer to avoid it if possible.
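
For concreteness, the workaround I would rather avoid looks something like this (a sketch, assuming fix colvars tolerates the counter being reset):

```
run            2000000000
reset_timestep 0             # step numbers start over, so log entries repeat
run            2000000000
```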

Do you get the same behavior with this extra simple test input deck? (It must be run with an even number of MPI processes; fastest should be with 2.)

in.colvars-bigint (320 Bytes)
bigint.colvars (307 Bytes)

Does it run/fail when you comment out the processors command and run without MPI?
Does it run/fail with the (serial) LAMMPS executable from here?

Thank you for the test cases.

In short: running both with my 2-MPI build and my serial build gives the same quiet death, but your executable goes past 2^31 without any issues.

Long: here are all the logs: repo

My full compile script for serial was:

#!/bin/bash

# build a mixed-precision version of lammps for della (cpu) using the intel package

#VERSION=29Sep2021
VERSION=stable_23Jun2022_update3
BUILD_NAME=2022_3_CV_dipole_ser_my
#wget https://github.com/lammps/lammps/archive/${VERSION}.tar.gz
#rm -rf lammps-_${VERSION}
#tar -zxvf ${VERSION}.tar.gz
cd lammps-${VERSION}
mkdir build_${BUILD_NAME}
cd build_${BUILD_NAME}

module purge
module load intel/19.1.1.217
module load intel-mpi/intel/2019.7

cmake3 -D CMAKE_INSTALL_PREFIX=$HOME/.local \
-D LAMMPS_MACHINE=della_${BUILD_NAME} \
-D ENABLE_TESTING=no \
-D BUILD_MPI=no \
-D BUILD_OMP=yes \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_CXX_COMPILER=icpc \
-D CMAKE_CXX_FLAGS_RELEASE="-Ofast -xHost -qopenmp -DNDEBUG" \
-D PKG_MOLECULE=yes \
-D PKG_RIGID=yes \
-D PKG_KSPACE=yes -D FFT=MKL -D FFT_SINGLE=yes \
-D PKG_INTEL=yes \
-D PKG_COLVARS=yes \
-D PKG_EXTRA-PAIR=yes \
-D INTEL_ARCH=cpu -D INTEL_LRT_MODE=threads ../cmake

make -j 10
make install

The most likely explanation is that there is a bug in your version of the Colvars library or package that has since been fixed; in that case you need to upgrade your LAMMPS (and with it the Colvars) version.

The second most likely explanation is that your compiler is miscompiling LAMMPS and mine isn’t. Your version of the Intel compiler is very old in any case; more recent versions are available, at no cost too, unless you want to purchase support.

You were right: I found the 2024 version of icpx and the 29Aug2024 LAMMPS, and it now works for me.

It would be interesting to know whether it was the update of the source code or the update of the compiler that addressed the issue.

29Aug2024 compiled with icpc also works, so it looks like it was the source update rather than the compiler.

Hi, thanks for reporting. Indeed there is an issue that has been fixed since: