Colvars stops dumping after ~2^31 steps

I am using LAMMPS 23Jun2022 and COLVARS from 2022-05-09.

I have a simple Colvars input file for my simulation: it does ABF biasing of the distance between the centers of mass of two groups of atoms. I don’t think its details matter here, but I can post the file if you think its contents might somehow be relevant.

The LAMMPS input file is:

...
fix                     abf all colvars abf_BDT.colvars tstat fix_langevin seed 226 output abf

dump            dumpProd        all xyz 2000000 prod.xyz
thermo          2000000
thermo_modify   flush yes

restart         1000000 prod_1.restart prod_2.restart

run                     2000000000
run                     2000000000

Then the prod.log looks like:

...
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
2144000000   1.0696377      1.6489294      0.7622016
colvars: Synchronizing (emptying the buffer of) trajectory file "abf.colvars.traj".
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
colvars: Synchronizing (emptying the buffer of) trajectory file "abf.colvars.traj".
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
2146000000   1.0241619      1.6144459      1.8996601
colvars: Synchronizing (emptying the buffer of) trajectory file "abf.colvars.traj".
colvars: Saving collective variables state to "prod_1.restart.colvars.state".
2148000000   0.94262713     1.2080264      0.7647839
2150000000   1.0375809      1.3780924      0.82843873
2152000000   0.96214974     1.6101656      1.4263244
2154000000   1.0146399      1.3333802      1.1378838
2156000000   1.0618743      1.6131358      0.9176836
2158000000   1.0499825      1.3201005      0.86687496
2160000000   1.0026682      1.573913       1.094465
...

and

$ tail abf.colvars.traj 
  2147474000    4.05303602826298e+00  
  2147475000    5.64231016584322e+00  
  2147476000    1.18513042521774e+01  
  2147477000    1.68327539999665e+01  
  2147478000    1.69308585151818e+01  
  2147479000    1.50852408863132e+01  
  2147480000    1.59986091171063e+01  
  2147481000    1.42899089811339e+01  
  2147482000    6.74848594216875e+00  
  2147483000    2.68667889444293e+00  

So it looks like Colvars just “died” quietly. But I don’t know how LAMMPS works internally, and it’s hard for me to imagine part of a running program dying unless it runs in a separate thread.

My Colvars trajectory output frequency is 1000 steps, and 2147483000 < 2^31 = 2147483648 < 2147484000, so 2147483000 is the last multiple of 1000 below 2^31. The failure (most likely) happened exactly when the step counter crossed 2^31.

After finding this, I originally thought Colvars just used a regular int for the step, which would be unusual, but OK, fixable. However, they declare a dedicated type at colvarmodule.h:99 (typedef long long step_number;) and they seem to use it in all the right places. So I am not sure what the issue might be. Perhaps something non-obvious is a plain int and gets assigned the timestep, which then breaks the whole module. But I am not sure how I would look for it without compiling everything in debug mode, running for 2 billion steps (or setting the starting timestep to something like 2^31 - 10, which might be enough), and looking at what happens after it crosses 2^31.
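
To make the kind of bug I am imagining concrete: if any variable between the step counter and the output-frequency test is a plain 32-bit int, the value wraps negative at 2^31 and a modulo check stops matching. A toy sketch of that hypothetical failure mode (not actual Colvars code):

```cpp
#include <cstdio>

typedef long long step_number; // the type Colvars declares in colvarmodule.h

// Hypothetical narrowing bug: the 64-bit step is copied into a 32-bit int
// somewhere before the "is it time to write output?" test.
static bool time_to_write(step_number step, int freq)
{
    int s = static_cast<int>(step); // wraps negative once step >= 2^31
    return s > 0 && s % freq == 0;
}

int main()
{
    std::printf("%d\n", time_to_write(2147483000LL, 1000)); // 1: last good write
    std::printf("%d\n", time_to_write(2147484000LL, 1000)); // 0: silently skipped
}
```

And since 2^32 is not a multiple of 1000, steps that are multiples of 1000 wrap to values that never are, which would explain why the output stops completely rather than becoming sporadic.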

So I am now looking for a way to run with Colvars for more than 2^31 steps. I could reset the timestep back to 0 after the first run 2000000000, but this would make the log files harder to parse, so I’d prefer to avoid it if possible.
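
For concreteness, the workaround I would rather avoid looks something like this (a sketch, assuming fix colvars tolerates the counter being reset):

```
run            2000000000
reset_timestep 0             # step numbers start over, so log entries repeat
run            2000000000
```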

Do you get the same behavior with this extra simple test input deck? (It must be run with an even number of MPI processes; fastest should be with 2.)

in.colvars-bigint (320 Bytes)
bigint.colvars (307 Bytes)

Does it run/fail when you comment out the processors command and run without MPI?
Does it run/fail with the (serial) LAMMPS executable from here?

Thank you for the test cases.

In short: running both with my 2-MPI build and my serial build gives the same quiet death, but your executable goes past 2^31 without any issues.

Long: here are all the logs: repo

My full compile script for serial was:

#!/bin/bash

# build a mixed-precision version of lammps for della (cpu) using the intel package

#VERSION=29Sep2021
VERSION=stable_23Jun2022_update3
BUILD_NAME=2022_3_CV_dipole_ser_my
#wget https://github.com/lammps/lammps/archive/${VERSION}.tar.gz
#rm -rf lammps-_${VERSION}
#tar -zxvf ${VERSION}.tar.gz
cd lammps-${VERSION}
mkdir build_${BUILD_NAME}
cd build_${BUILD_NAME}

module purge
module load intel/19.1.1.217
module load intel-mpi/intel/2019.7

cmake3 -D CMAKE_INSTALL_PREFIX=$HOME/.local \
-D LAMMPS_MACHINE=della_${BUILD_NAME} \
-D ENABLE_TESTING=no \
-D BUILD_MPI=no \
-D BUILD_OMP=yes \
-D CMAKE_BUILD_TYPE=Release \
-D CMAKE_CXX_COMPILER=icpc \
-D CMAKE_CXX_FLAGS_RELEASE="-Ofast -xHost -qopenmp -DNDEBUG" \
-D PKG_MOLECULE=yes \
-D PKG_RIGID=yes \
-D PKG_KSPACE=yes -D FFT=MKL -D FFT_SINGLE=yes \
-D PKG_INTEL=yes \
-D PKG_COLVARS=yes \
-D PKG_EXTRA-PAIR=yes \
-D INTEL_ARCH=cpu -D INTEL_LRT_MODE=threads ../cmake

make -j 10
make install

The most likely explanation is that there is a bug in your version of the Colvars library or package that has since been fixed; in that case you need to upgrade your LAMMPS (and with it the Colvars) version.

The second most likely explanation is that your compiler is miscompiling LAMMPS and mine isn’t. Your version of the Intel compiler is very old in any case; more recent versions are available, at no cost too, unless you want to purchase support.

You were right: I found the 2024 version of icpx and the 29Aug2024 LAMMPS, and it now works for me.

It would be interesting to know whether it was the update of the source code or the update of the compiler that addressed the issue.

29Aug2024 compiled with icpc also works, so it looks like it was the source update rather than the compiler.

Hi, thanks for reporting. Indeed there is an issue that has been fixed since: