[lammps-users] Colvars simulations not terminating

Hello,

I am having trouble getting some of my LAMMPS jobs to complete when I run biased simulations with Colvars. The MD simulations themselves finish in a timely manner, but for some reason the jobs do not terminate. Instead, they keep running until the wall time limit and are then killed. Consequently, I do not get any output or configuration files. All I receive is the output file from my supercomputer (.o file), in which the thermo_style parameters (timestep, temperature, etc., whatever I request) are printed. This is how I know that the simulations finished and that the allotted wall time was adequate. However, I do not receive the dump configurations or the trajectory files.

Surprisingly, this issue occurs in only ~15% of my Colvars job submissions, i.e., if I run the same job over and over again, it finishes successfully ~85% of the time and fails to terminate ~15% of the time. Also, the issue is not specific to any particular simulation system. I have seen it in different systems with different collective variables, e.g. distanceZ and coordNum. Kindly find a sample unfinished job of both kinds under the following link.

sample_unfinished_jobs

I tried different versions of LAMMPS from 2018 through 2020 and faced this problem in all of them. I do not face this issue in straightforward MD simulations, i.e. simulations not using Colvars. I have asked my supercomputer helpdesk about possible installation bugs, but they have checked and verified that there is no such fault in their LAMMPS software deployment.

Any help is much appreciated!

Thanks,

Himanshu

Himanshu,

what would be really helpful to track this down (since this seems very unlikely to be easily reproducible) is if you could, possibly with the help of your helpdesk folks, wait for such a stalling job, then log into the compute node and note the process IDs of the LAMMPS processes. Then use the GDB debugger to attach to a running process with gdb -p ### (you should try multiple, but definitely the one with the lowest PID). Once the debugger is attached, stop the job with CTRL-C, get a “stack trace” using the “where” command, and copy the text printed to the screen (all of it) so that we get a hint where things are stuck.
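If attaching by hand to several processes is tedious, the steps above could be scripted roughly as follows. This is only a sketch under some assumptions: it has to run on the compute node itself, as the user who owns the job, and you have to supply the PIDs yourself (for example from ps or top). The function names are made up for illustration. It relies on gdb's -batch and -ex options to run “where” non-interactively; gdb stops the process when it attaches, so no CTRL-C is needed in this mode, and the process is detached again when gdb exits.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to `pid`, runs "where", and exits.

    -p     attach to the running process with this PID
    -batch run the given commands non-interactively, then exit (detaching)
    -ex    execute one gdb command, here "where" to print the stack trace
    """
    return ["gdb", "-p", str(pid), "-batch", "-ex", "where"]


def collect_backtraces(pids):
    """Run gdb on each PID, lowest PID first, and return {pid: captured text}.

    Both stdout and stderr are kept, since gdb prints some of its
    messages (e.g. permission errors when attaching) to stderr.
    """
    traces = {}
    for pid in sorted(pids):
        result = subprocess.run(
            gdb_backtrace_command(pid),
            capture_output=True,
            text=True,
        )
        traces[pid] = result.stdout + result.stderr
    return traces


# Example use (the PIDs are placeholders, replace them with the real ones):
#
#     for pid, text in collect_backtraces([12345, 12346]).items():
#         print(f"===== PID {pid} =====")
#         print(text)
```

Please send the complete text that is printed for each PID. If the system restricts attaching to running processes, the helpdesk may need to run this for you.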

thanks,
axel.
