LAMMPS Hanging on write_data

Hi All,

I have recently been tinkering with the LAMMPS Python module and have been working on a short script that performs short NVT relaxations on a molecular crystal and automatically stops and writes a data file once the pressure has converged to within a certain threshold. See the full script below:

log loophugo.log

units           real
atom_style      full
boundary        p p p
timestep 1

bond_style      harmonic
angle_style     harmonic
dihedral_style  hybrid harmonic multi/harmonic
improper_style  harmonic
#pair_style      buck/coul/long 11.0 11.0
pair_style              hybrid/overlay table linear 2550 coul/long 15.0
kspace_style    pppm 1.0e-6
#kspace_modify   slab 3.0
special_bonds   lj/coul 0.0 0.0 1.0

read_data     	tintest.data

pair_coeff 1 1 table HMX_X6table.txt HH
pair_coeff 1 2 table HMX_X6table.txt HN1
pair_coeff 1 3 table HMX_X6table.txt HN2
pair_coeff 1 4 table HMX_X6table.txt HO
pair_coeff 1 5 table HMX_X6table.txt HC
pair_coeff 2 2 table HMX_X6table.txt N1N1
pair_coeff 2 3 table HMX_X6table.txt N1N2
pair_coeff 2 4 table HMX_X6table.txt N1O
pair_coeff 2 5 table HMX_X6table.txt N1C
pair_coeff 3 3 table HMX_X6table.txt N2N2
pair_coeff 3 4 table HMX_X6table.txt N2O
pair_coeff 3 5 table HMX_X6table.txt N2C
pair_coeff 4 4 table HMX_X6table.txt OO
pair_coeff 4 5 table HMX_X6table.txt OC
pair_coeff 5 5 table HMX_X6table.txt CC

pair_coeff * * coul/long

dihedral_coeff          1 multi/harmonic 8.452 0.000 -5.28998 0.00 -3.16003
dihedral_coeff          2 harmonic -0.08 -1 3
dihedral_coeff          3 multi/harmonic 0.095 -1.485 1.61000 -0.22 0.0

neighbor          2.0 bin
#neigh_modify  every 1 delay 5
neigh_modify    delay 0 every 1 check yes one 10000 page 100000

python rbreak input 2 SELF v_b return v_switch format pii here """
def rbreak(handle, b):
	#import scipy as sp
	import math as m
	from lammps import lammps
	lmp = lammps(ptr=handle)
	rank = lmp.extract_setting('world_rank')
	if rank == 0:
		print('1a')
		import os
		print('1b')
		import lammps
		print('1c')

		from lammps import formats
		print('1d')
    
		try:
			log = formats.LogFile("loophugo.log")
		except:
			print("can't find logfile, this is the CWD: "+str(os.getcwd()))
        
		print(2)

		P = log.runs[b-1]["Pxx"]

		print("Iteration Number: " + str(b))
    
		#try:
			#filt = sp.signal.savgol_filter(P,15,2)
		#except:
		#	print("Need more points for each loop iteration, or smaller SAVGOL window length")
		print(4)
        
		#if sum(filt[len(filt)-6:len(filt)-1])/10 < 1:
		if sum(P[len(P)-6:len(P)-1])/5 < 10:
			return int(1)
		else:
			print('Avg P: ' + str(sum(P[len(P)-6:len(P)-1])/5))
			return int(0)
"""

variable rand equal floor(random(1,1000000,3259))
variable switch python rbreak

thermo 10
thermo_style custom step vol temp press pe ke etotal epair evdwl ecoul elong emol ebond eangle edihed eimp enthalpy density lx ly lz pxx pyy pzz pxy pxz pyz
thermo_modify line one

fix  1 all nvt temp 300.0 300.0 100.0

velocity all create 300.0 ${rand} dist gaussian mom yes rot yes

run 60

write_data ./test.data

label thermo_loop
variable b loop 5000

thermo_modify  line one flush yes

velocity all create 300.0 ${rand} dist gaussian mom yes rot yes
# reassign velocities every N steps (frequently)

#dump 1 all custom 100 dump.* id type xs ys zs

if "${switch} == 1" then "jump SELF break"
else "run 60"

next b
jump SELF thermo_loop

label break

variable b delete

write_data ./hugo_relaxed_done.data

run 10 pre no

quit

This code bears a lot of scars from my trying to debug it (such as the run 10 pre no command at the bottom). The issue I am having is that the script appears to run fine and gives every indication that the Python function is executing correctly, but LAMMPS hangs indefinitely when trying to write the final ./hugo_relaxed_done.data (see the record file at the end of this post). I am having trouble understanding what the problem is. I do relaxations like this all the time (sans the fancy Python machinery) and haven't run into this issue before. I am fairly confident that the Python function is causing some problem with the write_data command, but the connection between the two seems weak, since the Python function only defines a binary variable used to break out of the relaxation loop. Does anyone with more experience with the Python module have any clues about what might be happening here?

This is more of a proof-of-concept, as there are clearly simpler ways to relax a large system that don't involve such complications, but I am hoping to leverage the Python interconnectivity in more complicated calculations in the future, so it would be great to get to the bottom of this.

Record File:

LAMMPS (23 Jun 2022 - Update 2)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (7.8850446 -1.2069369 0.74172168) to (3002.1842 309.90694 285.45828)
  32 by 4 by 2 MPI processor grid
  reading atoms ...
  2462523 atoms
  reading velocities ...
  2462523 velocities
  scanning bonds ...
  3 = max bonds/atom
  scanning angles ...
  6 = max angles/atom
  scanning dihedrals ...
  4 = max dihedrals/atom
  scanning impropers ...
  1 = max impropers/atom
  reading bonds ...
  2462523 bonds
  reading angles ...
  4221468 angles
  reading dihedrals ...
  3517890 dihedrals
  reading impropers ...
  703578 impropers
Finding 1-2 1-3 1-4 neighbors ...
  special bond factors lj:    0        0        1       
  special bond factors coul:  0        0        1       
     4 = max # of 1-2 neighbors
     8 = max # of 1-3 neighbors
    11 = max # of special neighbors
  special bonds CPU = 3.255 seconds
  read_data CPU = 55.141 seconds
WARNING: 1 of 2550 force values in table HN1 are inconsistent with -dE/dr.
WARNING:  Should only be flagged at inflection points (src/pair_table.cpp:463)
WARNING: 1 of 2550 force values in table HN2 are inconsistent with -dE/dr.
WARNING:  Should only be flagged at inflection points (src/pair_table.cpp:463)
WARNING: 1 of 2550 force values in table HC are inconsistent with -dE/dr.
WARNING:  Should only be flagged at inflection points (src/pair_table.cpp:463)
WARNING: 1 of 2550 force values in table OC are inconsistent with -dE/dr.
WARNING:  Should only be flagged at inflection points (src/pair_table.cpp:463)
WARNING: 1 of 2550 force values in table CC are inconsistent with -dE/dr.
WARNING:  Should only be flagged at inflection points (src/pair_table.cpp:463)
PPPM initialization ...
  using 12-bit tables for long-range coulomb (src/kspace.cpp:342)
  G vector (1/distance) = 0.2053412
  grid = 1250 225 216
  stencil order = 5
  estimated absolute RMS force accuracy = 0.00040335986
  estimated relative force accuracy = 1.2147062e-06
  using double precision KISS FFT
  3d grid and FFT values/proc = 326025 262500
Generated 0 of 10 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
  update every 1 steps, delay 0 steps, check yes
  max neighbors/atom: 10000, page size: 100000
  master list distance cutoff = 17
  ghost atom cutoff = 17
  binsize = 8.5, bins = 353 37 34
  2 neighbor lists, perpetual/occasional/extra = 2 0 0
  (1) pair table, perpetual
      attributes: half, newton on
      pair build: half/bin/newton
      stencil: half/bin/3d
      bin: standard
  (2) pair coul/long, perpetual, copy from (1)
      attributes: half, newton on
      pair build: copy
      stencil: none
      bin: none
Setting up Verlet run ...
  Unit style    : real
  Current step  : 0
  Time step     : 1
Per MPI rank memory allocation (min/avg/max) = 44.44 | 143.5 | 1173 Mbytes
   Step         Volume          Temp          Press          PotEng         KinEng         TotEng         E_pair         E_vdwl         E_coul         E_long         E_mol          E_bond        E_angle        E_dihed        E_impro        Enthalpy       Density           Lx             Ly             Lz            Pxx            Pyy            Pzz            Pxy            Pxz            Pyz      
         0   2.6523283e+08  300            2.6378495      36946112       2202093.1      39148205       24758992      -1612261.7      48038374      -21667120       12187119       1360311.1      9412618.8      1140727.5      273462.1       39158408       0.16306379     2994.2991      311.11387      284.71656      6.2742822     -1.788931       3.4281972     -6.1787536      2.3116831     -4.8885717    
        10   2.6523283e+08  299.83626      9.98711        36947763       2200891.2      39148654       24756707      -1612208.7      48036376      -21667461       12191056       1367223.6      9411297.1      1140045.1      272490.62      39187286       0.16306379     2994.2991      311.11387      284.71656      13.895924      4.5831824      11.482224     -2.2163294      3.6003759     -0.78603315   
        20   2.6523283e+08  299.35759     -11.865279      36951662       2197377.6      39149039       24756739      -1611055.3      48035291      -21667497       12194923       1372391.6      9410431.9      1139068.8      273030.68      39103143       0.16306379     2994.2991      311.11387      284.71656     -19.61508      -9.0481286     -6.9326276     -11.11387       3.367963      -1.0031891    
        30   2.6523283e+08  300.1051       6.995051       36945890       2202864.6      39148755       24756412      -1612260.1      48036355      -21667683       12189478       1365388.9      9413046.2      1138014.1      273028.84      39175813       0.16306379     2994.2991      311.11387      284.71656      0.68808575     12.766271      7.5307965      5.2972384     -1.1891102      9.0455208    
        40   2.6523283e+08  300.64744      11.652021      36941819       2206845.5      39148664       24758447      -1612456.9      48038482      -21667578       12183371       1359611.1      9414344.2      1137502.1      271914.05      39193736       0.16306379     2994.2991      311.11387      284.71656      15.850939      19.198918     -0.093792836   -1.8595567     -1.6401838     -4.1642158    
        50   2.6523283e+08  299.9803      -0.51148605     36947190       2201948.5      39149138       24755841      -1612450.1      48035929      -21667638       12191349       1368805.3      9411125.5      1138873.2      272544.65      39147160       0.16306379     2994.2991      311.11387      284.71656      6.6752573      3.5086889     -11.718404      0.94091964    -3.6962051     -3.4800126    
        60   2.6523283e+08  299.59605     -4.6459544      36950281       2199128        39149409       24753148      -1613406.4      48034191      -21667636       12197132       1369112.1      9415164        1140071.1      272785.01      39131437       0.16306379     2994.2991      311.11387      284.71656     -1.2714564     -3.1252101     -9.5411966      0.50867841    -7.2430187     -1.0823207    
Loop time of 277.024 on 256 procs for 60 steps with 2462523 atoms

Performance: 0.019 ns/day, 1282.517 hours/ns, 0.217 timesteps/s
98.5% CPU use with 256 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.00054973 | 18.851     | 226        |1338.0 |  6.80
Bond    | 0.00027961 | 0.1617     | 1.9056     | 122.2 |  0.06
Kspace  | 40.228     | 248.53     | 267.65     | 370.8 | 89.72
Neigh   | 6.3795     | 6.5304     | 6.5617     |   1.6 |  2.36
Comm    | 0.00031963 | 0.20402    | 1.9467     | 102.8 |  0.07
Output  | 0.68858    | 0.70533    | 0.72528    |   1.0 |  0.25
Modify  | 0.27414    | 1.6694     | 2.0285     |  39.4 |  0.60
Other   |            | 0.3691     |            |       |  0.13

Nlocal:        9619.23 ave      106790 max           0 min
Histogram: 232 0 0 0 0 0 0 0 8 16
Nghost:        13308.5 ave      151654 max           0 min
Histogram: 224 0 8 0 0 0 8 0 8 8
Neighs:    9.88618e+06 ave 1.13329e+08 max           0 min
Histogram: 232 0 0 0 0 0 0 8 0 16

Total # of neighbors = 2.5308608e+09
Ave neighs/atom = 1027.7511
Ave special neighs/atom = 5.4285714
Neighbor list builds = 2
Dangerous builds = 0
System init for write_data ...
PPPM initialization ...
  using 12-bit tables for long-range coulomb (src/kspace.cpp:342)
  G vector (1/distance) = 0.2053412
  grid = 1250 225 216
  stencil order = 5
  estimated absolute RMS force accuracy = 0.00040335986
  estimated relative force accuracy = 1.2147062e-06
  using double precision KISS FFT
  3d grid and FFT values/proc = 326025 262500
Generated 0 of 10 mixed pair_coeff terms from geometric mixing rule
1a
1b
1c
1d
2
Iteration Number: 1
4
System init for write_data ...
PPPM initialization ...

It is nearly impossible to debug problematic code just by looking at it.

There are three things that come to my mind right now:

  • It is a bad idea to try to read and parse a file that is still being written to: output to files is usually block buffered, and when running in parallel on a cluster with a parallel/networked file system, the state of the file contents seen on each node may differ until the file is closed.
  • Your Python code only executes on a single rank; however, the result must be globally visible and consistent on all MPI ranks.
  • What you are trying to do can be done much more effectively with regular equal-style variables and fix halt (see the sketch right after this list).
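
For the third point, here is a minimal sketch of the fix halt route (the fix and variable IDs, the 50-step averaging window, the Pxx < 10 threshold, and the run length are placeholders that merely mirror the criterion in your loop):

# average Pxx over 50-step windows, then stop the run on all ranks at once
variable   pxxnow equal pxx
fix        avp all ave/time 1 50 50 v_pxxnow
variable   pxxavg equal f_avp
fix        stopit all halt 50 v_pxxavg < 10.0 error continue
run        300000
write_data ./hugo_relaxed_done.data

With error continue, the input script simply proceeds past the run command once the condition is met, so the subsequent write_data is reached by all ranks together.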

The second point may explain a stalling calculation: LAMMPS may be stuck on a collective MPI call because different ranks selected different branches, so some MPI ranks never reach the collective operation as expected.
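
To illustrate the second point, here is a minimal sketch of how the embedded function could be made rank-consistent. It assumes mpi4py is available and that LAMMPS is running on MPI.COMM_WORLD (i.e. no partitions), and it does not address the first point about parsing a file that is still being written:

def rbreak(handle, b):
	# only rank 0 parses the log file, but its decision is broadcast so every
	# rank returns the same value and no rank diverges at the "if" command
	from lammps import formats
	from mpi4py import MPI
	comm = MPI.COMM_WORLD
	flag = 0
	if comm.Get_rank() == 0:
		P = formats.LogFile("loophugo.log").runs[b - 1]["Pxx"]
		flag = 1 if sum(P[-6:-1]) / 5 < 10 else 0
	# handle is unused here; it is kept so the signature still matches "input 2 SELF v_b"
	return int(comm.bcast(flag, root=0))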

However, if you want some serious debugging, you will need to reduce your input deck to one with as little code and as few atoms as possible that still reproduces the issue. Then one can apply a variety of debugging techniques. For a stalling MPI application, it is often very useful to attach a debugger to all of the running processes (e.g. with gdb -p <PID>) and then get a stack trace to identify where in the source each MPI rank currently is. That could quickly confirm my hypothesis above about the likely cause.

Even if you want to access the pressure data from within the LAMMPS python module, the proper way would be to use the “extract_compute()” method to retrieve the current value(s) from the compute pressure instance with the ID “thermo_press”. That would take care of the first possible issue I mentioned.
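
As a rough sketch of that route, a hypothetical helper like the one below (only "thermo_press" and the LAMMPS library calls are real; the function name is made up) reads Pxx straight from the compute, identically on every rank and with no file I/O involved:

from lammps import lammps, LMP_STYLE_GLOBAL, LMP_TYPE_VECTOR

def current_pxx(handle):
	# global vector of the "thermo_press" compute: pxx, pyy, pzz, pxy, pxz, pyz
	lmp = lammps(ptr=handle)
	ptensor = lmp.extract_compute("thermo_press", LMP_STYLE_GLOBAL, LMP_TYPE_VECTOR)
	return ptensor[0]

Since this only gives the instantaneous value, a running average like the one in your loop could be accumulated with fix ave/time and read back the same way through extract_fix().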
