Dear developers
I am using LAMMPS version 27 June 2024 on a 56-core Intel Gold server.
I am using the ReaxFF force field with the fix reaxff/species command:
fix 2 all reaxff/species 2 6 1000 ${rad}_output${T}K/species_${myfile}_${seed1} element C Cl H N O S
I am running a LAMMPS script with the partition flag -p 50x1,
so 50 species_ files are successfully created in the target folder.
But some of them have a formatting problem that appears randomly:
the "# Timestep" header line is not always written at the beginning of a line.
Below is part of a faulty species_ file:
# Timestep No_Moles No_Specs C22H23O9N2 H2O H3O2 HO O
1000 58 5 1 41 7 8 1
# Timestep No_Moles No_Specs C22H23O10N2 H2O HO H3O2 H4O2
2000 57 5 1 40 9 6 1
# Timestep No_Moles No_Specs C21H19O9N2 CH3O H4O2 H2O HO H5O3 H3O2
3000 54 7 1 1 1 # Timestep No_Moles No_Specs C22H22O9N2 H2O H3O2 HO
4000 56 4 1 39 9 7
# Timestep No_Moles No_Specs C22H24O12N2 H2O H5O3 HO H3O2 H
5000 55 6 1 40 2 7 4 1
# Timestep No_Moles No_Specs C22H22O11N2 H2O H3O2 H3O3 H HO
6000 56 6 1 43 6 1 1 4
# Timestep No_Moles No_Specs C22H22O11N2 H2O H3O H3O2 H2O2 H HO
7000 56 7 1 40 1 7 1 1 5
# Timestep No_Moles No_Specs C22H22O11N2 H2O H3O2 HO H
8000 55 5 1 40 9 4 1
# Timestep No_Moles No_Specs C22H20O11N2 H2O H3O2 HO
9000 58 4 1 47 5 5
As you can see, in this case it happens at timestep 3000. At the end of that line, the next "# Timestep" header follows directly instead of starting at the beginning of the next line, as it does for the other header/data pairs.
This is a problem when I read all the generated files for post-processing. My Python code, which identifies the generated species and averages the recorded values at each timestep, fails to read the faulty files.
Do you have any idea about the origin of this random error?
Thanks for your help.
Best regards
Pascal
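The failure described above, a "# Timestep" header fused onto the end of a data line, can be detected before any averaging is attempted. The following is a minimal Python sketch, not the poster's actual post-processing code. The function name check_species_text is hypothetical, and the sketch assumes only the layout visible in the excerpt: one header line starting with "#", followed by one data line with the same number of columns.

```python
def check_species_text(text):
    """Return (line_number, reason) tuples for lines that break the
    header/data pairing of a fix reaxff/species output file."""
    problems = []
    header = None  # column names of the most recent header line
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # Regular header line: remember its columns for the next data line.
            header = stripped.lstrip("#").split()
            continue
        if "#" in stripped:
            # A header was appended to a data line instead of starting a new one.
            problems.append((lineno, "header fused into a data line"))
            # Keep the fused header so the following data line can still be checked.
            header = stripped.split("#", 1)[1].split()
            continue
        if header is None:
            problems.append((lineno, "data line without a preceding header"))
            continue
        values = stripped.split()
        if len(values) != len(header):
            problems.append(
                (lineno, f"expected {len(header)} values, found {len(values)}")
            )
        header = None  # each header is expected to describe exactly one data line
    return problems
```

Running such a check over every species_ file in the output folder and listing the files with a non-empty result would single out the corrupted ones before they reach the averaging step.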
Some questions:
- What platform are you running on?
- Does the same effect happen with fewer replicas? If yes, how few?
- Does the same effect happen on a different platform?
- Do you observe the same issue if you modify the input for the RDX or TATB example from the LAMMPS distribution accordingly?
If you can reproduce it with RDX/TATB, then please share that input deck. Otherwise, provide the input deck you are using.
At the moment, I can only speculate about the reason for the difference.
Could you also please attach a compressed version of a corrupted file? Please compress it with zip or gzip on the machine where you ran the simulation, and then transfer/attach the file.
Dear Axel
Thanks a lot for your advice.
I am running on CentOS 7.9 (updated a few days ago) with gcc 11.
I have checked the 50 species files (species_doxycycline_number) and the corresponding log files in the attached archive.
Of the 50 species and log files generated, 11 are corrupted: those with 258728, 336234, 373869, 405468, 475319, 507009, 595634, 613828, 743873, 764643, or 773440 as the number in the file name.
I also checked that the same problem appears in the corresponding log files (doxycycline_number.log). In both cases, number refers to a seed value used in the script, which makes it possible to match each log file with its species file.
The problems are: some output values are not printed in the species and log files, and some line breaks are missing.
This suggests a problem with the computer. I should note that it does not occur on my Windows computer running with -p 10x1; I cannot run 50x1 on it. I will submit the job on an AMD Genoa machine running Debian Linux for comparison. But first I will redo the calculation to check whether the errors again occur randomly.
Thanks a lot in advance
Pascal
HO_output1000K.zip (262.1 KB)
The corruption that you see in the log files is a strong hint that you had two processes writing to the same file. This could also explain the unexpected behavior of fix reaxff/species output files.
Thanks a lot Axel
I will check. I reran the calculation and found the errors at the same places. I also ran it without Kokkos enabled, and the error occurred, but at a different timestep. I ran it remotely with and without nohup to check whether running in the background plays a role. The same error occurred.
I will check whether my script could allow such behaviour.
Below are the commands:
# Organochloride advanced oxidation
# This version executes 50 runs with different initial conditions for one molecule, one radical and one temperature
# mpirun -np 50 lmp -var myfile doxycycline -var rad HO -var T 1000 -p 50x1 -k on -sf kk -in Organochloride_HO-H2O.lammps
# Each job runs on a single mpi task
#
# Settings
echo both
units real # t = fs; L = A; v = A/fs; E = kcal/mol
dimension 3
boundary p p p
atom_style charge
#
# Input variable definition
# Different initial conditions : here 50 different sets of seeds for molecules location and orientations, define 50 runs.
variable seed1 world 457297 679560 464378 533583 683578 717732 546440 392254 88993 57341 214814 37932 738574 747376 743873 12109 855462 585730 361279 477688 674176 475319 481690 562015 531934 732934 191346 456669 147216 447568 595634 336234 773440 258728 373869 405468 507009 764643 613828 47568 595634 336234 773440 258728 373869 405468 507009 764643 613828 901033
variable seed2 world 648688 779971 476638 393257 695979 17438 679636 718766 303177 16064 598231 696570 443079 106265 642989 94057 450420 569829 228088 61003 419723 975553 545607 848464 112307 207999 402768 82910 928663 559699 138609 393002 899716 454217 845918 704536 845474 995786 20591 414227 270766 268298 881858 471881 360223 587249 872812 760449 88040 574702
variable seed3 world 966749 361726 692867 690610 878076 192171 976121 872511 867498 669079 593461 252681 203555 22811 144885 457195 469102 540218 333279 385879 615085 345048 615618 114381 641393 807834 402028 773040 577949 185725 389609 843356 87463 454757 164797 290757 800172 344521 491688 191036 625731 350407 469155 846875 55858 256163 85875 204857 778405 563375
variable R string rad
variable file string myfile
variable temperature string T
shell mkdir ${rad}_output${T}K
# Log file
log ${rad}_output${T}K/${myfile}_${seed1}.log
# molecules in the box
molecule radical ${rad}
molecule water H2O
molecule organochlo ${myfile}
# Simulation box
region mybox block 0 15 0 15 0 15
create_box 6 mybox
create_atoms 0 single 7.5 7.5 7.5 mol organochlo ${seed2}
create_atoms 0 random 45 ${seed2} mybox mol water ${seed3} overlap 2.0 maxtry 200
create_atoms 0 random 20 ${seed3} mybox mol radical ${seed1} overlap 2.0 maxtry 200
....
# identify species created/destroyed
fix 2 all reaxff/species 2 6 1000 ${rad}_output${T}K/species_${myfile}_${seed1} element C Cl H N O S
Thanks again for your help
Best.
Pascal
You are using these seed1 values in the names of both the species file and the log file. But some of those seeds are repeated, so there are only 41 unique numbers instead of 50. Thus some simulations use the same seed1 value and write to the same output files. There should therefore be only 41 files of each kind.
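The repetition can be confirmed with a few lines of Python. This is only an illustrative sketch: the seed values are copied from the seed1 line of the script posted above, and the helper name find_duplicates is made up for this example.

```python
from collections import Counter

# Values copied from the "variable seed1 world ..." line of the posted script.
SEED1 = [int(s) for s in """
457297 679560 464378 533583 683578 717732 546440 392254 88993 57341
214814 37932 738574 747376 743873 12109 855462 585730 361279 477688
674176 475319 481690 562015 531934 732934 191346 456669 147216 447568
595634 336234 773440 258728 373869 405468 507009 764643 613828 47568
595634 336234 773440 258728 373869 405468 507009 764643 613828 901033
""".split()]


def find_duplicates(values):
    """Return the sorted list of values that occur more than once."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(f"{len(SEED1)} values, {len(set(SEED1))} unique")  # 50 values, 41 unique
    print("repeated:", find_duplicates(SEED1))
```

Every partition whose seed1 appears in the repeated list shares its species and log file names with another partition, which is consistent with two processes writing to the same file.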
Thanks a lot Axel
I had not thought of such a problem; I was sure that my Python code, which generates these seeds randomly, could not generate the same seed more than once.
I will modify it to avoid this problem.
Thanks again
Best
Pascal
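One way to rule out repeated seeds is to draw them without replacement. The sketch below is an assumption about how such a generator could be written, not the original Python code, which was not posted. The function names unique_seeds and world_variable and the range 1 to 999999 are chosen for illustration only. It relies on random.sample, which never returns the same element of a range twice.

```python
import random


def unique_seeds(count, low=1, high=999_999, rng=None):
    """Draw `count` distinct positive integers from [low, high].

    random.sample picks without replacement, so no value can repeat.
    It raises ValueError if the range holds fewer than `count` values.
    """
    if rng is None:
        rng = random.Random()
    return rng.sample(range(low, high + 1), count)


def world_variable(name, seeds):
    """Format seeds as a LAMMPS world-style variable line (one value per partition)."""
    return f"variable {name} world " + " ".join(str(s) for s in seeds)


if __name__ == "__main__":
    seeds = unique_seeds(50)
    assert len(set(seeds)) == len(seeds)  # guard against any repetition
    print(world_variable("seed1", seeds))
```

Only seed1 appears in the output file names of the posted script, so that is the list that must be free of repeats. An alternative design would be to build the file names from the partition index instead of the seed, which would keep the names distinct even if two seeds happened to coincide.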