BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = RANK 9 PID 37906 RUNNING AT comp-hc-0001 = KILLED BY SIGNAL: 9

Hello,

I am trying to set up an ion irradiation run on my university's supercomputer. I started with a W supercell of 7685 atoms. I first ran an energy minimization, then used its output to thermalize the system at 300 K, and then I wanted to open the system along the z direction. My supervisor asked me to extend the z boundaries by -5 at zlo and +30 at zhi so that we can shoot trajectories from above the surface atoms. The problem is that when I open the system and relax it with NVE, the simulation crashes with what looks like a memory issue. While troubleshooting, I found that the problem is with the "zlo" boundary: with zlo shifted by -1.5 the relaxation completed, but every value I tried from -2 to -5 crashed.

Relaxation file:

# Energy: kcal/mol=0.043eV; Distance: Angstrom; Mass: g/mole=a.m.u.; Time: fs.

units real
atom_style charge
boundary p p f
############################################################
#BCC structure
read_data thermal.dat
##########################################################
pair_style reax/c NULL checkqeq no
pair_coeff * * ./ffield W
#fix 1 all qeq/reax 1 0.0 10.0 1.0e-6 reax/c # for the charging info
##########################################################

# Output Configuration
# Compute the energy per atom
# Output x, y, z of atom LAMMPS standard format

#thermo_modify lost ignore flush yes
dump 2 all xyz 200 w300r.xyz
#dump 10 all xyz 1000 wjmol.xyz
#write_data w4.lmp
#dump 3 all custom 500 form.dat id type q x y z
#dump 4 all custom 500 velocity.dat id vx vy vz
#############################################################

# Dynamics

#############################################################
#fix 5 butt setforce 0.0 0.0 0.0
###########################################
thermo 500
timestep 0.5
thermo_style custom step ke etotal temp
thermo_modify lost ignore flush yes
fix NVT all nvt temp 300.0 300.0 50.0
run 10000

# NVE integration to update position and velocity for atoms in the group each timestep.

unfix NVT
fix 2 all nve
#velocity all scale 300.0
run 20000
write_data w4.lmp

Pbs file:
#!/bin/bash
#PBS -A open
#PBS -l walltime=03:00:00
#PBS -l nodes=2:ppn=8
#PBS -j oe
#PBS -N pmi_wh_relax

# Request 8 gigabyte of memory per process

#PBS -l pmem=112gb
ulimit -s unlimited

module purge
module use /gpfs/group/RISE/sw7/modules
#module load intel/19.1.2
#module load tbb/2020.8
#module load impi/2019
module load lammps/20200303
wait
cd /gpfs/group/jma6442/default/doecolab/tung/Meral/small/relax/trials/-15
#export OMP_NUM_THREADS=8

cd $PBS_O_WORKDIR
echo "Starting time: $(date)"
mpirun lammps < in.relax >> datar.dat
done
echo "Finishing time: $(date)"

I hope you can point out what I am doing wrong.

Thank you in advance,
Meral

Impossible to say from the information provided.
We also need to see the content of the "datar.dat" file and the output from your batch system (typically two files named pmi_wh_relax.e##### and pmi_wh_relax.o#####, where ##### is the job id of the particular job).

Sorry, I am not able to upload the files as I am a new user. Please find the contents of the files below:

Pmi_wh_relax.o

Starting time: Wed Aug 3 01:56:41 EDT 2022
[comp-hc-0001:37900:0:37900] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x4dd70000940c)
[comp-hc-0001:37904:0:37904] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x4dd700009410)
[comp-hc-0001:37908:0:37908] Caught signal 11 (Segmentation fault: tkill(2) or tgkill(2) at address 0x4dd700009414)
==== backtrace (tid: 37908) ====
0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
1 0x0000000000a44634 BOp() reaxc_bond_orders.cpp:0
2 0x0000000001624cdd Init_Forces_noQEq() ???:0
3 0x0000000001624835 Compute_Forces() ???:0
4 0x000000000281d1c6 LAMMPS_NS::PairReaxC::compute() pair_reaxc.cpp:0
5 0x0000000001f003d2 LAMMPS_NS::Verlet::run() ???:0
6 0x00000000015a2d0c LAMMPS_NS::Run::command() ???:0
7 0x00000000006e5af9 LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>() ???:0
8 0x00000000006dcf7c LAMMPS_NS::Input::execute_command() ???:0
9 0x00000000006dea78 LAMMPS_NS::Input::file() ???:0
10 0x000000000249160c main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x0000000000411d69 _start() ???:0

==== backtrace (tid: 37904) ====
0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
1 0x0000000000a44634 BOp() reaxc_bond_orders.cpp:0
2 0x0000000001624cdd Init_Forces_noQEq() ???:0
3 0x0000000001624835 Compute_Forces() ???:0
4 0x000000000281d1c6 LAMMPS_NS::PairReaxC::compute() pair_reaxc.cpp:0
5 0x0000000001f003d2 LAMMPS_NS::Verlet::run() ???:0
6 0x00000000015a2d0c LAMMPS_NS::Run::command() ???:0
7 0x00000000006e5af9 LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>() ???:0
8 0x00000000006dcf7c LAMMPS_NS::Input::execute_command() ???:0
9 0x00000000006dea78 LAMMPS_NS::Input::file() ???:0
10 0x000000000249160c main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x0000000000411d69 _start() ???:0

==== backtrace (tid: 37900) ====
0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
1 0x0000000000a44634 BOp() reaxc_bond_orders.cpp:0
2 0x0000000001624cdd Init_Forces_noQEq() ???:0
3 0x0000000001624835 Compute_Forces() ???:0
4 0x000000000281d1c6 LAMMPS_NS::PairReaxC::compute() pair_reaxc.cpp:0
5 0x0000000001f003d2 LAMMPS_NS::Verlet::run() ???:0
6 0x00000000015a2d0c LAMMPS_NS::Run::command() ???:0
7 0x00000000006e5af9 LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run>() ???:0
8 0x00000000006dcf7c LAMMPS_NS::Input::execute_command() ???:0
9 0x00000000006dea78 LAMMPS_NS::Input::file() ???:0
10 0x000000000249160c main() ???:0
11 0x0000000000022555 __libc_start_main() ???:0
12 0x0000000000411d69 _start() ???:0

/var/spool/torque/mom_priv/jobs/38281305.torque01.util.production.int.aci.ics.psu.edu.SC: line 27: syntax error near unexpected token `done'
/var/spool/torque/mom_priv/jobs/38281305.torque01.util.production.int.aci.ics.psu.edu.SC: line 27: `done'

datar.dat
LAMMPS (3 Mar 2020)
using 1 OpenMP thread(s) per MPI task
Reading data file …
orthogonal box = (-0.5 -0.5 -2.5) to (34.8 34.8 134)
2 by 2 by 4 MPI processor grid
reading atoms …
7986 atoms
reading velocities …
7986 velocities
read_data CPU = 0.0486756 secs
WARNING: Changed valency_val to valency_boc for X (…/reaxc_ffield.cpp:315)
Neighbor list info …
update every 1 steps, delay 10 steps, check yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 12
ghost atom cutoff = 12
binsize = 6, bins = 6 6 23
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair reax/c, perpetual
attributes: half, newton off, ghost
pair build: half/bin/newtoff/ghost
stencil: half/ghost/bin/3d/newtoff
bin: standard
Setting up Verlet run …
Unit style : real
Current step : 0
Time step : 0.5
Per MPI rank memory allocation (min/avg/max) = 36.09 | 90.96 | 124.7 Mbytes
Step KinEng TotEng Temp
0 7175.6537 -1565884.7 301.47566
500 7557.4924 -1566666.8 317.51811
1000 6965.243 -1567701.6 292.63553
1500 6992.4782 -1568203.8 293.77978
2000 7382.3586 -1568895.7 310.16009
2500 7277.0825 -1569670.7 305.73706
3000 6922.9652 -1569454.3 290.85928
3500 7157.462 -1568827.5 300.71136
4000 7505.7321 -1568334.4 315.34346

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 2092 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 2093 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 2094 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 2095 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 2096 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 2097 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 2098 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 2099 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 8 PID 2100 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 9 PID 2101 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 10 PID 2102 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 11 PID 2103 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 12 PID 2104 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 13 PID 2105 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 14 PID 2106 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 15 PID 2107 RUNNING AT comp-hc-0001
= KILLED BY SIGNAL: 9 (Killed)

OK, thanks.

This is an error from ReaxFF. It is difficult to say whether it is due to the ReaxFF implementation or to incorrect usage. Your version of LAMMPS is quite old, and we have made several fixes and updates to the ReaxFF implementation since then, so you may want to consider updating to the latest LAMMPS version.

Another thing to try is to compile a LAMMPS version with the KOKKOS package enabled for Serial (or OpenMP). The memory management in the LAMMPS ReaxFF implementation is very sensitive to significant changes in the geometry. The KOKKOS version has more robust memory management.
There are also some pair_style keywords, such as "safezone" and "mincap", that you can increase to make the non-KOKKOS version more tolerant.
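
As a rough sketch of that last suggestion, assuming the input file posted above, the two pair lines could be changed as shown below. The values 1.6 and 100 are illustrative guesses, not tuned recommendations; the documented defaults are 1.2 for safezone and 50 for mincap, and larger values cost more memory per MPI rank.

```
# Sketch only: the safezone and mincap values are assumed for illustration.
# Defaults are safezone 1.2 and mincap 50; raise them gradually until
# the run survives the geometry change.
pair_style reax/c NULL checkqeq no safezone 1.6 mincap 100
pair_coeff * * ./ffield W
```

In recent LAMMPS releases the same style is documented under the name reaxff, so the keyword descriptions are found on that page of the manual.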