NEB He diffusion in W grain boundary

Hi (I’m using lammps-23Jun2022),

I am studying the migration of a W SIA (self-interstitial atom) and of He (a light impurity atom) in a W grain boundary (GB). I started with the migration of the SIA and had no problem running it. However, when trying to run the same configuration, but this time with a He atom, I get an error and I don’t understand why. My input script, which is the same for the SIA and the He case, is the following:

# NEB simulation 

units           metal

atom_style      atomic
atom_modify     map array
boundary        p p p
atom_modify     sort 0 0.0

#------------------------------------------------------------------
#  Define simulation box.
#------------------------------------------------------------------

read_data initial.lmp

#------------------------------------------------------------------
#   Define Interatomic Potential
#------------------------------------------------------------------

mass      1 183.846
mass      2 4.003
mass      3 1.00784

pair_style hybrid/overlay eam/alloy table linear 10000  lj/cut 7.913
pair_coeff   * * eam/alloy WHfff.eam.alloy W NULL H
pair_coeff 1 2 table W-He-Juslin.table WHe
pair_coeff 2 2 table He-Beck1968_modified.table HeHe
pair_coeff 2 3 lj/cut 5.9225E-4 1.333


# set up neb run

variable        u uloop 48

# fixed atoms

region bottom block INF INF INF INF 0 3
region top block INF INF INF INF 18 20.5
group bottom region bottom
group top region top
fix freeze1 bottom setforce 0.0 0.0 0.0
fix freeze2 top setforce 0.0 0.0 0.0


# initial minimization to relax the defect configuration

minimize        1.0e-6 1.0e-4 10000 10000

fix             1 all neb 1.0

thermo          100

# run NEB 

timestep        0.01
min_style       fire

neb             0.0 1e-4 1000 1000 10 final final.lammpstrj
unfix 1
write_dump all custom dump.neb.w.$u id type x y z
run 0
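
(For context, a NEB input like this has to be launched with one partition per replica. The line below is only an illustration of such a launch, assuming 48 replicas on 48 MPI ranks as suggested by the uloop value; the executable name and launcher options are placeholders, not my exact submission line.)

srun -n 48 lmp -partition 48x1 -in in.neb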

In fact, the main difference between the two cases is that in the initial configuration file (‘initial.lmp’) the He atom is type 2; the final file has no type column at all. (Also, both the initial and final configurations were relaxed beforehand.) However, when I run it I get the error message below after some timesteps (in this specific example it breaks at step 301):

remove mkl/2017.4 (LD_LIBRARY_PATH)
remove impi/2017.4 (PATH, MANPATH, LD_LIBRARY_PATH)
Set GNU compilers as MPI wrappers backend
load impi/2017.4 (PATH, MANPATH, LD_LIBRARY_PATH)
load mkl/2017.4 (LD_LIBRARY_PATH)
Fatal error in PMPI_Wait: Message truncated, error stack:
PMPI_Wait(219)....................: MPI_Wait(request=0x7fff41b0eb60, status=0x1) failed
MPIR_Wait_impl(100)...............: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 8 and tag 0 truncated; 10968 bytes received but buffer size is 10944
Fatal error in MPI_Irecv: Message truncated, error stack:
MPI_Irecv(170)......................: MPI_Irecv(buf=0x5015640, count=1368, MPI_DOUBLE, src=11, tag=0, MPI_COMM_WORLD, request=0x7ffcc2c02d70) failed
MPIDI_CH3U_Request_unpack_uebuf(618): Message truncated; 10968 bytes received but buffer size is 10944
Fatal error in PMPI_Wait: Message truncated, error stack:
PMPI_Wait(219)....................: MPI_Wait(request=0x7ffef4971520, status=0x1) failed
MPIR_Wait_impl(100)...............: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 15 and tag 0 truncated; 10968 bytes received but buffer size is 10944
Fatal error in MPI_Irecv: Message truncated, error stack:
MPI_Irecv(170)......................: MPI_Irecv(buf=0x5d87440, count=1368, MPI_DOUBLE, src=18, tag=0, MPI_COMM_WORLD, request=0x7ffc31eade70) failed
MPIDI_CH3U_Request_unpack_uebuf(618): Message truncated; 10968 bytes received but buffer size is 10944
Fatal error in PMPI_Wait: Message truncated, error stack:
PMPI_Wait(219)....................: MPI_Wait(request=0x7ffd55f448c0, status=0x1) failed
MPIR_Wait_impl(100)...............: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 22 and tag 0 truncated; 10968 bytes received but buffer size is 10944
Fatal error in PMPI_Wait: Message truncated, error stack:
PMPI_Wait(219)....................: MPI_Wait(request=0x7ffeb5efe900, status=0x1) failed
MPIR_Wait_impl(100)...............: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 23 and tag 0 truncated; 10968 bytes received but buffer size is 10944
Fatal error in PMPI_Wait: Message truncated, error stack:
PMPI_Wait(219)....................: MPI_Wait(request=0x7ffe472cd870, status=0x1) failed
MPIR_Wait_impl(100)...............: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 36 and tag 0 truncated; 10968 bytes received but buffer size is 10944
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 25069041.0 ON s23r2b12 CANCELLED AT 2022-09-01T23:36:59 ***
srun: error: s23r2b12: tasks 0,2-15,17-35,37-39,41,43,45-47: Killed
srun: launch/slurm: _step_signal: Terminating StepId=25069041.0
srun: error: s23r2b12: task 40: Killed
srun: error: s23r2b12: task 16: Killed
srun: error: s23r2b12: task 1: Killed
srun: error: s23r2b12: tasks 36,42: Killed
srun: error: s23r2b12: task 44: Killed

The problem seems to be the He atom, because if I just change it to type 1 (W), it runs perfectly. Any suggestion? Sorry if I didn’t sum it up well enough!

Jorge

I don’t quite understand your description. All replicas need to have the same number and types of atoms, so it is not quite clear what you mean about the He atom in the last replica.

Sorry if I explained myself badly. I meant that, since the final-configuration file for the last replica only specifies ID, x, y, z (no type), I didn’t change anything in that file; I mentioned it only to point out that the only ‘number’ I changed was the type in the initial file. I don’t understand why changing the type affects MPI. I thought maybe it is because of how I have defined the potential.
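
To make that concrete: as I understand the format, the final-configuration file read by the neb command only contains an atom count followed by IDs and coordinates, so there is no type to change there. A minimal sketch of such a file (the count, IDs, and coordinates below are made up, not my actual data):

3
1     0.000   0.000   0.000
2     1.582   1.582   1.582
150   3.165   0.000   1.582

The first line is the number of atoms listed in the file; each following line gives an atom-ID and the final x, y, z position of that atom.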

What could happen is that, when the initial path is created, some bad geometry is generated, which then gets worse as the individual minimizations proceed.

One way to check this would be to use “write_dump” (or “write_data”!) before the neb command, so that you get the starting geometry for each replica, perform an individual minimization for each of them, and then check whether any of those causes problems.
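
A minimal sketch of that check, reusing the uloop variable $u from your script so that every partition writes its own file before the neb command (the file names are just placeholders):

write_data      check.replica.$u.data
# or a plain coordinate dump instead:
# write_dump all custom check.replica.$u.dump id type x y z
neb             0.0 1e-4 1000 1000 10 final final.lammpstrj

Each of those per-replica files can then be read into an ordinary (non-NEB) run and minimized on its own, to see whether one of them misbehaves.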

Another approach to debug this would be to change the number of replicas used and see whether that has an impact.
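
As a sketch of that second test (the numbers are only an example; it assumes the job is launched with one partition per replica and that the total MPI rank count stays fixed):

srun -n 48 lmp -partition 48x1 -in in.neb      # 48 replicas, 1 proc each
srun -n 48 lmp -partition 24x2 -in in.neb      # 24 replicas, 2 procs each

Keep in mind that the count in “variable u uloop 48” only has to be at least the number of partitions, so it does not need to change when the replica count is reduced.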