Problem with GPU

Dear All,

I have compiled the latest version of LAMMPS with GPU support (GeForce RTX 4070 Ti). When I run an input file on the GPU, it runs for 140000 steps and then terminates with this error:
“Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node rese-pc-52 exited on signal 9 (Killed).”

I can run this input on the CPU without any problem. I should mention that the same problem also occurs with other input files. Do you have any comments on my problem? I think there is no problem with my input file; it should be something related to the GPU.

Thanks in advance

This is an unscientifically imprecise description. Which version exactly? What counts as the “last” version depends on many factors, and we have seen people claim very different versions as “the last”, not to mention that it depends on when somebody reads this.

GPU package or KOKKOS package? How compiled? With which settings?
What is the exact input? What is the command line? What is the hardware and OS?

When a run continues for so many steps instead of terminating immediately, signs of the cause may already be visible in the output before the actual error. Also, the cause can be independent of LAMMPS (e.g. overheating). In addition, the error message you quote is from the MPI library and thus provides no useful information at all about what caused the crash.

Which other input files? Do they also fail after a sizeable number of steps, or immediately?
Does it always fail at the same step number or at different ones?
Can it also be reproduced with examples from the LAMMPS distribution, e.g. in the “bench” folder? If necessary, after increasing the number of steps, since the default is small.

There is not enough information here to make any meaningful assessment.

There is not enough information here to support this conclusion. It can very well be a marginal input that then fails when run on a different architecture or with a slightly different implementation of the force kernels.

Signal 9 out of the blue is often the Linux OOM (out-of-memory) killer; see Out Of Memory Management. So I’d first check that you have enough RAM for the simulation.
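
One way to check for this, as a rough sketch only: the kernel normally logs a line such as "Out of memory: Killed process <pid> (<name>)" when the OOM killer fires, so scanning the kernel log after the crash can confirm or rule it out. The Python helper below is hypothetical (the names find_oom_kills and read_kernel_log are mine, not part of LAMMPS or MPI), it assumes the usual Linux message wording, and reading the log via dmesg may require root on some systems.

```python
import re
import subprocess

# Assumed wording of the kernel's OOM-killer log line; older kernels print
# "Kill process", newer ones "Killed process".
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(log_text):
    """Return a list of (pid, process_name) for each OOM-killer entry found."""
    kills = []
    for line in log_text.splitlines():
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def read_kernel_log():
    """Return the kernel log as text, or an empty string if dmesg is unavailable."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:
        # dmesg is not installed or cannot be executed on this system.
        return ""
    return result.stdout


if __name__ == "__main__":
    for pid, name in find_oom_kills(read_kernel_log()):
        print(f"OOM killer terminated PID {pid} ({name})")
```

If the LAMMPS executable shows up in that list, the job ran out of host RAM rather than hitting a GPU-specific fault. If nothing shows up, the log may simply have rotated or be unreadable without root, so an empty result does not rule out the OOM killer by itself.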