Strange behavior and results when using CUDA

Dear users & developers

I am trying to run a tensile simulation of Al using CUDA.
The problem is that with CUDA the results (Pxx, Pyy, Pzz, lx, ly, lz, etc.)
are totally different from those obtained with MPI.

Is there something I missed, or is this a bug in LAMMPS?
Any help fixing this would be greatly appreciated.

I used the LAMMPS tutorial input file [https://icme.hpc.msstate.edu/mediawiki/index.php/Uniaxial_Compression] and changed it a little.

Here is the input file & output.

I would start with this question: if you run the identical script on
1 processor with no CUDA (CPU only) vs. CUDA, do you get the same
answer, and if so, for how many timesteps?
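
For example, a minimal way to make that comparison could look like the
following (the executable names and the input file name in.tensile are
illustrative; they depend on how your LAMMPS was built):

  # CPU run, 1 process, no acceleration package
  lmp_serial -in in.tensile

  # same input through the USER-CUDA package
  # (-c on enables the package, -sf cuda switches styles to their cuda variants)
  lmp_cuda -c on -sf cuda -in in.tensile

Setting "thermo 1" in the input and diffing the thermo output of the two
runs will show the first timestep at which Pxx, Pyy, Pzz or the box
dimensions diverge.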

Steve

This is likely a bug in the USER-CUDA package, but looking at your input,
there is little reason to use USER-CUDA.

With such a small number of atoms, and due to the use of fix print, your
use of the GPU will be quite inefficient. You are likely to run faster
with MPI or with the GPU package instead (or both).
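
As a sketch, the two alternatives might be launched like this (executable
names are again illustrative; depending on the LAMMPS version, the GPU
package may also need a "package gpu" command in the input script or the
-pk command-line switch):

  # plain MPI, e.g. 4 processes
  mpirun -np 4 lmp_mpi -in in.tensile

  # GPU package: the -sf gpu suffix switches the pair style to its gpu variant
  mpirun -np 4 lmp_gpu -sf gpu -in in.tensile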

axel.