Two runs do not produce the same results when Kokkos is installed

Ok!
Like this?

Well, it would have been better to properly re-send your e-mail.

At any rate, you are not really explaining specifically which result differences you are talking about. I have to guess that you are referring to the temperature output.

Now, your statement about expecting the same results is correct in principle, but you are not accounting for the specifics of floating-point math and the fact that it is not associative. This means the value of a sum depends on the order in which the numbers are added up. If the result is close to 0, this can lead to significant relative differences (not absolute ones). On GPUs specifically, you have no control over how threads are scheduled and thus over the order of operations. While you are getting numbers that differ, you should have noticed that the differences are tiny. Set the initial temperature to a small value like 0.1 K. Do you still see “large” differences?
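To make the non-associativity point concrete, here is a minimal plain-Python sketch (an illustration only, not LAMMPS code): the grouping and order of a floating-point sum changes the result, and while the absolute spread is tiny, it becomes a large relative difference whenever the accumulated quantity is close to zero.

```python
# Plain-Python sketch (not LAMMPS code): floating-point addition is not
# associative, so the order of summation changes the result.
import random

# Classic three-term example: mathematically, both expressions equal 1.0.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # prints 1.0
print(a + (b + c))   # prints 0.0, because b + c rounds back to -1e16

# Summing the same values in shuffled orders usually gives totals that
# differ in the last bits.  The absolute spread is tiny, but if the
# accumulated quantity happens to be close to zero, that same spread
# becomes a large *relative* difference.
random.seed(0)
terms = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

totals = []
for _ in range(5):
    random.shuffle(terms)      # mimic an uncontrolled summation order
    totals.append(sum(terms))

print("spread between summation orders:", max(totals) - min(totals))
```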

Axel

I finally had the opportunity to download the attachments you sent to me personally.

It is extremely frustrating to see that those are very different from what the demo input you posted earlier produces.
The fact that you do not explain specifically when and where the difference occurs, and how large it is, makes this even more irritating.

So, in the files and outputs that you are actually presenting as evidence of a bug, the situation is very different from the example you posted:
there you do have an initial temperature assigned (and you are possibly using an incorrectly constructed data file).

Nevertheless, the situation is effectively the same as before. Since on GPUs you have no control over how data is assigned to threads and summed up (that is done by the hardware thread scheduler inside the GPU), your execution is not 100% deterministic, so the issues with non-associative floating-point math still apply; only the impact is more subtle. For about 1000 MD steps you get identical results to within the precision of the screen output, and then the exponential divergence that you get with any chaotic system (i.e. one governed by coupled non-linear differential equations) slowly creeps in.
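As a toy illustration of that exponential divergence (using the chaotic logistic map rather than an MD trajectory, so this is only an analogy): a perturbation the size of a single rounding error stays invisible for a while and then grows exponentially until the two trajectories have nothing to do with each other.

```python
# Toy illustration of exponential divergence in a chaotic system, using the
# logistic map instead of an MD integrator: a perturbation the size of a
# single rounding error grows exponentially until the two trajectories are
# completely decorrelated.

x1 = 0.3
x2 = 0.3 + 1e-15   # difference comparable to one floating-point rounding error

for step in range(0, 61, 10):
    print(f"after {step:2d} iterations: |x1 - x2| = {abs(x1 - x2):.3e}")
    for _ in range(10):
        x1 = 4.0 * x1 * (1.0 - x1)   # logistic map in its chaotic regime
        x2 = 4.0 * x2 * (1.0 - x2)
```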

In short, unless you reprogram the GPU kernels (or all of LAMMPS) to do fixed-point math (or use scaled integers), you will always have to expect this behavior whenever you cannot enforce the order of operations. You get the same kind of divergence when using a different number of MPI ranks or CPU threads; how visible it is, however, depends on many details.
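Here is a hedged sketch of the scaled-integer (fixed-point) idea mentioned above, purely as an illustration and not a description of how LAMMPS or its GPU kernels are implemented: once every contribution is converted to an integer with a fixed scale factor, the accumulation is exact and therefore order-independent.

```python
# Sketch of the "scaled integers" / fixed-point idea (illustration only):
# integer accumulation is exact and associative, so any summation order
# yields a bit-identical total, while a plain floating-point sum does not.
import random

SCALE = 2**32   # arbitrary fixed-point scale factor chosen for this sketch

random.seed(0)
contributions = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

def float_total(values):
    return sum(values)                                   # order-dependent rounding

def fixed_point_total(values):
    ticks = sum(int(round(v * SCALE)) for v in values)   # exact integer arithmetic
    return ticks / SCALE

float_totals, fixed_totals = set(), set()
for _ in range(5):
    random.shuffle(contributions)          # mimic an uncontrolled summation order
    float_totals.add(float_total(contributions))
    fixed_totals.add(fixed_point_total(contributions))

print("distinct floating-point totals:", len(float_totals))   # usually more than 1
print("distinct fixed-point totals   :", len(fixed_totals))   # always exactly 1
```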

I should also add that this is not at all a new observation; this kind of thing has been discussed on this very mailing list many times. Perhaps a more thorough look into the mailing list archives could have saved you a lot of time and trouble.

axel.

P.S.: In the future, please try to be more accurate and specific in the inputs you provide and the claims you make, and everybody will benefit.

Hello Dr. Kohlmeyer:

Thank you for your reply, and especially for spending your time on this during the weekend. I apologize if I offended you in some way.

I should explain that I did mention in my last email that I would send a “less simplified” version of my input script and the corresponding thermo outputs, which are closer to the simulation I am working on.

I did look into other posts on the mailing list, but I thought my case was different because those involve different versions of LAMMPS or different simulation setups, such as a different number of MPI tasks or different acceleration packages. In my case, I ran this script on the same node with the same executable; the only variable between the two runs that I knew of was time, and I did not think there was any randomness in the tests I did. But if the randomness comes from the GPU thread scheduler, then I was wrong, and I appreciate learning that, unlike CPU computations, GPU computations are not 100% deterministic.

Besides, I ran the same input script with the same data file twice on CPU nodes (MPI), and they produced identical results even after 18000 steps. I am no expert in high-performance computing; I thought this was a “bug” because, judging from the two CPU runs, identical thermo outputs could be expected.

In summary: on CPUs we can enforce the order of operations, so we make identical non-associative math errors every time and thus see no divergence, but we cannot enforce it on GPUs because the hardware thread scheduler behaves differently every time I submit a job? In other words, the randomness comes from how the GPU kernels assign and sum the data. Am I correct?

If the statement above is correct, then I will just compare the time-averaged results; I am happy if they are effectively the same.

Thanks.

Jianlan

Besides, I ran the same input script with the same data file twice on CPU nodes (MPI), and they produced identical results even after 18000 steps. I am no expert in high-performance computing; I thought this was a “bug” because, judging from the two CPU runs, identical thermo outputs could be expected.

But change the number of CPUs used, or use load balancing, and the trajectories will eventually diverge as well.

In summary: on CPUs we can enforce the order of operations, so we make identical non-associative math errors every time and thus see no divergence, but we cannot enforce it on GPUs because the hardware thread scheduler behaves differently every time I submit a job? In other words, the randomness comes from how the GPU kernels assign and sum the data. Am I correct?

If the statement above is correct, then I will just compare the time-averaged results; I am happy if they are effectively the same.

You are missing an important concept of statistical mechanics here. It does NOT matter if those trajectories diverge, as long as they are sampling the same phase space. Due to rounding and floating-point truncation you make an error EVERY time. In fact, you will get a better assessment of the statistical relevance of your results if you start multiple calculations from statistically uncorrelated initial conditions and then combine the results.
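A small sketch of what “combine the results” can look like in practice, using hypothetical numbers: average the observable over each independent run, then report the mean across runs with its standard error, rather than comparing two diverging trajectories step by step.

```python
# Sketch with hypothetical numbers: combine the time-averaged observable
# from several independent runs (each started from statistically
# uncorrelated initial conditions, e.g. different velocity seeds) into one
# mean with a statistical error estimate.
import math

# Hypothetical time-averaged temperatures (in K) from four independent runs.
run_averages = [299.7, 300.4, 300.1, 299.9]

n = len(run_averages)
mean = sum(run_averages) / n
variance = sum((x - mean) ** 2 for x in run_averages) / (n - 1)   # sample variance
std_err = math.sqrt(variance / n)                                 # standard error of the mean

print(f"combined average: {mean:.2f} +/- {std_err:.2f} K")
```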

axel.

Atomic writes on GPUs are not guaranteed to happen in the same order, so this can lead to diverging trajectories over time. As long as the thermo output on timestep 0 is very close between runs and the output for the first ~100 timesteps stays close, it is probably fine. If you run fix langevin on GPUs, then the random number sequence will be different every time. Some operations on the CPU are not deterministic either; for example, when the box becomes too deformed and flips, MPI operations occurring in a different order can produce the same effect in LAMMPS.

Stan
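A short sketch of the kind of check Stan describes, assuming the numeric thermo rows have already been extracted from the two log files into lists of floats (that parsing step is omitted here since it depends on the thermo_style in use): compare the runs value by value with a tolerance instead of expecting bit-identical output.

```python
# Sketch of the "is it probably fine?" check: compare the thermo output of
# two runs value by value with a tolerance.  Assumes the numeric thermo rows
# have already been extracted into equal-length lists of floats.
import math

def first_disagreement(rows_a, rows_b, rel_tol=1e-6, abs_tol=1e-8):
    """Return (row index, column index) of the first value pair that differs
    beyond the tolerances, or None if the two runs stay close throughout."""
    for i, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        for j, (va, vb) in enumerate(zip(row_a, row_b)):
            if not math.isclose(va, vb, rel_tol=rel_tol, abs_tol=abs_tol):
                return (i, j)
    return None

# Hypothetical excerpts (columns: Step, Temp, TotEng) from two runs.
run1 = [[0, 300.00, -1234.5678], [100, 298.71, -1234.5681]]
run2 = [[0, 300.00, -1234.5678], [100, 298.72, -1234.5683]]
print(first_disagreement(run1, run2, rel_tol=1e-4))   # None -> agree to ~1e-4
```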