GPU and MPI give different results?

Hi, LAMMPS users,

I am using LAMMPS to study water confined between graphene plates. I tested both the GPU and MPI versions with the same input file on the same workstation, but I found that they give different results for the forces.

I am interested in the force on the top plate.

The GPU run (lmp -in in.file -suffix gpu) gives:

Time-averaged data for fix AvgForce

TimeStep v_Fx v_Fy v_Fz

12000 -7.13704 -16.4764 -10.9221
13000 -5.14517 -15.1253 -6.36327
14000 -4.52943 -14.2805 -7.42559
15000 -5.48405 -14.9411 -5.76414
16000 -6.0178 -15.2261 -2.54366
17000 -6.48376 -15.3021 -3.01277
18000 -6.57296 -15.6543 -2.93381
19000 -6.55271 -15.8128 -2.21396
20000 -6.50738 -15.9492 -1.14217
21000 -6.64415 -16.0919 -1.06237

The MPI run (mpirun -np 8 lmp -in in.file -suffix off) gives:

Time-averaged data for fix AvgForce

TimeStep v_Fx v_Fy v_Fz

12000 0.113459 -0.565165 -12.9022
13000 -0.496116 -0.384446 -0.972526
14000 -0.767956 0.0540231 -0.467098
15000 -0.555208 -0.0063562 -2.35734
16000 -0.461405 0.0502843 -2.87141
17000 -0.50697 0.133622 -2.45268
18000 -0.539162 0.117244 -3.78739
19000 -0.485516 -0.0477696 -2.94935
20000 -0.406805 -0.14959 -2.33821
21000 -0.366608 -0.0976216 -2.28078

Then I tested the Lennard-Jones (12,6) liquid; there, both GPU and MPI give the same results.

GPU:

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz

190000 -8.71807 -8.58169 0.403003 0.609383 131.898 2.2943
191000 -8.61066 -8.47399 0.198354 0.40511 132.567 2.96295
192000 -8.52606 -8.38907 0.373179 0.580302 131.231 1.62694
193000 -8.5452 -8.4079 0.522183 0.729722 131.603 1.99896
194000 -8.79363 -8.65606 0.324429 0.53234 130.505 0.900753
195000 -8.99448 -8.85659 0.391631 0.599935 131.041 1.43689
196000 -9.11922 -8.98109 0.336555 0.545277 130.926 1.32213
197000 -9.14649 -9.00807 0.349229 0.558293 130.562 0.957633
198000 -9.18694 -9.04827 0.194978 0.404408 130.345 0.740839
199000 -9.371 -9.23209 0.442688 0.652438 129.218 -0.386337
200000 -9.61209 -9.47295 0.275291 0.485417 129.778 0.173577

MPI:

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz

190000 -8.86421 -8.86421 0.198975 0.198975 130.162 0.561966
191000 -8.95512 -8.95512 0.377872 0.377872 130.601 1.00124
192000 -9.04524 -9.04524 0.463869 0.463869 130.429 0.828658
193000 -9.26326 -9.26326 0.283051 0.283051 129.83 0.230195
194000 -9.23922 -9.23922 0.414481 0.414481 129.982 0.381742
195000 -9.20557 -9.20557 0.415045 0.415045 129.336 -0.264181
196000 -9.20473 -9.20473 0.417215 0.417215 130.188 0.587944
197000 -9.32355 -9.32355 0.302511 0.302511 129.508 -0.0921782
198000 -8.93576 -8.93576 0.319337 0.319337 128.712 -0.888238
199000 -9.15378 -9.15378 0.71008 0.71008 128.527 -1.07268
200000 -9.16993 -9.16993 0.655037 0.655037 128.936 -0.66446

Best wishes,

Wei

in+data+output.tar.bz2 (131 KB)

In addition, for the Lennard-Jones (12,6) liquid, I implemented both "compute group/group" and "variable fcm" to compute the total force on the top plate. They are supposed to give the same X and Y components of the force (there is a normal load in the Z direction). MPI indeed outputs the same values for both methods, but GPU does not.

The diagnostics and averaging you are doing have nothing to do with GPU vs CPU. Only the pair style computation is running on the GPU. So you must be getting different dynamics for your system in the 2 cases. I assume the thermo output is different for the 2 runs? Is it radically different from the start, or is it identical over some time scale (a few 1000 steps) and then diverges slowly? The latter would be typical, but it wouldn't explain why the averaged force on the wall is so different over long timescales.

Mike may wish to comment.

Steve

Hi, Steve,

Thank you very much for your reply.

This time I tested three configurations: plain MPI, GPU, and USER-CUDA, with the same .in file on the same workstation; the total force on the top plate is calculated using "variable fcm". MPI and USER-CUDA give similar results, and the agreement between them is fine. But the force from GPU is much larger, see below:

The GPU run (lmp -in in.file -suffix gpu -cuda off) gives:

Time-averaged data for fix AvgForce

TimeStep v_Fx v_Fy v_Fz


19000 -7.58045 -17.3652 -1.10412
20000 -7.4679 -17.3854 -1.65396
21000 -7.60392 -17.5728 -1.81681

The MPI run (mpirun -np 8 lmp -in in.file -suffix off -cuda off) gives:

Time-averaged data for fix AvgForce

TimeStep v_Fx v_Fy v_Fz


19000 -0.485516 -0.0477696 -2.94935
20000 -0.406805 -0.14959 -2.33821
21000 -0.366608 -0.0976216 -2.28078

The USER-CUDA run (lmp -in in.file -suffix cuda -cuda on) gives:

Time-averaged data for fix AvgForce

TimeStep v_Fx v_Fy v_Fz


19000 -0.396212 -0.280724 -0.455589
20000 -0.486512 -0.114931 -1.24892
21000 -0.463173 -0.064694 -0.521873

I put all the output files in the attachment.

In addition, I tested a simple Lennard-Jones (12,6) atomic liquid confined between two atomic walls.

I implemented both "compute group/group" and "variable fcm" to compute the total force on the top wall.

Both routes are supposed to give the same values for the X and Y components of the force (there is a normal load in the Z direction).

MPI and USER-CUDA indeed output the same values for both methods, but GPU does not.

MPI

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz


197000 -9.44235 -9.44235 -0.00165087 -0.00165087 130.309 0.708833
198000 -9.39362 -9.39362 0.062226 0.062226 131.623 2.02307
199000 -9.45229 -9.45229 -0.150994 -0.150994 131.047 1.4469
200000 -9.40386 -9.40386 -0.051843 -0.051843 130.23 0.630418

USER-CUDA

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz


197000 -10.202 -10.202 -0.935844 -0.935844 130.133 0.533177
198000 -10.19 -10.19 -0.882492 -0.882492 130.132 0.532177
199000 -10.1893 -10.1893 -0.795529 -0.795529 129.504 -0.0960205
200000 -10.3029 -10.3029 -0.737312 -0.737312 129.078 -0.521517

GPU

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz


197000 -9.14649 -9.00807 0.349229 0.558293 130.562 0.957633
198000 -9.18694 -9.04827 0.194978 0.404408 130.345 0.740839
199000 -9.371 -9.23209 0.442688 0.652438 129.218 -0.386337
200000 -9.61209 -9.47295 0.275291 0.485417 129.778 0.173577

However, it is interesting to notice that the difference among the packages is much smaller for the atomic system, which carries no charges.

Best wishes

Wei

in+data+output.tar.bz2 (139 KB)

Your input script worked for me and gave similar results to MPI for the force.friction file, but I am using the April 29th version of LAMMPS. Nothing has changed in the GPU package on the LAMMPS site that would cause this, but I will run regression tests with the current version to see if some problem has been introduced...

- Mike

Thank you for your reply.

I am using the Feb 22nd version. According to http://lammps.sandia.gov/bug.html, nothing has changed in the GPU package since 2 Jan 2013. Am I right?

Best wishes,

Wei

I think there were only minor changes to keep up with the code base; your version should work OK.

In your case, if you are running with GPU acceleration on an executable built with USER-CUDA enabled, you might try building without USER-CUDA. Clean everything, rebuild, and then try again.

- Mike

The regression tests all pass with the current version of LAMMPS. With the current version, both the CPU-only and GPU-accelerated runs lose atoms around timestep 9000 for your water+graphene input.

I don't think that there are any issues with using GPU acceleration.

On a side note for Steve, I did hit a couple of memcheck errors in fix move, in the memory-usage diagnostics, unrelated to GPU (I don't think these will have any effect on the dynamics, though):

Setting up run ...
==26176== Conditional jump or move depends on uninitialised value(s)
==26176== at 0x72BE34: LAMMPS_NS::FixMove::memory_usage() (fix_move.cpp:808)
==26176== by 0x81CFBB: LAMMPS_NS::Modify::memory_usage() (modify.cpp:1195)
==26176== by 0x86FAF1: LAMMPS_NS::Output::memory_usage() (output.cpp:762)
==26176== by 0x8728F8: LAMMPS_NS::Output::setup(int) (output.cpp:252)
==26176== by 0xBE29A2: LAMMPS_NS::Verlet::setup() (verlet.cpp:145)
==26176== by 0xBAA370: LAMMPS_NS::Run::command(int, char**) (run.cpp:169)
==26176== by 0x7EBEEE: LAMMPS_NS::Input::execute_command() (run.h:16)
==26176== by 0x7EC799: LAMMPS_NS::Input::file() (input.cpp:202)
==26176== by 0x803744: main (main.cpp:30)
==26176==
==26176== Conditional jump or move depends on uninitialised value(s)
==26176== at 0x72BEAD: LAMMPS_NS::FixMove::memory_usage() (fix_move.cpp:809)
==26176== by 0x81CFBB: LAMMPS_NS::Modify::memory_usage() (modify.cpp:1195)
==26176== by 0x86FAF1: LAMMPS_NS::Output::memory_usage() (output.cpp:762)
==26176== by 0x8728F8: LAMMPS_NS::Output::setup(int) (output.cpp:252)
==26176== by 0xBE29A2: LAMMPS_NS::Verlet::setup() (verlet.cpp:145)
==26176== by 0xBAA370: LAMMPS_NS::Run::command(int, char**) (run.cpp:169)
==26176== by 0x7EBEEE: LAMMPS_NS::Input::execute_command() (run.h:16)
==26176== by 0x7EC799: LAMMPS_NS::Input::file() (input.cpp:202)
==26176== by 0x803744: main (main.cpp:30)
==26176==

- Mike

The regression tests all pass with the current version of LAMMPS. With
the current version, both the CPU-only and GPU-accelerated runs lose
atoms around timestep 9000 for your water+graphene input.

I don't think that there are any issues with using GPU acceleration.

On a side note for Steve, I did hit a couple of memcheck errors in fix move, in the memory-usage diagnostics, unrelated to GPU (I don't think these will have any effect on the dynamics, though):

looks like displaceflag and velocityflag are not initialized in the constructor.
should be harmless. i'll double check and adopt it into LAMMPS-ICMS for now.
steve will hate me for flooding him with patches next week. :wink:

axel.

Thanks,

I rebuilt the Feb 22nd version: first "make clean-all", then "make no-user-cuda", and finally "make openmpi".

Now there is no USER-CUDA package, but it does not help: the force from GPU is still much larger than the CPU-only result.

Best wishes,

Wei

I am sorry for your troubles, Wei, but I don't think the problem is with the GPU package.

I think that your system is unstable, or possibly, but less likely, that there is another problem in LAMMPS outside of the GPU package. The results of your simulation are very sensitive to the number of MPI tasks used, and in the current version of LAMMPS your script results in lost atoms even without GPU acceleration.

I can try to help further, but since I cannot reproduce your issue, you will need to use the current version of LAMMPS to continue testing. Hopefully this will help you improve your script to produce more stable dynamics or, if not, to sync versions so that we can diagnose further.

- Mike

Fixed it - thanks Mike and Axel.

Steve

Thank you very much for your help !

Since you could not reproduce the issue with my input script, it seems that I did not set up LAMMPS correctly. I used the command "lmp_openmpi -in in.file -suffix gpu"; is that right?

I will test the current version soon.

Thanks again,

Best wishes,

Wei

Since you could not reproduce the issue with my input script, it seems that I did not set up LAMMPS correctly. I used the command "lmp_openmpi -in in.file -suffix gpu"; is that right?

The command is correct. I think the issue is that your results are sensitive to changes in the order of operations. This will change with the number of MPI tasks, the type of accelerator (LAMMPS chooses the number of threads performing force accumulation based on the type of GPU you have), the precision mode used for the GPU package, etc. In many cases it would be possible for me to match your run exactly given the "screen" output from your runs (the GPU package doesn't write extra info to the log file, but it does write the precision mode, thread count, etc. to the screen), but it is best to start with the current LAMMPS version.

A way to verify this is to not use "-suffix gpu" and instead put the "package gpu" option in the script directly. Do not use GPU acceleration until the final run, at which point you add "/gpu" to the pair style before that run. For double precision, initial results should match exactly and for single/mixed possibly differ in the trailing digits.
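
Mike's procedure could be sketched as the input fragment below. The pair style, coefficients, and run lengths are placeholders, not taken from Wei's script, and note that LAMMPS discards pair coefficients when pair_style is re-issued, so the pair_coeff lines must be repeated:

```
# sketch only: run on the CPU first, with the package command in the
# script instead of the -suffix gpu command-line flag (2013-era syntax)
package     gpu force/neigh 0 0 1

pair_style  lj/cut 10.0        # placeholder pair style and cutoff
pair_coeff  * * 0.07 3.4       # placeholder coefficients
run         10000              # CPU-only run: should match plain MPI

pair_style  lj/cut/gpu 10.0    # switch only the final run to the GPU
pair_coeff  * * 0.07 3.4       # coefficients must be repeated here
run         10000              # compare against a CPU continuation
```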

- Mike

For double precision, initial results should match exactly and for single/mixed possibly differ in the trailing digits.

I will just emphasize that this holds for the initial results. Over time, a GPU vs. CPU simulation will diverge due to this effect plus the others Mike mentioned (order of operations, etc.). Thus, at long times the two cases can be very different. If the system is stable/equilibrated, the two cases should be statistically identical, but if your system is borderline unstable (it appears to have occasional problems in both CPU and GPU mode), then either case could crash/blow up at randomly different times.
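
Steve's divergence argument is easy to demonstrate outside of LAMMPS. The sketch below uses a generic chaotic map, not molecular dynamics, to show how a perturbation in the last double-precision digit, which is what a changed order of operations amounts to, grows until two trajectories are completely decorrelated:

```python
# Two trajectories of the logistic map at r = 4 (a chaotic regime),
# with initial conditions that differ by 1e-15, i.e. roughly one ulp.
x_a, x_b = 0.4, 0.4 + 1e-15
sep_early = 0.0  # separation recorded after 10 steps
max_sep = 0.0    # largest separation seen over the whole run
for step in range(1, 101):
    x_a = 4.0 * x_a * (1.0 - x_a)
    x_b = 4.0 * x_b * (1.0 - x_b)
    if step == 10:
        sep_early = abs(x_a - x_b)
    max_sep = max(max_sep, abs(x_a - x_b))

print(f"separation after 10 steps: {sep_early:.3e}")   # still tiny
print(f"largest separation in 100: {max_sep:.3e}")     # grown to O(1)
# The separation grows roughly exponentially, so the runs agree closely
# at first and then decorrelate completely.
```

For a stable, equilibrated MD system this divergence only means sampling different members of the same statistical ensemble; for a borderline-unstable one, as Steve notes, either run can blow up at a different time.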

Steve

Hi, Steve and Mike.

Thank you very much for your help !

Best wishes,

Wei

Hi,

This is me again.

This time I tested a very simple system. I deleted all the water molecules, so only the two graphene sheets are left (named top and bot). The graphene sheets are held fixed during the whole simulation, and I calculate the total force on the top one by two methods:

  1. compute Force top group/group bot

  2. variable Fx equal fcm(top,x)
     variable Fy equal fcm(top,y)
     variable Fz equal fcm(top,z)

The distance between the sheets is 40 Ångström in the Z direction, and the cutoff radius is 10 Ångström; since the separation exceeds the cutoff, the sheets do not interact and the total force should be zero.
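
This can be checked with a few lines outside of LAMMPS; the epsilon and sigma values below are placeholders, the only point being that a truncated pair potential contributes exactly nothing beyond its cutoff:

```python
# Magnitude of a truncated 12-6 Lennard-Jones force: pairs separated by
# more than the cutoff rc are skipped, so their force is exactly zero.
def lj_force(r, eps=0.07, sigma=3.4, rc=10.0):
    if r >= rc:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r

print(lj_force(40.0))  # sheets 40 Angstrom apart: exactly 0.0
print(lj_force(3.8))   # inside the cutoff: nonzero
```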

The graphene sheet was generated by VMD. Same input file and same workstation.

MPI indeed gives zero, and the results are independent of the number of processors:

2 processors:

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz

100 0 6.32957e-10 0 2.19357e-10 0 0
200 0 6.32957e-10 0 2.19357e-10 0 0

4 processors:

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz

100 0 4.2921e-10 0 4.51444e-10 0 0
200 0 4.2921e-10 0 4.51444e-10 0 0

6 processors:

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz

100 0 1.87628e-09 0 -6.57828e-10 0 0
200 0 1.87628e-09 0 -6.57828e-10 0 0

But "variable fcm" gives wrong numbers with the GPU package, far from zero.

GPU:

Time-averaged data for fix AvgForce

TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz

100 0 -6.37399 0 -5.88719 0 0
200 0 -6.37399 0 -5.88719 0 0

Best wishes,

Wei

system.data (97 KB)

spce_graphene_fix_press.in (2.34 KB)

This is for GPU or USER-CUDA? And
can you list your launch commands and
how many procs you ran on?

Steve

This is for GPU or USER-CUDA? And
can you list your launch commands and
how many procs you ran on?

and also provide the output telling us what level of floating point
precision you compiled in.

axel.

nevermind. after looking at the output, the explanation is obvious.
you must be compiling your GPU support in all single precision.

you have a *big* problem with your potential parameters for the
sheets, since they result in *huge* forces. even though they should
cancel, they are subject to floating point truncation errors. remember
that in single precision you have only about 7 valid digits of accuracy,
while in double precision there are about 15. this whole thing becomes
a problem if you are summing up large numbers that should result in a
small number. so here are results with different configurations:

[[email protected] gpu-mpi]$ head -3 force.friction-*
==> force.friction-cpu <==
# Time-averaged data for fix AvgForce
# TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz
100 0 4.1927e-10 0 4.13735e-10 0 0

==> force.friction-double <==
# Time-averaged data for fix AvgForce
# TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz
100 0 -1.42867e-09 0 -3.71273e-09 0 0

==> force.friction-mixed <==
# Time-averaged data for fix AvgForce
# TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz
100 0 -5.85e-06 0 0.0230572 0 0

==> force.friction-single <==
# Time-averaged data for fix AvgForce
# TimeStep c_Force[1] v_Fx c_Force[2] v_Fy c_Force[3] v_Fz
100 0 -5.72074 0 -6.01798 0 0

as you can see, when i compile GPU support in full double precision,
i get results comparable to the CPU, while with mixed precision you
already lose a lot of precision, and even more when doing everything in
single precision.
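
axel's cancellation argument can be reproduced in a few lines of plain Python (hypothetical numbers, not Wei's actual forces): build a set of large contributions whose true sum is exactly zero, then accumulate once in double precision and once with every intermediate value rounded to single precision:

```python
import random
import struct

def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

random.seed(12345)
big = [random.uniform(1e6, 1e7) for _ in range(100_000)]
forces = big + [-x for x in big]  # exact pairwise cancellation: true sum 0
random.shuffle(forces)            # ...but summed in an arbitrary order

total64 = 0.0
total32 = 0.0
for f in forces:
    total64 += f                     # ~15-16 significant digits carried
    total32 = f32(total32 + f32(f))  # only ~7 significant digits carried

print(f"double precision total: {total64:.6g}")  # tiny residual
print(f"single precision total: {total32:.6g}")  # residual many orders larger
# Each single-precision addition rounds the running sum (which can reach
# ~1e9 here) to ~7 digits; those rounding errors accumulate instead of
# cancelling, which is the same effect seen in the GPU runs.
```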

in short: the GPU code works as advertised, and you are getting what you
asked for. PEBKAC. :wink:

coming back to your original model: you don't really need the
intra-sheet forces, do you? why not set epsilon to zero for those pairs
and have only the graphene/water interaction computed?
you may also check whether those parameters are proper...
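
axel's epsilon suggestion could look like the fragment below. The atom-type numbers and coefficient values are placeholders for whatever the data file actually uses, and note that with pairwise mixing a zeroed carbon-carbon epsilon would also zero the mixed carbon-water terms, so those must be kept explicit:

```
pair_coeff  1 1 0.0    3.4     # C-C (placeholder types): epsilon = 0
pair_coeff  1 2 0.0937 3.28    # C-O kept explicitly, since mixing it
                               # from the zeroed C-C pair would vanish
```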

axel.