GPU mixed precision

Would LAMMPS be able to run on the new Nvidia T4 GPU, which can run in mixed precision mode, but only SINGLE_HALF, i.e. fp32 and fp16?

https://www.nvidia.com/en-us/data-center/tesla-t4/

If not, what is the preferred GPU to run LAMMPS in mixed precision mode DOUBLE_SINGLE?

Collecting feedback from the list before I go off testing our problem sets, which involve mostly PMMA (poly(methyl methacrylate)), aka acrylic glass or Plexiglas …

-Henk

> Would LAMMPS be able to run on the new Nvidia T4 GPU, which can run in mixed precision mode, but only SINGLE_HALF, i.e. fp32 and fp16?

no, and it would not make much sense. computing forces all in single precision is already a significant approximation; it mostly works ok in homogeneous systems, where there is a lot of error cancellation.
using half precision in any form for force computations is not advisable.
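
to illustrate the kind of effect we are talking about (a standalone sketch, nothing to do with LAMMPS internals): accumulating many small contributions, as happens when summing per-pair forces, loses accuracy in fp32 that fp64 retains, and fp16 would be dramatically worse.

    // sketch: accumulate many small contributions in fp32 vs. fp64
    #include <cstdio>

    int main() {
        const int n = 10000000;        // ten million contributions
        const double term = 1.0e-7;    // exact sum would be 1.0

        float  sum_f = 0.0f;           // fp32 accumulator
        double sum_d = 0.0;            // fp64 accumulator
        for (int i = 0; i < n; ++i) {
            sum_f += (float) term;
            sum_d += term;
        }
        // rounding in each fp32 addition accumulates into a visible error,
        // while the fp64 result stays at 1.0 to many more digits
        printf("fp32: %.9f   fp64: %.15f\n", (double) sum_f, sum_d);
        return 0;
    }

in a homogeneous system such rounding errors are of essentially random sign and largely cancel between interactions; that is why all-single force kernels can still give usable trajectories there.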

> https://www.nvidia.com/en-us/data-center/tesla-t4/

for MD simulations you may be better off using volta generation hardware.

> If not, what is the preferred GPU to run LAMMPS in mixed precision mode DOUBLE_SINGLE?

there is no real preference, since the choice that individual people make depends on many factors like budget, size of deployment, environment, and other uses of the GPU nodes.

while you didn’t ask about it, i would also like to caution against using a mixed precision binary for constant pressure simulations, since the pressure is much more sensitive to reduced precision than the forces are. using all double precision is advisable in that case.
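
for completeness (option names quoted from memory, please check the docs for your version): with the CMake build the precision of the GPU package is fixed at compile time, so an all-double binary has to be built as such, e.g.

    cmake -D PKG_GPU=on -D GPU_PREC=double ../cmake    # all-fp64 GPU kernels
    cmake -D PKG_GPU=on -D GPU_PREC=mixed  ../cmake    # fp32 forces, fp64 accumulation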

> Collecting feedback from the list before I go off testing our problem sets, which involve mostly PMMA (poly(methyl methacrylate)), aka acrylic glass or Plexiglas …

that kind of information is not very helpful. the system under investigation has very little impact on how well you can use GPU acceleration. it is more a question of the number of GPUs per node, the architecture of each node, the total number of nodes, the kind of interconnect, and the system size. there are some benchmark results posted on the LAMMPS homepage at sandia to give you some general insight into what to expect.
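
as a starting point for your own tests (binary name and paths are placeholders, adjust to your build): the GPU package is enabled at run time with the -sf and -pk command line switches, and letting a few MPI ranks share one GPU is often faster than a single rank per GPU, e.g.

    mpirun -np 4 ./lmp -sf gpu -pk gpu 1 -in in.colloid    # 4 MPI ranks sharing 1 GPU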

axel.

Ok, agreed on the Volta when fp64 is required. Looking at my logs I’m now thoroughly confused.

I ran lmp_mpi_double_double using the in.colloid example and saw this in the logs:

[heme@…7208… lammps-5Jun19]$ egrep "^Device|500000" rtx-dd-1-1 t4-dd-1-1 | grep -v Overhead

rtx-dd-1-1:Device 0: Quadro RTX 6000, 72 CUs, 23/24 GB, 2.1 GHZ (Double Precision)
rtx-dd-1-1: 500000 1.9935932 0.097293139 2.0905319 1.0497421 22963.374
rtx-dd-1-1:Loop time of 361.205 on 1 procs for 500000 steps with 5625 atoms

t4-dd-1-1:Device 0: Tesla T4, 40 CUs, 15/15 GB, 1.6 GHZ (Double Precision)
t4-dd-1-1: 500000 1.9935932 0.097293139 2.0905319 1.0497421 22963.374
t4-dd-1-1:Loop time of 416.856 on 1 procs for 500000 steps with 5625 atoms

How/why does the T4, having no fp64 flops, come up with the same answer as the RTX GPU?

-Henk

> Ok, agreed on the Volta when fp64 is required. Looking at my logs I’m now thoroughly confused.
>
> I ran lmp_mpi_double_double using the in.colloid example and saw this in the logs:
>
> [heme@…7208… lammps-5Jun19]$ egrep "^Device|500000" rtx-dd-1-1 t4-dd-1-1 | grep -v Overhead
>
> rtx-dd-1-1:Device 0: Quadro RTX 6000, 72 CUs, 23/24 GB, 2.1 GHZ (Double Precision)
> rtx-dd-1-1: 500000 1.9935932 0.097293139 2.0905319 1.0497421 22963.374
> rtx-dd-1-1:Loop time of 361.205 on 1 procs for 500000 steps with 5625 atoms
>
> t4-dd-1-1:Device 0: Tesla T4, 40 CUs, 15/15 GB, 1.6 GHZ (Double Precision)
> t4-dd-1-1: 500000 1.9935932 0.097293139 2.0905319 1.0497421 22963.374
> t4-dd-1-1:Loop time of 416.856 on 1 procs for 500000 steps with 5625 atoms
>
> How/why does the T4, having no fp64 flops, come up with the same answer as the RTX GPU?

where does it say the T4 has no double precision floating point support?
to the best of my knowledge, it has FP64 units, just (like the consumer GPUs) far fewer of them, so nvidia doesn’t list them prominently, as they might look bad in the spec sheets compared to the other features.
the same applies to the quadro rtx 6000, which - according to wikipedia - has 0.5 TFLOP/s peak double precision performance compared to 16.3 TFLOP/s peak single precision performance.
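
if you want to convince yourself, the CUDA runtime will happily run fp64 arithmetic on a T4; a minimal sketch (assumes the CUDA toolkit is installed; compile with nvcc):

    // sketch: confirm the device executes genuine fp64 arithmetic
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void probe_fp64(double *out) {
        // 1.0 + 2.0e-16 is representable in fp64 but would round to 1.0 in fp32
        *out = fma(1.0e-8, 2.0e-8, 1.0);
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("device: %s, CC %d.%d, %d SMs\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount);

        double *d_out, h_out = 0.0;
        cudaMalloc(&d_out, sizeof(double));
        probe_fp64<<<1, 1>>>(d_out);
        cudaMemcpy(&h_out, d_out, sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_out);
        printf("fp64 probe: %.17g\n", h_out);  // prints 1.0000000000000002
        return 0;
    }

on turing the FP64:FP32 throughput ratio is 1:32, so the T4 computes full double precision results, just at a small fraction of its fp32 rate - which is why it can reproduce the rtx 6000 numbers, only more slowly.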

axel.