lammps/gpu on cylindrical Mac Pro, OpenCL vs. CUDA

Thanks for pointing out that page Steve, very useful to me.

I'm making progress on getting gpu-lammps working on a Mac Pro, I just have a few precision-related questions left.

The manual says to test for the effect of single or mixed precision instead of double precision. How transferable are such tests? For example, if a calculation runs fine at equilibrium energies, would it still be ok in an ion bombardment simulation, where energy gradients and thus forces can be much, much greater? And if potentials for one element within eam/gpu work ok in single_double, would potentials for other elements in eam/gpu then likely be good in single_double too?

And what is the default precision in cpu-only lammps? Is it the double precision in which cpu results on the benchmarking page are reported?

And are single_single and single_double known to have led to erroneous results, and if so, in what kind of calculations did that typically happen?

And finally, would I be correct in thinking that the gpu section 5.3.2 of the manual contains a few bits that are out of date? It describes setting the precision through lines like
CUDA_PREC = -D_SINGLE_SINGLE # single precision for all calculations
and also about requiring an Nvidia card. From the Makefile.mac_opencl makefile in the lib/gpu directory I take it that precision should be set through something like OCL_PREC = -D_SINGLE_DOUBLE?
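If I read the lib/gpu sources correctly, the full set of options would then be something like the following (please correct me if I have the macro names wrong):

OCL_PREC = -D_SINGLE_SINGLE   # single precision for all calculations
OCL_PREC = -D_SINGLE_DOUBLE   # mixed: single precision compute, double precision accumulation
OCL_PREC = -D_DOUBLE_DOUBLE   # double precision for all calculations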

greets,
Peter

> The manual says to test for the effect of single or mixed precision instead of double precision. How transferable are such tests? For example, if a calculation runs fine at equilibrium energies, would it still be ok in an ion bombardment simulation, where energy gradients and thus forces can be much, much greater? And if potentials for one element within eam/gpu work ok in single_double, would potentials for other elements in eam/gpu then likely be good in single_double too?

There is no simple answer to that Q.

People like Trung or Mike may have comments, but
you’ll have to just try it with your model.

Steve

Thanks for that info Steve.

One more small question. How can I set the OpenCL workgroup size? A paper by David A. Richie says that it is a runtime parameter, but I couldn't find how to set it in the manual or in the example input files under examples/accelerator. Is there a way to set it in the input file?

greets,
Peter

> Thanks for that info Steve.
>
> One more small question. How can I set the OpenCL workgroup size? A paper by David A. Richie says that it is a runtime parameter, but I couldn't find how to set it in the manual or in the example input files under examples/accelerator. Is there a way to set it in the input file?

please check out lib/gpu/lal_preprocessor.h and the function int DeviceT::set_ocl_params() in lib/gpu/lal_device.cpp.

those define several presets for different OpenCL devices/vendors.
those can be selected via the package gpu command.
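for example, something along these lines (i'm writing the keyword and preset names from memory here, so treat them as an assumption and double-check the package command doc for the exact spelling):

package gpu 1 device generic    # conservative default preset
package gpu 1 device fermi      # preset tuned for nvidia fermi-class gpus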

i still would rather go for a linux box with an nvidia GPU, especially
for the visualization.

axel.

To specify the workgroup size for pair force calculations (i.e. _block_pair in DeviceT) you can use the parameter blocksize in the package gpu command, e.g.

package gpu 1 blocksize 128

This parameter has not been documented yet; it will be in the next patch to the gpu package. Note that if the block size is too big for a small system, there will currently be a runtime error.
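In a full input script the fragment would sit near the top, something like this (eam/alloy here is only a placeholder for whatever pair style you actually use):

package gpu 1 blocksize 128   # one GPU, 128-thread workgroups for the pair force kernel
suffix gpu                    # append /gpu to styles defined below
pair_style eam/alloy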

-Trung

Thanks Trung.

I suspect I've come to the point in my testing where gpu memory is by far the most important bottleneck. Is there, in addition to blocksize, possibly another undocumented input parameter that can be used to control how many blocksized jobs are sent to the gpu at once? Or is the total workload always sent to the gpus simultaneously (in that case, what does the blocksize parameter do?)?

Axel, in an earlier email you mentioned oversubscribing gpus in order to squeeze out more performance. How is this done, and could the same method be used to somewhat 'undersubscribe' a gpu, thereby hopefully also taxing the gpu RAM less? The loss of performance would be a shame, but the condition of getting the stuff to run at all does need to be met first.

greets,
Peter

> Thanks Trung.
>
> I suspect I've come to the point in my testing where gpu memory is by far the most important bottleneck. Is there, in addition to blocksize, possibly another undocumented input parameter that can be used to control how many blocksized jobs are sent to the gpu at once? Or is the total workload always sent to the gpus simultaneously (in that case, what does the blocksize parameter do?)?

i don't think blocksize tuning will help there. as far as i know, the
largest memory eater is the neighbor list data (and particularly the
space needed to build the lists in parallel), which remains in the
GPU. so you would have to tweak settings that influence that. you can
easily make some tests on the CPU to see how neighbor list settings,
e.g. skin distance and cutoff impact the memory consumption.
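in the input that would be something along these lines, just to see the trend (the values are only illustrative):

neighbor 1.0 bin                          # smaller skin -> smaller neighbor lists, more frequent rebuilds
neigh_modify every 1 delay 0 check yes    # rebuild whenever the check says it is needed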

> Axel, in an earlier email you mentioned oversubscribing gpus in order to squeeze out more performance. How is this done, and could the same method be used to

you simply attach multiple MPI tasks to the same GPU. it should be
all documented.
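e.g. something like this would put 4 MPI ranks on the single GPU of a node (the binary name and input file are just examples):

mpirun -np 4 ./lmp_mac_opencl -sf gpu -pk gpu 1 -in in.test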

> somewhat 'undersubscribe' a gpu, thereby hopefully also taxing the gpu RAM less? The loss of performance would be a shame, but the condition of getting the stuff to run at all does need to be met first.

the only way to "undersubscribe" is to use more nodes and thus more
GPUs for the same job, as that would reduce the number of atoms a
single GPU "owns" and thus needs to compute and store neighbor lists
for.

axel.

FWIW, just saw this, which confirms my observations from interacting with OpenCL developers using MacOSX with GPUs, e.g. for VMD or OpenMM.

http://preta3d.com/os-x-users-unite/

axel.

Thanks for the follow-up info Axel.

Indeed, it doesn't sound very good. In double precision my eam/alloy/gpu calculations seem to work quite well on the Mac Pros, but in mixed precision there are crashes. I'm not sure whether that is down to the issues you and the page you linked to mention.

greets,
Peter