Anyone for testing on a C4130 quad-K80 node?

Hello people,

Our group recently added a Dell quad-K80 C4130 node to its computing cluster. That is a very GPU-heavy node with a total of 96 GB of VRAM. So far, we've achieved up to a 10x speedup over using only the 32 Xeon cores in that system, so we're getting the computing power of over 300 CPU cores out of a 1U box. And there are still several ways for us to try to squeeze more performance out of it.

Prior to purchasing that machine, I received very useful help from various people, including one very informative, long reply from Axel on the pros and cons of the various configurations that machine can be customized with.

Let me see if I can return the favour and maybe be helpful to some people here. If anyone is considering using the GPU package and would like to know what sort of benefit their work would get from it, feel free to contact me and I can run a few tests.

greets,
Peter

Good to know, Peter,

Are you using the CUDA, GPU, or KOKKOS package?

Which system sizes could you scale to with many GPUs?

If you are using the GPU package, how many tasks per GPU are you using? I have found that with more tasks, performance can be greatly enhanced.

Greets,

Hi James,

Are you using the CUDA, GPU, or KOKKOS package?

The GPU package. I tried the CUDA package first, and it was faster. However, I ran into serious bugs, and I read on the list here that the developer who wrote that package is unfortunately working on something else now.

Which system sizes could you scale to with many GPUs?

For a fully periodic system of EAM metals, I could go up to just over 60 million atoms using the entire machine. For systems with open vacuum, the number decreases a bit.

If you are using the GPU package, how many tasks per GPU are you using? I have found that with more tasks, performance can be greatly enhanced.

So far it's been one task per GPU, with 2, 3, or 4 MPI processes per GPU. Most often, 3 MPI processes to one GPU is fastest.

If you say that oversubscribing can give better results, then I can certainly try that. Is it possible to just tweak the line in the in-file to oversubscribe, like

package gpu 6 gpuID 0 0 0 1 1 1

If not, what is the syntax for using more than one task per gpu?

greets,
Peter

Hi Peter,

I currently have access to nodes with 20 Intel cores and 3 K20s each.

I found that using 6 MPI tasks per GPU was best for my case (water with flexible SPC/E and CHARMM for the hydrocarbons).

I run with the slurm job scheduler, without touching the input file, with this command:
srun -n 18 $LMP -sf gpu -pk gpu 3 < in.lammps > out.lammps

$LMP is the path to the LAMMPS executable.

Running with plain mpirun should be similar:
mpirun -np 18 $LMP -sf gpu -pk gpu 3 < in.lammps > out.lammps
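
If you would rather set this in the input file instead of on the command line, I believe the equivalent is to put something like the following near the top of in.lammps, before the pair_style line (I have not tested the in-file route myself, so please double-check against the package and suffix doc pages):

package gpu 3
suffix gpu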

In my case, I was able to obtain acceptable scaling for a 200k-atom system with up to 3 nodes, i.e. 9 GPUs (about 80 steps per second, compared to 30 with one node). I did not try more because we only have 3 GPU nodes in the cluster.

I think you should try more MPI tasks per GPU, since with the GPU package the CPUs and GPUs work together, so the extra tasks also keep the CPU cores busy.
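
For example, if your quad-K80 box shows up as 8 CUDA devices and you want to fill all 32 cores (I am only guessing at your setup here), something along the lines of

mpirun -np 32 $LMP -sf gpu -pk gpu 8 < in.lammps > out.lammps

would give you 4 MPI tasks per device, with the pair forces running on the GPUs and the rest of the work spread over all the CPU cores.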

Hi James,

Ah, it seems I misread what you meant by "tasks per GPU" in your earlier post. I thought you meant something like splitting the work for one GPU into two smaller GPU tasks and then 'oversubscribing' one GPU with those two tasks, whereas you meant multiple MPI processes per GPU.

Yes, we have done the latter. For EAM calculations the sweet spot is often 3:1, though we've also seen cases where 2:1 or 4:1 wins. We have never found a ratio as high as your 6:1 to be optimal.
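
To make that concrete: assuming the four K80s are seen as 8 CUDA devices, our 3:1 runs boil down to something like

mpirun -np 24 $LMP -sf gpu -pk gpu 8 < in.lammps > out.lammps

(simplified a bit; the real job scripts have more in them).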

greets,
Peter