Fair comparison of CPU vs KOKKOS GPU?

How would I make a fair chart comparing base CPU performance vs KOKKOS GPU performance?

Examples of two clusters:

cpu node: 40 Intel Skylake cores 2.4 GHz
gpu node: 32 IBM Power9 cores, 4 NVIDIA V100-SXM2-32GB (2.6X)

cpu node: 64 AMD Rome 2.4 GHz cores
gpu node: 48 AMD Milan cores, 4 NVIDIA A100-SXM4-40GB (4.0X)

On my local clusters, V100s are billed at 2.6x CPU hours and A100s at 4.8x. That doesn’t make sense to me, because the V100 has 5120 CUDA cores and the A100 has 6912.

What metric(s) should I use, and how should I normalize CPU vs GPU performance in a way that doesn’t cheat either side?
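For concreteness, one normalization I could imagine is performance per charged hour. A minimal sketch (the throughput and wall-time numbers are made-up placeholders, not measurements, and the charge rates only illustrate the multipliers quoted above):

```python
# Minimal sketch: normalize measured throughput by what the run is billed.
# All performance numbers are hypothetical placeholders, not real benchmarks,
# and the charge rates depend entirely on how your center bills resources.

def perf_per_charged_unit(ns_per_day, wall_hours, charge_per_hour):
    """Simulated nanoseconds delivered per charged unit (e.g. core-hour)."""
    ns_simulated = ns_per_day * (wall_hours / 24.0)
    return ns_simulated / (wall_hours * charge_per_hour)

# Hypothetical runs of the same input deck. Example charge rates:
# a 40-core CPU node billed at 40 core-hours per hour; one V100 billed
# at 2.6x and one A100 at 4.8x a single core-hour (as quoted above).
runs = {
    "40-core Skylake node": dict(ns_per_day=10.0, wall_hours=2.0, charge_per_hour=40.0),
    "1x V100 (2.6x)":       dict(ns_per_day=25.0, wall_hours=2.0, charge_per_hour=2.6),
    "1x A100 (4.8x)":       dict(ns_per_day=40.0, wall_hours=2.0, charge_per_hour=4.8),
}

for name, r in runs.items():
    print(f"{name:22s} {perf_per_charged_unit(**r):.3f} ns per charged core-hour")
```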

This is really a topic for Science Talk, since evaluating performance of hardware goes beyond LAMMPS.

The short answer is: there is no fair comparison. Too many factors play a role:

  • how well is the GPU code or the CPU code optimized? how well does either vectorize? On a CPU you can save power budget by not using AVX at all and running with a higher clock boost when the code makes only limited use of AVX anyway.
  • what are the characteristics of the algorithms used, and how well do they map to GPU and CPU architectures (GPUs need many concurrent work units; CPUs are better at complex code)?
  • what level of accuracy are you after? what is required? If mixed precision is sufficient for your purposes, it would be unfair to benchmark in double precision just because the architecture or code you compare against only supports double precision.
  • how well are the machines cooled? What is the corresponding BIOS/Kernel configuration?
  • can you run inside a node or do you need multiple nodes? what is the interconnect?
  • Neither the number of cores nor the clock speed is by itself a measure of performance; the CPU/GPU architecture and the memory bandwidth can also have a significant impact.
  • Do you want maximum throughput (process or produce as much data as possible, even if you have to wait for it) or maximum capability (get a result as quickly as possible)? This point in particular can result in counterintuitive performance characteristics. If money isn’t an issue and you want the absolute fastest result, you are often much better off with CPUs, since they can give you strong scaling down to a much smaller number of atoms per CPU core. GPUs run out of strong scaling sooner, but when they have plenty of data and work units to process they have massive advantages in throughput (that is why they do quite well on the Top500 list; the HPLinpack benchmark is effectively a weak-scaling benchmark: if you have enough RAM you can increase the matrix size and get better parallel efficiency).
  • Some data centers normalize their hardware charges by the cost of the equipment. In a way that can also be a fair comparison (most bang for the buck).
  • If you are running with a batch system, performance also depends on how well you can match your jobs with the scheduling parameters and limitations.
  • How many jobs will fail? This was a big issue back when IBM BlueGene was the big hype hardware. Some of those machines went down remarkably often, and you could not always convince the allocations people to refund you. So you had to factor in how much of your allocation you would lose to crashes and data corruption caused by bugs in the OS, file system, CPUs, network, etc.
  • I probably forgot some items

Most of these issues are open ended, i.e. there is no clear “this or that” answer.

I agree with Axel: there is no fair comparison, so you need to at least be very transparent so that others can renormalize the data if they want. I typically use either 1 GPU vs 1 CPU node (e.g. all 40 cores) or node vs node, especially if I’m projecting or running out to a large fraction of the machines. In theory you could normalize by FLOPs, power, cost, etc., but normally I just do the above.
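As a sketch of the kind of transparent tabulation I mean, assuming you ran the same input deck on one full CPU node and on 1, 2, and 4 GPUs of the GPU node (the wall times below are invented placeholders):

```python
# Sketch: GPU-vs-CPU-node and node-vs-node speedups from measured wall times.
# The wall times below are invented placeholders for the same input deck.

cpu_node_seconds = 1200.0                       # all 40 cores of one CPU node
gpu_seconds = {1: 450.0, 2: 260.0, 4: 160.0}    # KOKKOS runs on 1, 2, 4 GPUs

for ngpu, t in sorted(gpu_seconds.items()):
    print(f"{ngpu} GPU(s): {cpu_node_seconds / t:.2f}x vs one full CPU node")

# Full GPU node (4 GPUs) vs full CPU node:
print(f"node vs node: {cpu_node_seconds / gpu_seconds[4]:.2f}x")
```

Reporting the raw wall times alongside the ratios is what lets readers renormalize by FLOPs, power, or cost themselves.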

The end-user has the most knowledge about budgeting CPU-vs-GPU usage, which varies tremendously between and even within institutions. (For example, if I were sharing a cluster with GenAI researchers, I might never get a job onto the GPU queue.)

Thus, I don’t think there’s much point trying to be too exact about the speedup.

Having said that, you could use something like the performance per MPI process as an additional metric. Handwavingly, if a GPU can run four MPI procs without breaking a sweat then it’s “worth” four CPUs. The computer scientists can immediately tell me how wrong I am, but the end-user will be thankful for an additional piece of data, namely the number of MPI procs they can efficiently put on one GPU.
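A rough sketch of that bookkeeping, with made-up timings: run the same input with 1, 2, 4, … MPI ranks sharing one GPU and look at where the gain per additional rank levels off.

```python
# Sketch: throughput vs number of MPI ranks sharing a single GPU.
# The timesteps/second values are placeholders, not measurements.

steps_per_s = {1: 800.0, 2: 1450.0, 4: 2500.0, 8: 2700.0, 16: 2750.0}

base = steps_per_s[1]
for ranks, rate in sorted(steps_per_s.items()):
    # "worth" in the handwaving CPU-equivalent sense described above
    print(f"{ranks:2d} ranks/GPU: {rate:7.1f} steps/s, "
          f"{rate / base:4.1f}x the 1-rank rate, "
          f"per-rank efficiency {rate / (ranks * base):.2f}")
```

In this made-up data set the total rate stops improving much beyond 4 ranks, so in the handwaving sense above the GPU would be “worth” about four MPI processes.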

You can also measure strong scaling and its inverse:

  • strong scaling asks how the runtime decreases for a fixed-size problem on an increasing number of CPUs / GPUs
  • its inverse asks how the runtime increases for a larger and larger problem on a fixed number of CPUs / GPUs

Better scaling makes a strong case for your code’s excellence.
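A minimal sketch of both measurements (all wall times below are hypothetical placeholders): a fixed problem on an increasing number of GPUs for strong scaling, and a growing problem on a fixed number of GPUs for its inverse.

```python
# Sketch: strong scaling (fixed problem, more devices) and its inverse
# (fixed devices, growing problem). All wall times are hypothetical.

# Strong scaling: fixed problem size, increasing number of GPUs.
strong = {1: 1000.0, 2: 540.0, 4: 300.0, 8: 190.0}   # GPUs -> seconds
t1 = strong[1]
for n, t in sorted(strong.items()):
    speedup = t1 / t
    print(f"strong:  {n} GPUs   speedup {speedup:.2f}   efficiency {speedup / n:.2f}")

# Inverse: fixed number of GPUs, increasing problem size (atom count).
inverse = {1_000_000: 120.0, 2_000_000: 230.0, 4_000_000: 450.0}  # atoms -> seconds
for atoms, t in sorted(inverse.items()):
    # seconds per million atoms; a flat value means cost grows linearly with size
    print(f"inverse: {atoms:>9,d} atoms   {t / atoms * 1e6:.1f} s per million atoms")
```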