Performance optimization for pair_style mliap via Kokkos on an H100 GPU (625 atoms)

Hi everyone,

I am running the mliap pair style in an MD simulation of a phase transition in perovskites, with Kokkos GPU acceleration, but I am not sure whether I am running the code in the most efficient way.

I compiled LAMMPS following the instructions here: https://mace-docs.readthedocs.io/en/latest/guide/lammps_mliap.html.

To study the phase transition, my script (attached below) first equilibrates the system (625 atoms), then slowly heats it, and then cools it back down.

in.bto (3.3 KB)

To run it, I used this command:

srun lmp -k on g 1 -sf kk -pk kokkos newton on neigh half -in in.bto

I ran the job for 3 hours on an H100 GPU to gauge the throughput, and at the current rate the simulation would take far longer than that to finish. Am I making some mistake in how I launch the run that is causing the slowdown?

This was the output file:

BTO_5x5_40131332.out (12.1 KB)

Performance: 0.146 ns/day, 164.428 hours/ns, 1.689 timesteps/s, 1.056 katom-step/s
99.4% CPU use with 1 MPI tasks x 1 OpenMP threads
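To put that throughput in perspective, the reported rate converts directly into an estimated wall time. A minimal sketch (the 1,000,000-step total below is a hypothetical placeholder; the actual run length is set in the attached in.bto):

```python
# Estimate total wall time from the reported LAMMPS throughput.
# 1.689 timesteps/s is taken from the log above; the step count
# is a hypothetical placeholder, not read from in.bto.
timesteps_per_s = 1.689
total_steps = 1_000_000  # hypothetical run length

wall_s = total_steps / timesteps_per_s
wall_days = wall_s / 86400
print(f"{wall_days:.1f} days")  # roughly a week at this rate
```

So even a modest run length is far beyond a 3-hour allocation at this rate.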

Thanks

Dominic

The MACE pair style is developed and maintained by the MACE developers and is not part of LAMMPS. You have to contact the MACE developers with questions about their code.

Thanks for your reply. I mainly wanted to ask whether this is the right way to use the LAMMPS Kokkos GPU acceleration:

srun lmp -k on g 1 -sf kk -pk kokkos newton on neigh half -in in.bto

Or is there some way to improve the parallelisation/efficiency of the script?

I will definitely post this in the MACE discussions as well.

I can only repeat what is written in the manual. Please have a look yourself: 7.4.3. KOKKOS package — LAMMPS documentation

The main source of efficiency when using a single GPU lies in the implementation, and for that you need to talk to the developers of that pair style. I know, for example, that the SNAP package in LAMMPS (also a machine-learning pair style) has seen extensive optimizations, resulting in a substantial speedup over the original straightforward implementation. I don’t know whether the same has happened for MACE. Best is to look at some published benchmark numbers, run the same input, and compare.
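When comparing runs with different settings, the "Performance:" line that LAMMPS prints at the end of a run is the number to track. A minimal sketch of pulling it out of a log programmatically (the helper name is my own; it only assumes the standard LAMMPS log format shown above):

```python
import re

# Extract ns/day and timesteps/s from a LAMMPS "Performance:" line so
# runs with different Kokkos settings can be compared side by side.
PERF_RE = re.compile(
    r"Performance:\s*([\d.]+)\s*ns/day.*?([\d.]+)\s*timesteps/s"
)

def parse_performance(log_text):
    """Return (ns_per_day, timesteps_per_s) from LAMMPS log output."""
    m = PERF_RE.search(log_text)
    if m is None:
        raise ValueError("no Performance line found")
    return float(m.group(1)), float(m.group(2))

# Example with the line reported earlier in this thread:
line = ("Performance: 0.146 ns/day, 164.428 hours/ns, "
        "1.689 timesteps/s, 1.056 katom-step/s")
print(parse_performance(line))  # (0.146, 1.689)
```

Running the same input with each command-line variant and comparing these two numbers is the most direct way to see whether a setting change actually helps.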

1 Like