Grace Hopper thermo output slow

Hi, I’m tracking down an issue where printing out the thermo info slows down LAMMPS considerably compared to the same run on an A100/Milan system. The case is based on the 0_nano benchmark from the OLCF-6 benchmark suite:

https://code.ornl.gov/olcf-6_benchmarks/header/-/raw/release-1.0.0/OLCF-6_LAMMPS_benchmark.tar.gz

The essential difference is that we are using the processors command, “processors * * * grid numa”.

At the end of the step loop in VerletKokkos::run() there is a call from only one of the hosts (out of four) to

atomKK->sync(Host,ALL_MASK);

That eventually goes through Kokkos::Impl::DeepCopyCuda(). It performs a DtoH cudaMemcpy that takes a few seconds, which makes the MPI_Allreduce on the other processes wait.
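
For concreteness, here is a minimal stand-alone Kokkos sketch of the pattern that sync boils down to (illustrative only, not the actual LAMMPS code; the view name and atom count are made up): a device view is copied into its host mirror, and on the CUDA backend that deep_copy is exactly a DtoH cudaMemcpy.

```cpp
// Minimal Kokkos sketch (not LAMMPS code) of a device-to-host sync.
#include <Kokkos_Core.hpp>

int main(int argc, char **argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nlocal = 1000000;                  // hypothetical atom count
    // Device-resident positions (CudaSpace when built with the CUDA backend)
    Kokkos::View<double *[3]> x("x", nlocal);
    // Host-side mirror of the same data
    auto h_x = Kokkos::create_mirror_view(x);
    // On the CUDA backend this deep_copy is the DtoH cudaMemcpy seen in the profile
    Kokkos::deep_copy(h_x, x);
  }
  Kokkos::finalize();
  return 0;
}
```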

Is this a known issue? If so, are there any workarounds, besides changing the thermo frequency?

At any rate, we should get to the bottom of this.

Thanks,
Gary

Ultimately, this is a question for @stamoor
But in the meantime, please let us know exactly which version of LAMMPS you are using and how you compiled it.

Yes, I forgot to mention that. It is stable_29Aug2024_update1, the stable release from October 1, 2024.

Can you please try the 4 Feb 2025 feature release and see if the issue persists?
Thanks.

@gkedziora for thermo output in LAMMPS, one could be dumping out atom coordinates, forces, etc., and the KOKKOS package doesn’t know what needs to be transferred from the GPU to the host CPU, so we currently have to transfer everything (all atom data). For a real use case, a user typically only needs thermo output every 1000 or 10000 timesteps, so the cost of the transfer is amortized over that many timesteps. There is currently no way around this, since the dump files are not Kokkos-aware. That said, I’m surprised GH is slower than the A100, because as you saw it is just a DtoH cudaMemcpy.

There is an option to use unified memory on GH that could potentially help: Kokkos_ENABLE_IMPL_CUDA_UNIFIED_MEMORY. However, we likely need to tweak src/KOKKOS/kokkos_type.h (https://github.com/lammps/lammps/blob/52f068d1c57460d3218c0a0dabd312c15736dd58/src/KOKKOS/kokkos_type.h), similar to what we did for UVM. I haven’t tried this yet but would be very curious if it helps.
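
To illustrate the idea, here is a generic Kokkos sketch (an assumption about the approach, not the actual kokkos_type.h change; the view name and atom count are made up). With a managed/unified memory space the host can dereference device data directly, so the explicit host mirror and bulk deep_copy disappear and data moves by page migration on access instead:

```cpp
// Generic sketch of host access through a unified (managed) allocation.
// Requires a CUDA-enabled Kokkos build; not the actual LAMMPS change.
#include <Kokkos_Core.hpp>

int main(int argc, char **argv) {
  Kokkos::initialize(argc, argv);
  {
    const int nlocal = 1000000;   // hypothetical atom count
    // Managed (unified) allocation, visible to both host and device
    Kokkos::View<double *[3], Kokkos::CudaUVMSpace> x("x", nlocal);

    // Touch the data on the device
    Kokkos::parallel_for("init", nlocal, KOKKOS_LAMBDA(const int i) {
      x(i, 0) = x(i, 1) = x(i, 2) = 0.0;
    });
    Kokkos::fence();

    // Host access without an explicit deep_copy; pages migrate on demand
    double x0 = x(0, 0);
    (void)x0;
  }
  Kokkos::finalize();
  return 0;
}
```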

@gkedziora I talked to Evan Weinberg from NVIDIA on the LAMMPS developer Slack about this issue, and this is what he said:

“The issue may be process/memory bindings. So they’ll need some type of numactl --cpunodebind=## --membind=##, where the --membind is more important, it makes sure all host allocations live in the right place. Generally on a GH node (be it one with a single GH or 4xGH) the membind and cpunodebind matches the GPU number, but that’s best confirmed by running nvidia-smi topo -m on the machine which shows such things including NUMA domains. The GPU itself is also a numa domain (i.e. you can malloc to it) but 99.99999% of the time you want things allocated to the host, like here. On 4xGH nodes I’ve seen copies from, say, GPU 1 going to CPU 0 being an absolute disaster due to the extra hop. In theory a well-configured SLURM setup will handle this type of thing auto-magically, but I’ve learned to trust SLURM zero percent of the time. If you can manually manage separate host and device allocations and copies between them and you’re fine with the theoretically doubled memory overheads, it’ll nearly always perform better out of the box. And if something’s configured to work well in Kokkos on a non-GH system, you’re already set up for it to run well on GH.”

A note from me: this is what we currently do in LAMMPS (the opposite of what I suggested trying above):

“If you can manually manage separate host and device allocations and copies between them and you’re fine with the theoretically doubled memory overheads, it’ll nearly always perform better out of the box.”

Yes! That did the trick! We have one GPU per node on our Grace Hopper system. I used ‘numactl --cpunodebind 0 --membind 0’. For 8 nodes, this table demonstrates the difference.

numactl  cblk  FOM   wall   comp   comm   p2p   send  coll  allreduce
no       no    43.7  218.1  103.4  114.5  61.0  51.8  45.6  32.6
no       no    42.4  213.9  102.6  111.2  55.5  51.2  48.1  36.1
no       no    42.0  215.2  103.7  111.4  52.9  52.3  50.7  35.7
no       no    37.9  241.9  112.7  129.1  68.6  63.5  52.7  43.4
no       no    30.5  260.7  123.6  136.9  55.4  54.6  73.9  62.1
no       no    40.0  221.4  104.1  117.2  54.9  50.7  54.6  41.3
no       no    42.7  213.8  102.0  111.6  57.4  54.8  46.4  34.8
yes      no    75.7  158.2   88.0   70.1  47.0  46.8  16.0   6.3
yes      no    75.7  156.4   87.0   69.4  47.1  47.0  15.4   5.7
yes      no    76.9  158.1   90.0   68.0  46.6  46.5  14.7   4.7
yes      yes   75.6  157.1   90.5   66.0  47.0  46.8   0.1   0.0
yes      no    75.9  155.2   86.9   67.5  46.8  46.6  14.1   5.2

numactl = whether the run was launched under the numactl binding shown above
cblk = whether mpiprof put barriers around collective MPI calls (see note below)
FOM = figure of merit, in Matom-steps/sec
wall = wall time in seconds (all times in seconds)
comp = computation time
comm = communication time
p2p = point-to-point communication time
send = send time
coll = collective time
allreduce = MPI_Allreduce time

The NAS profiler that generates this analysis, mpiprof, can put barriers around collective MPI calls (the cblk column above). When that is enabled, the MPI_Allreduce time drops to a few milliseconds, comparable to the Milan/A100 system.
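
For anyone unfamiliar with that option, the effect is roughly the following (an illustrative sketch, not mpiprof’s actual instrumentation): the extra barrier absorbs the load imbalance, so the time attributed to MPI_Allreduce itself reflects only the collective, not the wait for the rank stuck in the slow DtoH copy.

```cpp
// Sketch: timing an MPI_Allreduce with a barrier in front of it.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  double local = 1.0, global = 0.0;

  // The barrier absorbs load imbalance (e.g., a rank still busy with a
  // slow device-to-host copy), so it shows up as barrier/wait time ...
  MPI_Barrier(MPI_COMM_WORLD);

  // ... and the interval below measures only the collective itself.
  double t0 = MPI_Wtime();
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  double t1 = MPI_Wtime();

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    std::printf("allreduce: %.6f s, sum = %g\n", t1 - t0, global);

  MPI_Finalize();
  return 0;
}
```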
