Hi, I’m tracking down an issue where printing out the thermo info slows down LAMMPS considerably on our Grace Hopper system compared to the same run on an A100 Milan system. The case is based on the 0_nano benchmark from the OLCF-6 benchmark suite.
The essential difference is that we are using the processors command, “processors * * * grid numa”.
At the end of the step loop in VerletKokkos::run() there is a call, from only one of the four hosts, to
atomKK->sync(Host,ALL_MASK);
which eventually goes through Kokkos::Impl::DeepCopyCuda(). That does a DtoH cudaMemcpy that takes a few seconds, which makes the MPI_Allreduce on the other processes wait.
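Not a LAMMPS change, but to isolate the transfer itself it may help to time a plain pageable-memory DtoH copy outside the application. The sketch below (file name, buffer size, and build line are placeholders of mine, not from the benchmark) can be run under different numactl bindings to see whether the slow sync is really just a badly placed host allocation.

```cpp
// dtoh_bench.cu -- hypothetical standalone microbenchmark, not part of LAMMPS.
// Build:  nvcc -O2 dtoh_bench.cu -o dtoh_bench
// Run under different bindings, e.g. numactl --cpunodebind=0 --membind=0 ./dtoh_bench
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;   // 1 GiB, a stand-in for the per-rank atom data

    void* hbuf = std::malloc(bytes);        // pageable host memory: its NUMA placement
                                            // is what --membind controls
    void* dbuf = nullptr;
    cudaMalloc(&dbuf, bytes);
    cudaMemset(dbuf, 0, bytes);

    // warm-up copy so first-touch page faults and driver overhead are not timed
    cudaMemcpy(hbuf, dbuf, bytes, cudaMemcpyDeviceToHost);

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(hbuf, dbuf, bytes, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("DtoH: %.3f s  (%.1f GB/s)\n", s, bytes / s / 1e9);

    cudaFree(dbuf);
    std::free(hbuf);
    return 0;
}
```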
Is this a known issue? If so, are there any workarounds, besides changing the thermo frequency?
Ultimately, this is a question for @stamoor.
But in the meantime, please let us know exactly which version of LAMMPS you are using and how you compiled it.
@gkedziora For thermo output in LAMMPS, one could be dumping out atom coordinates, forces, etc., and the KOKKOS package doesn’t know what needs to be transferred from the GPU to the host CPU, so we currently have to transfer everything (all atom data). In a real use case, a user typically only needs thermo output every 1000 or 10000 timesteps, so the cost of the transfer is amortized over that many timesteps. There is currently no way around this, since the dump files are not Kokkos-aware. That said, I’m surprised GH is slower than A100, because as you saw it is just a DtoH cudaMemcpy.
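For illustration only (this is a plain Kokkos::DualView toy, not LAMMPS source), the modify/sync bookkeeping behind a host sync looks roughly like this: because the device copy is marked modified every step, any request for host access has to pay for a full device-to-host deep copy of that data.

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    // stand-in for a single per-atom array; LAMMPS keeps many of these
    Kokkos::DualView<double*> x("x", 1000000);

    // a device kernel updates the data every timestep
    auto d_x = x.view_device();
    Kokkos::parallel_for("step", d_x.extent(0),
                         KOKKOS_LAMBDA(const int i) { d_x(i) += 1.0; });
    x.modify_device();   // mark the device copy as the most recent one

    // a thermo/dump step needs host access: sync_host() triggers the
    // device-to-host deep copy only because the device side was modified
    x.sync_host();
    auto h_x = x.view_host();
    (void)h_x(0);        // host-side read of the freshly copied data
  }
  Kokkos::finalize();
  return 0;
}
```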
@gkedziora I talked to Evan Weinberg from NVIDIA on the LAMMPS developer Slack about this issue, and this is what he said:
“The issue may be process/memory bindings. So they’ll need some type of numactl --cpunodebind=## --membind=## , where the --membind is more important, it makes sure all host allocations live in the right place. Generally on a GH node (be it one with a single GH or 4xGH) the membind and cpunodebind matches the GPU number, but that’s best confirmed by running nvidia-smi topo -m on the machine which shows such things including NUMA domains. The GPU itself is also a numa domain (i.e. you can malloc to it) but 99.99999% of the time you want things allocated to the host, like here. On 4xGH nodes I’ve seen copies from, say, GPU 1 going to CPU 0 being an absolute disaster due to the extra hop. In theory a well-configured SLURM setup will handle this type of thing auto-magically, but I’ve learned to trust SLURM zero percent of the time. If you can manually manage separate host and device allocations and copies between them and you’re fine with the theoretically doubled memory overheads, it’ll nearly always perform better out of the box. And if something’s configured to work well in Kokkos on a non-GH system, you’re already set up for it to run well on GH.”
A note from myself: this is what we currently do in LAMMPS (the opposite of what I suggested trying above):
“If you can manually manage separate host and device allocations and copies between them and you’re fine with the theoretically doubled memory overheads, it’ll nearly always perform better out of the box.”
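To make that last point concrete, here is a hypothetical side-by-side sketch (not LAMMPS source; names and sizes are made up) of the two strategies being contrasted: separate host and device allocations with explicit copies, versus a single managed allocation whose placement and migration are left to the driver.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

// "Separate host and device allocations": the host copy is placed by the
// process's memory policy (numactl --membind), the device copy lives on the GPU,
// and transfers happen only where the code explicitly asks for them.
void separate_allocations(size_t n) {
    double* h = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* d = nullptr;
    cudaMalloc(&d, n * sizeof(double));
    cudaMemset(d, 0, n * sizeof(double));

    cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);  // explicit DtoH
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);  // explicit HtoD

    cudaFree(d);
    std::free(h);
}

// Single managed allocation: one pointer usable on host and device, but page
// placement and migration are decided by the driver rather than by the code.
void single_managed_allocation(size_t n) {
    double* u = nullptr;
    cudaMallocManaged(&u, n * sizeof(double));
    cudaFree(u);
}

int main() {
    separate_allocations(1 << 20);
    single_managed_allocation(1 << 20);
    return 0;
}
```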
Yes! That did the trick! We have one GPU per node on our Grace Hopper system. I used ‘numactl --cpunodebind 0 --membind 0’. For 8 nodes, the table below demonstrates the difference.
| numactl | cblk | fom  | wall  | comp  | comm  | p2p  | send | coll | allredu |
|---------|------|------|-------|-------|-------|------|------|------|---------|
| no      | no   | 43.7 | 218.1 | 103.4 | 114.5 | 61.0 | 51.8 | 45.6 | 32.6    |
| no      | no   | 42.4 | 213.9 | 102.6 | 111.2 | 55.5 | 51.2 | 48.1 | 36.1    |
| no      | no   | 42.0 | 215.2 | 103.7 | 111.4 | 52.9 | 52.3 | 50.7 | 35.7    |
| no      | no   | 37.9 | 241.9 | 112.7 | 129.1 | 68.6 | 63.5 | 52.7 | 43.4    |
| no      | no   | 30.5 | 260.7 | 123.6 | 136.9 | 55.4 | 54.6 | 73.9 | 62.1    |
| no      | no   | 40.0 | 221.4 | 104.1 | 117.2 | 54.9 | 50.7 | 54.6 | 41.3    |
| no      | no   | 42.7 | 213.8 | 102.0 | 111.6 | 57.4 | 54.8 | 46.4 | 34.8    |
| yes     | no   | 75.7 | 158.2 | 88.0  | 70.1  | 47.0 | 46.8 | 16.0 | 6.3     |
| yes     | no   | 75.7 | 156.4 | 87.0  | 69.4  | 47.1 | 47.0 | 15.4 | 5.7     |
| yes     | no   | 76.9 | 158.1 | 90.0  | 68.0  | 46.6 | 46.5 | 14.7 | 4.7     |
| yes     | yes  | 75.6 | 157.1 | 90.5  | 66.0  | 47.0 | 46.8 | 0.1  | 0.0     |
| yes     | no   | 75.9 | 155.2 | 86.9  | 67.5  | 46.8 | 46.6 | 14.1 | 5.2     |
FOM = figure of merit (Matom steps/sec)
wall = wall time (all times in seconds)
comp = computation time
comm = communication time
p2p = point-to-point communication time
send = send time
coll = collective time
allredu = MPI_Allreduce time
cblk = barriers placed around collective MPI calls by mpiprof (see the note below)
The NAS profiler that generates this analysis, mpiprof, can put barriers around collective MPI calls. When that is enabled (the “cblk = yes” row above), the MPI_Allreduce time drops to a few milliseconds, comparable to the A100 Milan system.
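For context on what that measurement change means (the code below is a toy of mine, not mpiprof internals): a barrier placed just before the collective absorbs the rank-to-rank skew, so the Allreduce itself is timed as nearly free.

```cpp
// Toy illustration: with the barrier enabled, the skew from a slow rank
// (e.g. one still finishing a DtoH copy) shows up in MPI_Barrier, and the
// MPI_Allreduce measures only the cost of the reduction itself.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);          // soaks up any load imbalance between ranks
    double t1 = MPI_Wtime();
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t2 = MPI_Wtime();

    if (rank == 0)
        std::printf("barrier %.6f s, allreduce %.6f s, sum %.1f\n",
                    t1 - t0, t2 - t1, global);

    MPI_Finalize();
    return 0;
}
```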