Hi, I’m tracking down an issue where printing out the thermo info slows down LAMMPS considerably on our Grace Hopper system compared to the same run on an A100 Milan system. The case is based on the 0_nano benchmark from the OLCF-6 benchmark suite.
The essential difference is that we are using the processors command, “processors * * * grid numa”.
At the end of the step loop in VerletKokkos::run() there is a call, from only one of the four hosts, to
atomKK->sync(Host,ALL_MASK);
which eventually goes through Kokkos::Impl::DeepCopyCuda(). That does a DtoH cudaMemcpy that takes a few seconds, which makes the MPI_Allreduce on the other processes wait.
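Not a LAMMPS change, but to isolate the transfer itself it may help to time a plain pageable-memory DtoH copy outside the application. The sketch below (file name, buffer size, and build line are placeholders of mine, not from the benchmark) can be run under different numactl bindings to see whether the slow sync is really just a badly placed host allocation.

```cpp
// dtoh_bench.cu -- hypothetical standalone microbenchmark, not part of LAMMPS.
// Build:  nvcc -O2 dtoh_bench.cu -o dtoh_bench
// Run under different bindings, e.g. numactl --cpunodebind=0 --membind=0 ./dtoh_bench
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = size_t(1) << 30;   // 1 GiB, a stand-in for the per-rank atom data

    void* hbuf = std::malloc(bytes);        // pageable host memory: its NUMA placement
                                            // is what --membind controls
    void* dbuf = nullptr;
    cudaMalloc(&dbuf, bytes);
    cudaMemset(dbuf, 0, bytes);

    // warm-up copy so first-touch page faults and driver overhead are not timed
    cudaMemcpy(hbuf, dbuf, bytes, cudaMemcpyDeviceToHost);

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(hbuf, dbuf, bytes, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    std::printf("DtoH: %.3f s  (%.1f GB/s)\n", s, bytes / s / 1e9);

    cudaFree(dbuf);
    std::free(hbuf);
    return 0;
}
```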
Is this a known issue? If so, are there any workarounds, besides changing the thermo frequency?
Ultimately, this is a question for @stamoor.
But in the meantime, please let us know exactly which version of LAMMPS you are using and how you compiled it.
@gkedziora For thermo output in LAMMPS, one could be dumping out atom coordinates, forces, etc., and the KOKKOS package doesn’t know what needs to be transferred from the GPU to the host CPU, so we currently have to transfer everything (all atom data). In a real use case, a user typically only needs thermo output every 1000 or 10000 timesteps, so the cost of the transfer is amortized over that many timesteps. There is currently no way around this, since the dump files are not Kokkos-aware. That said, I’m surprised GH is slower than A100, because as you saw it is just a DtoH cudaMemcpy.
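For illustration only (this is a plain Kokkos::DualView toy, not LAMMPS source), the modify/sync bookkeeping behind a host sync looks roughly like this: because the device copy is marked modified every step, any request for host access has to pay for a full device-to-host deep copy of that data.

```cpp
#include <Kokkos_Core.hpp>
#include <Kokkos_DualView.hpp>

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    // stand-in for a single per-atom array; LAMMPS keeps many of these
    Kokkos::DualView<double*> x("x", 1000000);

    // a device kernel updates the data every timestep
    auto d_x = x.view_device();
    Kokkos::parallel_for("step", d_x.extent(0),
                         KOKKOS_LAMBDA(const int i) { d_x(i) += 1.0; });
    x.modify_device();   // mark the device copy as the most recent one

    // a thermo/dump step needs host access: sync_host() triggers the
    // device-to-host deep copy only because the device side was modified
    x.sync_host();
    auto h_x = x.view_host();
    (void)h_x(0);        // host-side read of the freshly copied data
  }
  Kokkos::finalize();
  return 0;
}
```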
@gkedziora I talked to Evan Weinberg from NVIDIA on the LAMMPS developer Slack about this issue, and this is what he said:
“The issue may be process/memory bindings. So they’ll need some type of numactl --cpunodebind=## --membind=## , where the --membind is more important, it makes sure all host allocations live in the right place. Generally on a GH node (be it one with a single GH or 4xGH) the membind and cpunodebind matches the GPU number, but that’s best confirmed by running nvidia-smi topo -m on the machine which shows such things including NUMA domains. The GPU itself is also a numa domain (i.e. you can malloc to it) but 99.99999% of the time you want things allocated to the host, like here. On 4xGH nodes I’ve seen copies from, say, GPU 1 going to CPU 0 being an absolute disaster due to the extra hop. In theory a well-configured SLURM setup will handle this type of thing auto-magically, but I’ve learned to trust SLURM zero percent of the time. If you can manually manage separate host and device allocations and copies between them and you’re fine with the theoretically doubled memory overheads, it’ll nearly always perform better out of the box. And if something’s configured to work well in Kokkos on a non-GH system, you’re already set up for it to run well on GH.”
A note from myself: this is what we currently do in LAMMPS (the opposite of what I suggested trying above):
“If you can manually manage separate host and device allocations and copies between them and you’re fine with the theoretically doubled memory overheads, it’ll nearly always perform better out of the box.”
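To make that last point concrete, here is a hypothetical side-by-side sketch (not LAMMPS source; names and sizes are made up) of the two strategies being contrasted: separate host and device allocations with explicit copies, versus a single managed allocation whose placement and migration are left to the driver.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

// "Separate host and device allocations": the host copy is placed by the
// process's memory policy (numactl --membind), the device copy lives on the GPU,
// and transfers happen only where the code explicitly asks for them.
void separate_allocations(size_t n) {
    double* h = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* d = nullptr;
    cudaMalloc(&d, n * sizeof(double));
    cudaMemset(d, 0, n * sizeof(double));

    cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);  // explicit DtoH
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);  // explicit HtoD

    cudaFree(d);
    std::free(h);
}

// Single managed allocation: one pointer usable on host and device, but page
// placement and migration are decided by the driver rather than by the code.
void single_managed_allocation(size_t n) {
    double* u = nullptr;
    cudaMallocManaged(&u, n * sizeof(double));
    cudaFree(u);
}

int main() {
    separate_allocations(1 << 20);
    single_managed_allocation(1 << 20);
    return 0;
}
```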
Yes! That did the trick! We have one GPU per node on our Grace Hopper system. I used ‘numactl --cpunodebind 0 --membind 0’. For 8 nodes, the table below demonstrates the difference.
| numactl | cblk | fom  | wall  | comp  | comm  | p2p  | send | coll | allredu |
|---------|------|------|-------|-------|-------|------|------|------|---------|
| no      | no   | 43.7 | 218.1 | 103.4 | 114.5 | 61.0 | 51.8 | 45.6 | 32.6    |
| no      | no   | 42.4 | 213.9 | 102.6 | 111.2 | 55.5 | 51.2 | 48.1 | 36.1    |
| no      | no   | 42.0 | 215.2 | 103.7 | 111.4 | 52.9 | 52.3 | 50.7 | 35.7    |
| no      | no   | 37.9 | 241.9 | 112.7 | 129.1 | 68.6 | 63.5 | 52.7 | 43.4    |
| no      | no   | 30.5 | 260.7 | 123.6 | 136.9 | 55.4 | 54.6 | 73.9 | 62.1    |
| no      | no   | 40.0 | 221.4 | 104.1 | 117.2 | 54.9 | 50.7 | 54.6 | 41.3    |
| no      | no   | 42.7 | 213.8 | 102.0 | 111.6 | 57.4 | 54.8 | 46.4 | 34.8    |
| yes     | no   | 75.7 | 158.2 | 88.0  | 70.1  | 47.0 | 46.8 | 16.0 | 6.3     |
| yes     | no   | 75.7 | 156.4 | 87.0  | 69.4  | 47.1 | 47.0 | 15.4 | 5.7     |
| yes     | no   | 76.9 | 158.1 | 90.0  | 68.0  | 46.6 | 46.5 | 14.7 | 4.7     |
| yes     | yes  | 75.6 | 157.1 | 90.5  | 66.0  | 47.0 | 46.8 | 0.1  | 0.0     |
| yes     | no   | 75.9 | 155.2 | 86.9  | 67.5  | 46.8 | 46.6 | 14.1 | 5.2     |
FOM = figure of merit (Matom steps/sec)
wall = wall time (all times in seconds)
comp = computation time
comm = communication time
p2p = point-to-point communication time
send = send time
coll = collective time
allredu = MPI_Allreduce time
cblk = barriers placed around collective MPI calls by mpiprof (see the note below)
The NAS profiler that generates this analysis, mpiprof, can put barriers around collective MPI calls. When that is enabled (the “cblk = yes” row above), the MPI_Allreduce time drops to a few milliseconds, comparable to the A100 Milan system.
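For context on what that measurement change means (the code below is a toy of mine, not mpiprof internals): a barrier placed just before the collective absorbs the rank-to-rank skew, so the Allreduce itself is timed as nearly free.

```cpp
// Toy illustration: with the barrier enabled, the skew from a slow rank
// (e.g. one still finishing a DtoH copy) shows up in MPI_Barrier, and the
// MPI_Allreduce measures only the cost of the reduction itself.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank + 1.0, global = 0.0;

    double t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);          // soaks up any load imbalance between ranks
    double t1 = MPI_Wtime();
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t2 = MPI_Wtime();

    if (rank == 0)
        std::printf("barrier %.6f s, allreduce %.6f s, sum %.1f\n",
                    t1 - t0, t2 - t1, global);

    MPI_Finalize();
    return 0;
}
```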