Bad performance when re-using computes for multiple simple formulas

GenieTim · May 30, 2023, 4:38pm

This is going to be a slightly over-the-top description of a performance regression I did not expect to encounter. Maybe I misunderstood something, maybe there is a low-hanging performance improvement possible, or maybe it only matters for my over-the-top example.

Anyway, to get to the point: in an input script such as the following:

# compute stresses
compute stressA all stress/atom NULL

compute pA1 all reduce ave c_stressA[1]
compute pA2 all reduce ave c_stressA[2]
compute pA3 all reduce ave c_stressA[3]
compute pA4 all reduce ave c_stressA[4]
compute pA5 all reduce ave c_stressA[5]
compute pA6 all reduce ave c_stressA[6]

variable pA12 equal c_pA1*c_pA1
variable pA22 equal c_pA2*c_pA2
variable pA32 equal c_pA3*c_pA3
variable pA42 equal c_pA4*c_pA4
variable pA52 equal c_pA5*c_pA5
variable pA62 equal c_pA6*c_pA6

variable pxx equal c_pA1*atoms/vol
variable pyy equal c_pA2*atoms/vol
variable pzz equal c_pA3*atoms/vol
variable pxy equal c_pA4*atoms/vol
variable pxz equal c_pA5*atoms/vol
variable pyz equal c_pA6*atoms/vol

variable nxya equal v_pxx-v_pyy
variable nxza equal v_pxx-v_pzz
variable nyza equal v_pyy-v_pzz

variable nxy equal c_pA1-c_pA2
variable nxz equal c_pA1-c_pA3
variable nyz equal c_pA2-c_pA3

fix 5 all ave/correlate/long 1 5000 v_pxx v_pyy v_pzz v_pxy v_pxz v_pyz v_nxya v_nxza v_nyza type auto nlen 16 ncount 2 ncorr 32 file &output-folder/&output-basename-vn.times-atoms-ovol.out.correlate.txt
# compute variance as = sumsq - ave^2
# putting another average on top is not ideal, 
# but let's take what we can get with reasonable effort and output, just to get an estimate for now
fix 7 all ave/time 1 2000 2000  c_pA1 c_pA2 c_pA3 c_pA4 c_pA5 c_pA6 v_pA12 v_pA22 v_pA32 v_pA42 v_pA52 v_pA62 c_allMsd0[4] mode scalar ave one file &output-folder/&output-basename-vn.out.avg.txt

if I am not mistaken (judging by rough estimates from the measured performance), on each time step the averaging of the per-atom stresses happens about 64 times, and the per-atom stresses are computed 64*6 times, which is far more than necessary. Adding the values to the thermo output as well doubles that number on each thermo output step.

I guess this is one of those times where I was just assuming things, and where mantras such as “not all ways are equal”, “just because it works does not mean it's correct”, and “bad usage does not mean bad code” (or similar) are relevant.

Naïvely, I assumed the result of each compute would be memoized for each time step. This does not seem to be the case, based on the observed performance regression.

While a simple memoization keyed on the time step in compute reduce seems straightforward, I nonetheless should have written the LAMMPS input differently: e.g., variable pA12 equal c_pA1^2 instead of variable pA12 equal c_pA1*c_pA1, and a single compute pA all reduce ave c_stressA[*] instead of the six separate computes, to reduce the number of compute evaluations actually performed.
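A minimal, untested sketch of that rewrite. It assumes that the wildcard form of compute reduce yields a six-element global vector whose entries are referenced as c_pA[1] to c_pA[6]. Whether this actually lowers the number of evaluations per step is not confirmed here:

```
# per-atom stress tensor, unchanged
compute stressA all stress/atom NULL

# one reduce over all six columns instead of six separate computes;
# assumption: the result is a global vector of length 6
compute pA all reduce ave c_stressA[*]

# squares via the power operator, so each formula references the compute once
variable pA12 equal c_pA[1]^2
variable pA22 equal c_pA[2]^2
variable pA32 equal c_pA[3]^2
variable pA42 equal c_pA[4]^2
variable pA52 equal c_pA[5]^2
variable pA62 equal c_pA[6]^2

# pressure components now index the vector compute
variable pxx equal c_pA[1]*atoms/vol
variable pyy equal c_pA[2]*atoms/vol
variable pzz equal c_pA[3]*atoms/vol
```

The remaining variables (pxy, pxz, pyz and the normal-stress differences) would follow the same pattern.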

What other changes would I have to make to reduce the performance regression introduced by the computes?

akohlmey · May 30, 2023, 4:46pm

Global data can be cached with fix ave/time with Nrepeat set to 1.
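A minimal, untested sketch of how this could be applied to the script in the question. The fix ID pcache is a hypothetical name. The sketch assumes that the fix is defined before any fix that consumes the variables, so that the cached values are current when they are read:

```
# cache the six reduced stress values once per step
# (Nevery = 1, Nrepeat = 1, Nfreq = 1); with several values in
# scalar mode the fix provides them as a global vector
fix pcache all ave/time 1 1 1 c_pA1 c_pA2 c_pA3 c_pA4 c_pA5 c_pA6

# variables read the cached values from the fix instead of the computes
variable pA12 equal f_pcache[1]^2
variable pxx equal f_pcache[1]*atoms/vol
variable pyy equal f_pcache[2]*atoms/vol
variable pzz equal f_pcache[3]*atoms/vol

# derived quantities are built from the cached variables as before
variable nxya equal v_pxx-v_pyy
variable nxza equal v_pxx-v_pzz
variable nyza equal v_pyy-v_pzz
```

The idea is that each compute is then evaluated once per step by the caching fix, and every formula that needs the value reads f_pcache[...] rather than triggering the compute again.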