This is a slightly over-the-top description of a performance regression I did not expect to encounter. Maybe I misunderstood something, or maybe there is a low-hanging performance improvement possible, if only for my over-the-top example.

Anyway, to get to it: in an input script such as the following:

```
# compute stresses
compute stressA all stress/atom NULL
compute pA1 all reduce ave c_stressA[1]
compute pA2 all reduce ave c_stressA[2]
compute pA3 all reduce ave c_stressA[3]
compute pA4 all reduce ave c_stressA[4]
compute pA5 all reduce ave c_stressA[5]
compute pA6 all reduce ave c_stressA[6]
variable pA12 equal c_pA1*c_pA1
variable pA22 equal c_pA2*c_pA2
variable pA32 equal c_pA3*c_pA3
variable pA42 equal c_pA4*c_pA4
variable pA52 equal c_pA5*c_pA5
variable pA62 equal c_pA6*c_pA6
variable pxx equal c_pA1*atoms/vol
variable pyy equal c_pA2*atoms/vol
variable pzz equal c_pA3*atoms/vol
variable pxy equal c_pA4*atoms/vol
variable pxz equal c_pA5*atoms/vol
variable pyz equal c_pA6*atoms/vol
variable nxya equal v_pxx-v_pyy
variable nxza equal v_pxx-v_pzz
variable nyza equal v_pyy-v_pzz
variable nxy equal c_pA1-c_pA2
variable nxz equal c_pA1-c_pA3
variable nyz equal c_pA2-c_pA3
fix 5 all ave/correlate/long 1 5000 v_pxx v_pyy v_pzz v_pxy v_pxz v_pyz v_nxya v_nxza v_nyza type auto nlen 16 ncount 2 ncorr 32 file &output-folder/&output-basename-vn.times-atoms-ovol.out.correlate.txt
# compute variance as = sumsq - ave^2
# putting another average on top is not ideal,
# but let's take what we can get with reasonable effort and output, just to get an estimate for now
fix 7 all ave/time 1 2000 2000 c_pA1 c_pA2 c_pA3 c_pA4 c_pA5 c_pA6 v_pA12 v_pA22 v_pA32 v_pA42 v_pA52 v_pA62 c_allMsd0[4] mode scalar ave one file &output-folder/&output-basename-vn.out.avg.txt
```

if I am not mistaken, then according to rough estimates from the measured performance, on each time-step the averaging of the per-atom stresses happens ca. 6\*4 times, and the per-atom stresses themselves are computed ca. 6\*4\*6 times, which is far more than necessary. Adding the values to the thermo output as well doubles those numbers on each thermo output step.
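For illustration, the thermo doubling I mean would come from a line like the following (hypothetical, since my actual `thermo_style` line is not shown above); every `v_*` referenced here is re-evaluated, and with it each underlying `compute reduce`, on every thermo output step:

```
# hypothetical thermo line; each v_* pulls in a c_pA* reduce again
thermo_style custom step temp v_pxx v_pyy v_pzz v_pxy v_pxz v_pyz
```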

I guess this is one of those times where I was just assuming things, and where the "not all ways are equal", "just because it works does not mean it's correct", "bad usage does not mean bad code" (or similar) mantras are relevant.

Naïvely, I assumed the result would be memoized for each time-step. Based on the observed performance, this does not seem to be the case.

While a simple per-timestep memoization of the compute reduce result seems straightforward, I nonetheless should have written the LAMMPS input differently: e.g., `variable pA12 equal c_pA1^2` instead of `variable pA12 equal c_pA1*c_pA1`, and a single `compute pA all reduce ave c_stressA[*]` instead of six separate computes, to reduce the number of computes actually performed.
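Untested, but the consolidated version of the stress part would presumably look something like this (the wildcard reduce makes `c_pA` a global vector, so the references change from `c_pA1` to `c_pA[1]` and so on):

```
# one reduce over all six per-atom stress components at once
compute stressA all stress/atom NULL
compute pA all reduce ave c_stressA[*]

# squares via ^2 instead of repeating the compute reference
variable pA12 equal c_pA[1]^2
variable pA22 equal c_pA[2]^2
variable pA32 equal c_pA[3]^2
variable pA42 equal c_pA[4]^2
variable pA52 equal c_pA[5]^2
variable pA62 equal c_pA[6]^2

variable pxx equal c_pA[1]*atoms/vol
variable pyy equal c_pA[2]*atoms/vol
variable pzz equal c_pA[3]*atoms/vol
variable pxy equal c_pA[4]*atoms/vol
variable pxz equal c_pA[5]*atoms/vol
variable pyz equal c_pA[6]*atoms/vol
```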

What other changes would I have to make to reduce the performance regression introduced by the computes?