acceleration of an inhomogeneous system

Hello everyone,
I have an inhomogeneous system: there is a 30 nm × 30 nm PVA (polyvinyl alcohol) flat floor at the bottom of the box, with 8918 water molecules on top of it. What I am doing is counting the evaporated water molecules, and I will modify the PVA floor to investigate how it changes the evaporation rate. The PVA floor is frozen using fix setforce 0 0 0.
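For reference, here is a minimal sketch of the relevant part of the input; the group definitions and atom types are assumptions for illustration, not my actual script:

    group pva type 1                         # hypothetical: atom type(s) of the PVA floor
    group water type 2 3                     # hypothetical: water atom types
    fix freeze pva setforce 0.0 0.0 0.0      # zero out the forces on the floor atoms
    fix md water nvt temp 300.0 300.0 100.0  # time-integrate only the water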
I ran a test on my personal 4-core computer and the speed was 3.579 timesteps/s. Then I submitted the same job to the HPC cluster at my school, where I can use 1 node with 24 cores. However, the speed is only 6.485 timesteps/s. I know that a full 6× speedup is not realistic, but I was wondering which settings I could change to make it run faster.
My guess is that, since the PVA floor is fixed, it might be possible to put the PVA on, let’s say, 12 cores and the water on the other 12 cores. I looked up the processors and balance commands, but did not find a proper solution. Or is my guess wrong?

Could anyone give me some advice on how to speed this up?

Thanks a lot.

what you are looking for is what is generally referred to as “strong scaling”, i.e. getting a speedup while keeping the system size the same.

there are two issues limiting this:

  • general strong scaling limit of the application. this is usually dictated by amdahl’s law, i.e. by the fraction of the code that cannot be parallelized, and by parallel overhead, i.e. extra work that has to be done when running in parallel. with everything else being optimal (e.g. a homogeneous, dense system), this depends on how much computing effort is needed, i.e. how “expensive” the computational model is (lj/cut is rather cheap, reaxff is rather expensive, see https://lammps.sandia.gov/bench.html#potentials), relative to the amount of overhead. thus there is a limit of a certain number of atoms per processor (core) beyond which there is no more speedup (or even a slowdown). for typical molecular systems, this point tends to be around 500-1000 atoms per CPU core (and much higher when using GPUs). see the worked example after this list.

  • load balance issues, that is, how well the work is distributed across the processors (cores). LAMMPS gives information about this at the end of a log file (min/avg/max times). run a test with an increasing number of processors (e.g. 1, 2, 4, 6, 12, 24) and compare. you seem to have a quasi-2d system, so i would suggest repeating this test with the “processors * * 1” command added to the input (assuming your PVA floor is in the xy-plane). if you still have significant load imbalance, you can try using the balance command to adjust the spatial decomposition so that it follows the distribution of atoms more closely (with default weights); a sketch of both commands follows after this list. you also need to watch in which part of the calculation the time is spent (Pair, Bond, KSpace, Comm, Modify, Other). if KSpace or Comm dominates, then it typically means “game over”. if Modify or Other dominates, then you are spending too much time on non-parallel operations through scripting, or using computes or settings that require additional communication or serialization.
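to make the amdahl’s law point concrete: if a fraction p of the work parallelizes and the rest does not, the best possible speedup on N cores is

    S(N) = \frac{1}{(1-p) + p/N}

with made-up numbers for illustration only: for p = 0.95 you get S(24) ≈ 11, and even on infinitely many cores the speedup saturates at S(∞) = 1/(1-p) = 20.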

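here is a minimal sketch of the two commands mentioned above; the balance thresholds and iteration counts are illustrative guesses, not tuned values:

    # before the simulation box is created: no domain splitting along z
    processors * * 1

    # one-time rebalancing before the run: shift the domain cuts in x and y
    # until the max/avg load ratio drops below 1.1
    balance 1.1 shift xy 10 1.05

    # or rebalance every 1000 steps during the run
    fix lb all balance 1000 1.1 shift xy 10 1.05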
you don’t provide enough details to give more specific advice. what you propose makes little sense, though: LAMMPS distributes atoms across MPI ranks by spatial domain decomposition, not by atom or molecule type, so you cannot assign the pva to one set of cores and the water to another. those kinds of considerations rarely help anyway. first you need to measure, so that you know where your bottleneck(s) is/are. then you can try to address them. but amdahl’s law knows no mercy, and once you reach the strong scaling limit of the code itself, there is little you can do.
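as a side note, if you want a more detailed timing breakdown for this kind of measurement, you can enable it with the timer command; a minimal sketch:

    timer full sync    # report full timing info, synchronized across MPI ranks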

axel