How to reduce kspace timing %

Dear lammps users,

I’m currently using LAMMPS version 16 Mar 2018 to run a water droplet-silica simulation.

I’m using 10 nodes with 40 CPUs per node to run a 10 nm x 10 nm x 10 nm system for a 1 ns simulation, and the simulation cannot finish within 96 hours.

I suspect this is because the kspace timing % is too large.

What should I do to reduce the timing %?

Log file:
lammps.o6384611 (139.9 KB)
Job script:
lmpsub.sh (614 Bytes)
Input scripts:
spce.mol (618 Bytes)
step7_minimize.data (4.2 MB)
step8_final.inp (709 Bytes)

Sincerely,
Roger

Hi Roger, thanks for uploading all your files. I happened to be in the mood to look through all of them, and there are many issues I could identify (and probably even more that I couldn’t in a hurry).

Please either ask your cluster administrators for help*, or find a local experienced HPC user to teach you about strong and weak scaling. In a few hours I will have time to write you one detailed post, but I cannot give you (or anybody else here!) sustained engagement at a similar level.

*if they give you a severe warning about the jobs you have been running, please accept it and move on; your choices of settings this time mean that you have wasted lots of computing resources that a well-planned simulation would have used much, much more efficiently. It is perfectly natural for newcomers to make mistakes but you will have to accept responsibility and promise to do better in future.

Here are some important points:

Scaling

Imagine that one chef can cook up a big menu in four hours. Now if two chefs work together, they may be able to cook the same food portions in two hours. But it might be difficult for four chefs to cook the same food in one hour, and you definitely cannot expect 240 chefs to cook the same food in one minute. At some point there are tasks that cannot be efficiently or safely split between different people.

It is the same with computational software. If a job runs on one core in 400 hours you should not expect it to run on 400 cores in one hour. You must test for yourself by running a short simulation (reduce the number of steps in the run command) on 8 cores, 16 cores, 24 cores, and so on.

In fact you might further fine-tune the scaling efficiency by strategically asking for a number of cores that is easily factorised. You have a box that is twice as high in z as it is wide in x and y. If you have 40 cores, LAMMPS divides the box into a 2 by 4 by 5 grid, which may not be too bad for your system; but if you ask for 48 cores (24 per node), you may get a 3 by 2 by 8 grid, and those boxes may be a more even division of your simulation box.
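
If it helps, here is a minimal sketch of such a strong-scaling test (the executable name lmp, the input file name, and the step count are placeholders; adapt them to your cluster):

```
# In the input script: shorten the run so each benchmark finishes quickly.
# A few thousand steps is enough for a meaningful timing breakdown.
run 2000

# Repeat the same short job with different core counts, e.g.
#   mpirun -np 8  lmp -in step8_final.inp -log log.8
#   mpirun -np 16 lmp -in step8_final.inp -log log.16
#   mpirun -np 24 lmp -in step8_final.inp -log log.24
# then compare the "Loop time" lines of the resulting log files.
```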

MPI and OpenMP

You should always set MPI processes and OpenMP (or “OMP”) threads so that procs * threads = ncores. In your script, however, you had 8 MPI procs per node and 8 OMP threads per proc, on nodes with 40 cores each. This is really inefficient, because the result is:

  1. Per node, 8 MPI “teams” of 5 cores each are formed.
  2. 8 OpenMP threads are assigned to the 5 cores in each “team”.
  3. Those 5 cores keep opening and closing the 8 threads as they try to do the work, which costs time and compute that you are wasting.

For LAMMPS you probably will not see good scaling on more than one node. So you don’t need to use OpenMP (plus you didn’t select the OMP accelerator in your LAMMPS script or command line, making it even less efficient). Just choose the number of cores and set the number of MPI procs equal to that.
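
For example (a sketch only, assuming your LAMMPS build includes the OMP package and that the executable is called lmp; adjust names to your cluster):

```
# MPI only, one 40-core node – the simplest and usually best choice here:
#   mpirun -np 40 lmp -in step8_final.inp

# If you do combine MPI and OpenMP, ranks * threads must equal the core
# count, and the OMP styles must actually be enabled:
#   export OMP_NUM_THREADS=5
#   mpirun -np 8 lmp -sf omp -pk omp 5 -in step8_final.inp

# The "-pk omp 5" switch is equivalent to putting this near the top of
# the input script:
package omp 5
```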

Other miscellaneous issues

You have two fix npt commands on different parts of the system, both of which have poor settings (the pressure time constant should be much longer). Whatever results you did get are almost certainly wrong.
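
Purely as a hedged illustration (the group, target values, and damping constants are placeholders, assuming real units and a 1 fs timestep), a single barostat with a much longer pressure time constant would look roughly like this; which dimensions to couple is a separate question, discussed in the reply below:

```
# Tdamp ~ 100 timesteps, Pdamp ~ 1000 timesteps (common rules of thumb)
fix 1 all npt temp 300.0 300.0 100.0 x 1.0 1.0 1000.0 y 1.0 1.0 1000.0
```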

You are using mpiexec.hydra in your PBS script – maybe this is right for your cluster but I’ve never seen that anywhere else.

Why use newton off? That increases computational work, especially when many particle pairs are shared across multiple processors – which in turn was how your original simulation was set up.
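
Unless you have a measured reason for newton off, dropping that line or setting the default explicitly is the safer choice:

```
# default: forces on pairs shared between processors are computed once
# and communicated, rather than computed twice
newton on
```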

This is a very old version of LAMMPS. You should upgrade. A more recent version gives you lots of bug fixes, better error messages, and better support for hybrid parallelization.

A recent version of LAMMPS would actually have refused to run your input, because you are making invalid choices in your input file. So your LAMMPS version has a known bug: it should refuse to run this input, but it does not (and thus will produce bogus results).

How large is “too large”? It is a very bad idea to guess in such matters. As a scientist you should instead rely on measurements and draw conclusions from those and - thanks to the timing summary output of LAMMPS - it is pretty straightforward to do so.

There also is a whole chapter in the LAMMPS manual discussing how to optimize performance.

There are a whole lot of issues here:

  • You say you are trying to use 400 CPU cores, but your posted log file shows that you are using 16 MPI processes (2 * 2 * 4) and 1 OpenMP thread per MPI process. So that is a huge discrepancy.
  • Your log file is inconsistent with your submit script. So you are providing us bad information. It is near impossible to provide good advice with bad data.
  • Your system has only 13584 atoms (according to the log file; the data file says 13515 atoms!!). That is a very small number. You cannot parallelize indefinitely: for simple pairwise potentials the limit of scaling in LAMMPS is reached at around a couple hundred atoms per MPI process, so you are clearly requesting far too many CPU resources (yet you do not use what you request).
  • Where the actual limit of scaling lies can be determined by running a sequence of short calculations (a few thousand steps) with 1, 2, 4, 8, 16, 32, … processors and then working out the parallel efficiency. With each doubling of processors you would ideally need half the time, so if you multiply the loop time by the number of MPI processes the product should remain constant. The single-process value divided by the multi-process value, expressed as a percentage, is your parallel efficiency. If it drops below 50%, there is no value in using more MPI processes.
  • You have a sparse system, yet LAMMPS by default assumes a dense system, so rather than losing time in Kspace you are more likely losing time to load imbalance. You have lots of empty volume in the z-direction, so a better domain decomposition (and thus better performance) would already be achieved by using the command processors * * 1 or processors * * 2 (see the sketch after this list). The latter only makes sense if you also use the balance command to shift the subdomain boundaries so that the number of atoms per subdomain is close to equal in all subdomains. With 2 subdomains in the z-direction this is not given unless you use the balance command, so without “balance” (recommended for a first parallel scaling test) you should only use 1 subdomain division in the z-direction.
  • To take advantage of OpenMP thread parallelization, you have to enable it, e.g. by appending -sf omp to the LAMMPS command line. Neither your submit script nor your output file gives any indication that you are doing that.
  • Same as with MPI, there is an overhead associated with OpenMP, so proper benchmarking to determine the optimal number of threads is also needed. Since the OpenMP parallelization is orthogonal to MPI but also less efficient in LAMMPS than the MPI parallelization (due to the nature of how LAMMPS divides the work which favors MPI), you should add OpenMP threads only after you have reached the optimal performance with MPI. Typically, that will translate to a small number of threads (2, 4, 8 sometimes) being the optimal choice.
  • In your case of a sparse system, however, OpenMP has some additional potential: since it is particle based rather than subdomain based, it suffers less from load imbalance across threads, and with fewer MPI ranks it is easier to arrive at a balanced domain decomposition.
  • Generally, Kspace timing is going to be significant, especially with a 1.0e-6 convergence; expect 30%-50% of the total time. The split between real-space and Kspace work may be adjusted by growing or shrinking the Coulomb cutoff (the total Coulomb force will remain the same). Again, careful benchmarking is required.
  • Your input is invalid since you have two active fixes that change the same box dimension. This usually leads to bogus results; it only looks reasonable in your case because both fixes try to apply the exact same change.
  • Your fix npt commands make no sense, since they request isotropic box deformation, yet you have a sparse system with empty volume in the z-direction. You should have only one fix npt command, and it should relax the x- and y-box dimensions independently; but you can probably also just use fix nvt for this system for the production run (see the sketch below). With fix nvt you don’t need precise pressures and can thus relax the PPPM convergence and gain some additional performance (forces converge faster than pressure with PPPM due to error cancellation).
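
To make the last few points concrete, here is a rough sketch only (group name, damping constants, and the exact PPPM accuracy are placeholders; real units and a 1 fs timestep are assumed, and every change should be benchmarked on your system):

```
# Better domain decomposition for a box that is half empty in z
# (the processors command must come before the box is created):
processors * * 1
# or, with two subdomains in z, rebalance the boundaries explicitly:
# processors * * 2
# balance 1.1 shift z 10 1.05

# Relaxed PPPM accuracy for an NVT production run:
kspace_style pppm 1.0e-4

# One thermostat fix instead of two overlapping fix npt commands:
fix 1 all nvt temp 300.0 300.0 100.0
```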

Dear srtee and akohlmey,

Thanks for all those valuable suggestions and advice.

Regarding the old LAMMPS version: I have to use 16 Mar 2018 because it is the version installed on the HPC, and I don’t have permission to update any software on it.

I’ll rewrite my scripts and follow your suggestions.

Thank you so much!

Best regards
Roger

If you can run simulations, you can also compile application software in your home folder. There is no need to install LAMMPS system-wide, and thus no elevated permissions are required. I myself have always compiled all simulation software on every HPC facility I have used, to make certain that it was compiled optimally and configured for my specific needs.

That said, you can also tell your HPC admins that after 5 1/2 years it is high time to update application software and that the developers of that software have urged you to do so because of the large number of improvements to that software since.

It sounds like you may not have a supervisor or mentor at your university who can guide you in doing these simulations (if you did, they should advise you on compiling your own programs, for example).

I will say again that if you cannot find a local supervisor or mentor, there is a big risk that you will make small, easily-avoided mistakes and end up with lots of data that has relatively little use. (Actually, both experimental and computational science are like this – but most people cannot access high-tech lab equipment without a local mentor’s help, whereas anyone can download a copy of LAMMPS from the Internet.) I myself recently showed some simulations to an older colleague, and he gave me advice that saved me from wasting weeks of computer time.

Ideally your supervisor should find a colleague that they can trust to collaborate with you on this. Recent papers have discussed how silica-water simulations that are too simple can omit basic effects here and here. Your local cluster administrators might also give you useful contacts.