Please note the following.
In your “normal” run you have:
Loop time of 2.57551 on 1 procs for 100 steps with 32000 atoms
Performance: 16773.409 tau/day, 38.827 timesteps/s
16.4% CPU use with 1 MPI tasks x 1 OpenMP threads
This means that there are other processes running on your computer consuming most of the CPU time, so only about 1/6th of a core is available to your LAMMPS process.
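To make the "1/6th" concrete, here is a minimal arithmetic sketch, assuming (as LAMMPS does for this line) that the reported CPU use is the ratio of CPU time to wall time over the loop:

```python
# Numbers taken from the "normal" run log above.
wall_time = 2.57551        # "Loop time" in seconds (wall clock)
cpu_percent = 16.4         # reported "% CPU use"

# CPU seconds actually spent in LAMMPS during the loop
cpu_time = wall_time * cpu_percent / 100.0
print(round(cpu_time, 3))            # ~0.422 s of 2.576 s wall time

# Fraction of one core that went to LAMMPS: ~1/6
print(round(cpu_percent / 100.0, 3))
```

So roughly 5/6 of the wall time was spent waiting while other processes held the CPU.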
When I run on my (slow, old Windows 10) machine, I get instead (note the 100% CPU usage!):
Loop time of 3.67333 on 1 procs for 100 steps with 32000 atoms
Performance: 11760.434 tau/day, 27.223 timesteps/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
Your MPI parallel logs support my suspicion that they were not run in parallel, because either the executable is not MPI enabled, or you are using the wrong MPI runtime installation. Despite requesting two MPI processes, the log reports only one, which means you are running the same calculation twice concurrently.
Loop time of 3.34212 on 1 procs for 100 steps with 32000 atoms
Performance: 12925.947 tau/day, 29.921 timesteps/s
19.6% CPU use with 1 MPI tasks x 1 OpenMP threads
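A quick way to verify this is to check the rank count that LAMMPS itself reports, rather than what was requested on the command line. A minimal sketch (the log line is copied from the output above):

```python
import re

# If mpirun was asked for 2 ranks but LAMMPS still prints "on 1 procs",
# each rank ran the whole problem independently: the binary is not MPI
# enabled, or it was launched with a mismatched MPI runtime.
log_line = "Loop time of 3.34212 on 1 procs for 100 steps with 32000 atoms"
nprocs = int(re.search(r"on (\d+) procs", log_line).group(1))
print(nprocs)  # 1, even though two MPI processes were requested
```

A correctly launched two-rank run would print "on 2 procs" and "2 MPI tasks" instead, as in the output below.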
The corresponding output on my machine is (note the 2 MPI tasks and the nearly 100% CPU):
Loop time of 1.92714 on 2 procs for 100 steps with 32000 atoms
Performance: 22416.626 tau/day, 51.890 timesteps/s
99.7% CPU use with 2 MPI tasks x 1 OpenMP threads
With OpenMP threading there should be significantly more than 100% CPU use reported. Here is my output for 2 OpenMP threads:
Loop time of 1.84196 on 2 procs for 100 steps with 32000 atoms
Performance: 23453.258 tau/day, 54.290 timesteps/s
184.9% CPU use with 1 MPI tasks x 2 OpenMP threads
That you get less is another indication that your machine is very busy with other processes, and for as long as that is the case, your calculations will always be slow.
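For comparison, the timings from my machine show what healthy threaded scaling looks like. A minimal sketch using the serial and 2-thread loop times quoted above:

```python
# Loop times from my (unloaded) machine, taken from the logs above.
serial_time = 3.67333     # 1 MPI task x 1 OpenMP thread
threaded_time = 1.84196   # 1 MPI task x 2 OpenMP threads

speedup = serial_time / threaded_time   # close to the ideal 2x
efficiency = speedup / 2 * 100          # percent of ideal 2-thread scaling
print(round(speedup, 2), round(efficiency, 1))
```

On a busy machine the threads compete with other processes for cores, so both the reported % CPU and the effective speedup drop well below these values.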