There is not enough information here to make any specific assessment.
You should provide the complete input deck so that your simulation can be reproduced. There must be an error in it that is not shown on the screen. Have you checked the per-partition screen.N files?
You should also upload the “MPI task timing breakdown:” information for your regular verlet run.
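As a point of reference, here is the timing output for the “rhodo” benchmark example, run both ways. For anyone wanting to reproduce it, a sketch of the command lines (assuming an MPI build of LAMMPS and the bench/in.rhodo input; the executable name lmp is a placeholder for your binary):

    # regular verlet: one partition with 4 MPI tasks
    mpirun -np 4 lmp -in in.rhodo

    # verlet/split: two partitions, 3 MPI tasks for pair/bond/neighbor and 1 for Kspace.
    # this requires "run_style verlet/split" in the input; each partition then
    # writes its own screen.N and log.lammps.N file.
    mpirun -np 4 lmp -partition 3 1 -in in.rhodo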
First, for regular verlet with four MPI tasks:
Loop time of 6.64045 on 4 procs for 100 steps with 32000 atoms
Performance: 0.651 ns/day, 36.891 hours/ns, 15.059 timesteps/s, 481.895 katom-step/s
99.6% CPU use with 4 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 5.0777 | 5.1938 | 5.3921 | 5.2 | 78.22
Bond | 0.19833 | 0.20452 | 0.21836 | 1.8 | 3.08
Kspace | 0.60426 | 0.79964 | 0.92044 | 13.2 | 12.04
Neigh | 0.16659 | 0.16662 | 0.16664 | 0.0 | 2.51
Comm | 0.053485 | 0.053931 | 0.054229 | 0.1 | 0.81
Output | 0.00019315 | 0.00020414 | 0.00023533 | 0.0 | 0.00
Modify | 0.20754 | 0.20911 | 0.21193 | 0.4 | 3.15
Other | | 0.0126 | | | 0.19
and here for the verlet/split run, output of partition 0:
Loop time of 7.5521 on 3 procs for 100 steps with 32000 atoms
Performance: 0.572 ns/day, 41.956 hours/ns, 13.241 timesteps/s, 423.723 katom-step/s
99.7% CPU use with 3 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 6.5852 | 6.6661 | 6.7238 | 2.3 | 88.27
Bond | 0.24862 | 0.26474 | 0.28059 | 2.5 | 3.51
Kspace | 0 | 0 | 0 | 0.0 | 0.00
Neigh | 0.21704 | 0.21707 | 0.21711 | 0.0 | 2.87
Comm | 0.050077 | 0.10879 | 0.20697 | 21.2 | 1.44
Output | 0.0002199 | 0.00022339 | 0.00023007 | 0.0 | 0.00
Modify | 0.13303 | 0.13553 | 0.13763 | 0.5 | 1.79
Other | | 0.1597 | | | 2.11
and partition 1:
Loop time of 7.55064 on 1 procs for 100 steps with 32000 atoms
Performance: 0.572 ns/day, 41.948 hours/ns, 13.244 timesteps/s, 423.805 katom-step/s
95.9% CPU use with 1 MPI tasks x 1 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 0 | 0 | 0 | 0.0 | 0.00
Bond | 0 | 0 | 0 | 0.0 | 0.00
Kspace | 2.1499 | 2.1499 | 2.1499 | 0.0 | 28.47
Neigh | 0 | 0 | 0 | 0.0 | 0.00
Comm | 0 | 0 | 0 | 0.0 | 0.00
Output | 0 | 0 | 0 | 0.0 | 0.00
Modify | 0 | 0 | 0 | 0.0 | 0.00
Other | | 5.401 | | | 71.53
Only about 12% of the total time is spent on Kspace, yet a “3 1” partition split allocates 25% of the resources, i.e. more than double that share, to the Kspace partition. At the same time the Kspace computation itself becomes slower, since it no longer runs in parallel across 4 processors, and verlet/split adds communication overhead of its own. You can see this in the partition 1 output above: the single Kspace task spends over 70% of its time in “Other”, which is essentially time spent waiting for the larger partition.
The bottom line: the regular MPI run is over 10% faster (6.64 seconds versus 7.55 seconds).
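To spell out the arithmetic behind those two statements, using the numbers from the timing tables above:

$$\frac{0.80}{6.64} \approx 12\%\ \text{of time in Kspace}, \qquad \frac{1}{4} = 25\%\ \text{of tasks given to Kspace}, \qquad \frac{7.55 - 6.64}{7.55} \approx 12\%\ \text{faster}.$$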