Difficulties with multithreaded usage of LAMMPS on Windows as well as Ubuntu from Windows Subsystem for Linux

Hi!

I really don’t understand how to use LAMMPS on more than one processor on my computer, and would be very grateful if you could point me in the right direction or tell me what I am doing wrong.

I’ve tried running it on Windows on three separate computers; on all three the behavior was similar. I’ve tried using the mpiexec/mpirun executable from Intel’s Parallel Studio XE 2019, from Argonne Lab’s MPICH v1.4.1p1 (the latest one I’ve found available with a Windows installer), and from Microsoft MPI. My input scripts contain the lines “suffix omp” and “package omp n”, where I tried different values of n, ranging from 1 to the maximum number of processors available on the computer. In all cases this was done using LAMMPS’ prebuilt Windows executable lmp_mpi.exe.

When running with MPICH’s mpiexec, the command-line option -n [number of processors] gave increasing performance up to a cap of 4; anything above -n 4 (such as -n 6 or -n 16) yielded performance similar to or worse than 4 processors. With the two others, no matter what combination of -n (and synonyms) and -cpus-per-task (and synonyms) I used, n LAMMPS instances were executed on one processor each. For example, mpiexec -n 8 -cpus-per-task 1 lmp_mpi -in test.in and mpiexec -n 1 -cpus-per-task 8 lmp_mpi -in test.in would both, no matter what I wrote in my input file at “package omp ___”, create 8 single-threaded instances of LAMMPS that printed their results to the command window one after the other.

I then accepted that pure Windows wasn’t the way to go, and installed the latest version of Ubuntu in WSL (Windows Subsystem for Linux) along with LAMMPS and all other required packages, both by building LAMMPS myself (using cmake as well as plain make, following the available LAMMPS documentation) and by using the precompiled Ubuntu lmp_daily version detailed in the documentation.

When running lmp_daily, I need to use this command to allow the use of more than one processor:

echo 0|sudo tee /proc/sys/kernel/yama/ptrace_scope

This way, multithreaded workloads do gain performance, but again, with more than 4 processors, issues arise. A minimisation using 1, 2 or 3 processors stops at a given iteration, having reached the energy tolerance stopping criterion. However, using 4 or more processors makes the minimisation keep going, printing the exact same results over and over until the maximum number of allowed iterations is reached, as if there were no test that the minimisation was complete.

Using the versions I built myself, the same problem as on Windows appears: multiple single-threaded instances are run instead of a single multithreaded one.

I’m at a complete loss as to what I’ve done wrong and in need of your guidance. Sorry for the long email, I wanted to make sure I was thorough in what I’ve attempted and why it hasn’t worked, to help you help me! :slight_smile:

What setup do you use to run LAMMPS in a multithreaded fashion on Windows on a high-core-count machine (16 to 32 cores)? Also, in all my testing, both on Windows and on Linux-based systems such as SLURM-based supercomputers, the value of the environment variable OMP_NUM_THREADS has had no real impact on performance. Why would that be?

Thank you so much and have a great day, sorry again for the long read!
Antoine.

Hi!

I really don’t understand how to use LAMMPS on more than one processor on my computer, and would be very grateful if you could point me in the right direction or tell me what I am doing wrong.

I’ve tried running it on Windows on three separate computers; on all three the behavior was similar. I’ve tried using the mpiexec/mpirun executable from Intel’s Parallel Studio XE 2019, from Argonne Lab’s MPICH v1.4.1p1 (the latest one I’ve found available with a Windows installer), and from Microsoft MPI. My input scripts contain the lines “suffix omp” and “package omp n”, where I tried different values of n, ranging from 1 to the maximum number of processors available on the computer. In all cases this was done using LAMMPS’ prebuilt Windows executable lmp_mpi.exe.

a) you are mixing up MPI and OpenMP here. those are very different entities. i suggest you step back a little, get some basic education on what those are and how they work, and perhaps do a little tutorial on how to program with them. even if you never plan to write a parallel program after that, having that experience will make you understand how to use either parallelization method much better than any other approach would.

b) with the precompiled windows binaries you may only use the mpiexec from MPICH v1.4.1p1, as the MPI library has to match what was used to build the executable. otherwise it will launch multiple independent copies of the same calculation that don’t know they are connected, because only the mpiexec tool matching the MPI library version knows how to set up the inter-process communication properly.

the recommended way to use parallelism in LAMMPS is to start with MPI only (no suffix, no package command). how to do this is explained at length in the LAMMPS manual. once you are familiar with that, and under certain circumstances where MPI gives poor performance (LAMMPS is designed such that MPI parallelism is usually more efficient), you should try using OpenMP or a combination of MPI and OpenMP.

only if you are having trouble running MPI correctly, or have security concerns or firewall problems, should you consider OpenMP, as it has fewer requirements to launch correctly.

the specifics of how to run a precompiled windows binary in MPI mode or OpenMP mode or both are outlined on http://packages.lammps.org/windows.html
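for reference, the three launch modes look roughly like this. this is only a sketch: lmp_mpi and in.test are placeholder names for your executable and input script, and the mpiexec must be the one matching the MPI library the binary was built with.

```shell
# MPI only (recommended starting point): 4 MPI ranks, no suffix/package needed
mpiexec -np 4 lmp_mpi -in in.test

# OpenMP only: 1 process with 4 threads, enabling the omp-suffixed styles
env OMP_NUM_THREADS=4 lmp_mpi -sf omp -pk omp 4 -in in.test

# hybrid: 2 MPI ranks with 2 OpenMP threads each
env OMP_NUM_THREADS=2 mpiexec -np 2 lmp_mpi -sf omp -pk omp 2 -in in.test
```

the -sf and -pk command-line switches are equivalent to putting “suffix omp” and “package omp N” in the input script, which avoids editing the script for every run.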

When running with MPICH’s mpiexec, the command-line option -n [number of processors] gave increasing performance up to a cap of 4; anything above -n 4 (such as -n 6 or -n 16) yielded performance similar to or worse than 4 processors. With the two others, no matter what combination of -n (and synonyms) and -cpus-per-task (and synonyms) I used, n LAMMPS instances were executed on one processor each. For example, mpiexec -n 8 -cpus-per-task 1 lmp_mpi -in test.in and mpiexec -n 1 -cpus-per-task 8 lmp_mpi -in test.in would both, no matter what I wrote in my input file at “package omp ___”, create 8 single-threaded instances of LAMMPS that printed their results to the command window one after the other.

I then accepted that pure Windows wasn’t the way to go, and installed the latest version of Ubuntu in WSL (Windows

the plain windows version works fine with both MPI and OpenMP and has been tested and used in parallel mode by quite a few people.

Subsystem for Linux) along with LAMMPS and all other required packages, both by building LAMMPS myself (using cmake as well as plain make, following the available LAMMPS documentation) and by using the precompiled Ubuntu lmp_daily version detailed in the documentation.

When running lmp_daily, I need to use this command to allow the use of more than one processor:

echo 0|sudo tee /proc/sys/kernel/yama/ptrace_scope

This way, multithreaded workloads do gain performance, but again, with more than 4 processors, issues arise. A minimisation using 1, 2 or 3 processors stops at a given iteration, having reached the energy tolerance stopping criterion. However, using 4 or more processors makes the minimisation keep going, printing the exact same results over and over until the maximum number of allowed iterations is reached, as if there were no test that the minimisation was complete.

Using the versions I built myself, the same problem as on Windows appears: multiple single-threaded instances are run instead of a single multithreaded one.

again, you seem to be confusing MPI and OpenMP here. MPI does not use threads, but independent processes that communicate via messages abstracted by a library (hence the acronym MPI, for “Message Passing Interface”), while OpenMP requires shared memory and threads and is (mostly) directive based.

I’m at a complete loss as to what I’ve done wrong and in need of your guidance. Sorry for the long email, I wanted to make sure I was thorough in what I’ve attempted and why it hasn’t worked, to help you help me! :slight_smile:

you haven’t read the documentation properly and are not following what is explained on the webpage for the precompiled installers. now, why your calculation does not get increased performance with more CPUs may have lots of reasons, and it is impossible to tell which from remote without knowing exactly what you are doing and how. please keep in mind that all functionality in LAMMPS has to be compatible with MPI parallelism, but parallel efficiency depends on how effectively domain decomposition can be applied to your system. OpenMP parallelises over particles, so that limitation does not apply there, but not all functionality has OpenMP support, and Amdahl’s law applies: the amount of non-parallel code determines how much speedup you can get at most. depending on the size of your system, you may reach a scaling limit, you may run into hyper-threading not giving much speedup (at best 20% in my experience), or you may have load balancing issues.
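a quick back-of-the-envelope sketch of Amdahl’s law (plain awk arithmetic, nothing LAMMPS-specific): if a fraction p of the work is parallelised, the best possible speedup on n processors is 1/((1-p) + p/n), which is capped at 1/(1-p) no matter how many cores you add.

```shell
# Amdahl's law: speedup = 1 / ((1-p) + p/n)
# p = parallel fraction of the code, n = number of processors
amdahl() { awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1/((1-p)+p/n) }'; }

amdahl 0.90 4    # prints 3.08 -- 90% parallel code on 4 cores
amdahl 0.90 16   # prints 6.40 -- quadrupling the cores gains barely 2x
amdahl 0.90 1000 # prints 9.91 -- approaches the 1/(1-p) = 10 ceiling
```

this is why a serial fraction of even 10% makes scaling flatten out quickly, independently of any MPI or OpenMP configuration details.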

What setup do you use to run LAMMPS in a multithreaded fashion on Windows on a high-core-count machine (16 to 32 cores)?

Also, in all my testing, both on Windows and on Linux-based systems such as SLURM-based supercomputers, the value of the environment variable OMP_NUM_THREADS has had no real impact on performance. Why would that be?

that must be because you are making mistakes, or are trying to apply OpenMP multi-threading to an executable that was not compiled with OpenMP support, or are using features for which no OpenMP variant exists.

axel.

Alright, thanks! I’ll do my homework and try my best at learning the subtleties differentiating the two.

Using the description on the website you linked for OpenMP execution, I was able to cut down on time and increase performance, thanks a bunch!

However, for MPI parallelism, following the information you’ve told me, how come this didn’t work?

Here’s a pastebin of my input script : https://pastebin.com/82tKeUy3
Here’s a pastebin of the output in the windows console : https://pastebin.com/MV9xNpXX
As you can see, I called for it to run on 2 processors, but it instead ran twice on one processor each; since lmp_mpi is block buffered, you can see how each instance of lmp_mpi printed over the other.

Thanks again for your help! :slight_smile:
Antoine.

Alright, thanks! I’ll do my homework and try my best at learning the subtleties differentiating the two.

Using the description on the website you linked for OpenMP execution, I was able to cut down on time and increase performance, thanks a bunch!

However, for MPI parallelism, following the information you’ve told me, how come this didn’t work?

i already told you! twice!!
i hate it when i have to explain the same thing multiple times. the next time i will not reply but add your e-mail to my “straight-to-trash” filter.
you must use the MPI launcher program that is bundled with the matching MPI library that was used to compile your executable.
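for illustration, with the precompiled Windows package the difference is simply which mpiexec you invoke. the install paths below are typical defaults and may differ on your machine:

```shell
# mismatched launcher (e.g. MS-MPI's mpiexec with an MPICH-linked lmp_mpi.exe):
# starts 2 independent single-process copies of the same calculation
"C:\Program Files\Microsoft MPI\Bin\mpiexec.exe" -np 2 lmp_mpi.exe -in test.in

# matching launcher (the MPICH v1.4.1p1 the binary was built against):
# starts one parallel run with 2 communicating ranks
"C:\Program Files\MPICH2\bin\mpiexec.exe" -np 2 lmp_mpi.exe -in test.in
```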

axel.