I really don’t understand how to use LAMMPS on more than one processor on my computer, and would be very grateful if you could point me in the right direction or tell me what I was doing wrong.
I’ve tried running it on Windows on three separate computers, and the behavior was similar on all three. I’ve tried the mpiexec/mpirun executables from Intel’s Parallel Studio XE 2019, from Argonne National Laboratory’s MPICH v1.4.1p1 (the latest version I could find with a Windows installer) and from Microsoft MPI. My input scripts contain the lines “suffix omp” and “package omp n”, and I tried values of n ranging from 1 to the maximum number of processors available on the computer. In all cases I used LAMMPS’ prebuilt Windows executable lmp_mpi.exe.
When running with MPICH’s mpiexec, the command-line option -n [number of processors] gave increasing performance up to -n 4; anything above that (such as -n 6 or -n 16) gave similar or worse performance than 4 processors. With the other two MPI implementations, however, no combination of -n (and its synonyms) and -cpus-per-task (and its synonyms) worked: every attempt launched n LAMMPS instances running on one processor each. For example, mpiexec -n 8 -cpus-per-task 1 lmp_mpi -in test.in and mpiexec -n 1 -cpus-per-task 8 lmp_mpi -in test.in would both, regardless of what I wrote in my input file for package omp ___, create 8 single-threaded instances of LAMMPS, each printing its results in the command window one after the other.
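To make the layouts I was aiming for concrete: what I expected is that the total core count is split into MPI ranks × OpenMP threads per rank. The small Python sketch below (my own hypothetical helpers, hybrid_layouts and launch_command, not part of LAMMPS) lists every such split for a given core count and builds the corresponding command line. It assumes the LAMMPS command-line switches -sf omp and -pk omp N behave like the “suffix omp” and “package omp N” lines in my input script; lmp_mpi and test.in are just the names from my runs.

```python
def hybrid_layouts(cores):
    """Return every (ranks, threads) pair with ranks * threads == cores."""
    return [(r, cores // r) for r in range(1, cores + 1) if cores % r == 0]


def launch_command(ranks, threads, exe="lmp_mpi", script="test.in"):
    """Build an mpiexec line: `ranks` MPI processes, each asking for `threads` OpenMP threads.

    Assumption: -sf omp / -pk omp N are equivalent to `suffix omp` / `package omp N`
    in the input script.
    """
    return f"mpiexec -n {ranks} {exe} -sf omp -pk omp {threads} -in {script}"


if __name__ == "__main__":
    # Print one candidate command line per MPI x OpenMP split of 8 cores.
    for r, t in hybrid_layouts(8):
        print(launch_command(r, t))
```

For 8 cores this gives the splits 1×8, 2×4, 4×2 and 8×1; what I observed instead was always 8 independent single-threaded runs.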
I then accepted that pure Windows wasn’t the way to go and installed the latest version of Ubuntu under WSL (Windows Subsystem for Linux), along with LAMMPS and all the other required packages. I tried both building LAMMPS myself, following the LAMMPS documentation with CMake plus make as well as with plain make, and using the precompiled Ubuntu lmp_daily package described in the documentation.
When running lmp_daily, I need to run the following command to allow the use of more than one processor:
echo 0|sudo tee /proc/sys/kernel/yama/ptrace_scope
This way, multithreaded runs do gain performance, but again, issues arise above 4 processors. During a minimisation with 1, 2 or 3 processors, the run stops at a given iteration once the energy tolerance stopping criterion is reached. With 4 or more processors, however, the minimisation keeps going and prints exactly the same results over and over until the maximum number of allowed iterations is reached, as if the convergence check were never performed.
With the versions I built myself, the same problem as on Windows appears: multiple single-threaded instances are run instead of a single multithreaded one.
I’m at a complete loss as to what I’ve done wrong and would appreciate your guidance. Sorry for the long email; I wanted to be thorough about what I’ve attempted and why it hasn’t worked, to help you help me!
What setup do you use to run LAMMPS multithreaded on Windows on a high-core-count machine (16 to 32 cores)? Also, in all my testing, both on Windows and on Linux systems such as SLURM-based supercomputers, the value of the OMP_NUM_THREADS environment variable has no noticeable impact on performance. Why would that be?
Thank you so much, and have a great day. Sorry again for the long read!