LAMMPS MPI runs for Mg mechanical properties

Good afternoon LAMMPS users,
Undergraduate student from UCF here, trying to compare molecular dynamics results to a larger-scale simulation of tension and compression tests on Mg. I'm currently trying to run a two-part box as shown in the script below; however, the simulation seems to be taking an excessive amount of time. I'm using the most recent Windows version of LAMMPS and executing it as lmp_mpi -in in.xxx. I've also attempted mpiexec -localonly #4 lmp_mpi -in in.xxxx, only to run into error 109, unable to start the local smpd manager. Any insight on how to fix this error or run the script below more efficiently? Thank you.
Kind Regards
Jonathan Sosa

units metal
atom_style atomic
dimension 3
boundary p p p
variable latconst equal 3.20

lattice hcp ${latconst}
region cube block 0 20 0 20 0 20
create_box 1 cube
lattice hcp ${latconst} orient x 1 0 0 orient y 0 1 0 orient z 0 0 1
create_atoms 1 region cube

pair_style eam/fs
pair_coeff * * Mg.eam.fs Mg

variable L equal 128
variable dy equal 12.8
variable y1 equal -0.0
variable y2 equal ${dy}
region Lower block INF INF ${y1} ${y2} INF INF units box
group Lower region Lower
variable y1 equal ${L}-${dy}
variable y2 equal ${L}
region Upper block INF INF ${y1} ${y2} INF INF units box
group Upper region Upper
group boundary union Lower Upper
group middle_atoms subtract all boundary

velocity all create 0.01 511124 rot yes mom yes

compute peratom all stress/atom pair virial
compute fy all reduce sum &
c_peratom[1] c_peratom[2] c_peratom[3] &
c_peratom[4] c_peratom[5] c_peratom[6]
compute p all reduce sum c_peratom[1] c_peratom[2] c_peratom[3]
variable sigmavolume equal c_fy[2]/vol
variable strain equal (ly-v_L)/v_L
variable press equal -(c_p[1]+c_p[2]+c_p[3])/(3*vol)
thermo 10000
thermo_style custom step ly vol v_strain v_sigmavolume temp etotal press v_press
fix relax all nvt temp 0.01 0.01 .01
timestep 0.005
run 500
unfix relax
variable upper_vel equal 0.01
fix zeroing_force_on_lower Lower setforce 0.0 0.0 0.0
fix zeroing_force_on_upper Upper setforce 0.0 0.0 0.0
velocity Lower set 0.0 0.0 0.0 units box
velocity Upper set 0.0 ${upper_vel} 0.0 units box
fix fix1 middle_atoms nvt temp 0.01 0.01 0.01
fix fix2 boundary nve
run 300000
velocity Upper set 0 0 0
run 1000
print "All done"

hi jonathan,

thanks for writing a detailed, specific, and concise question with
sufficient information. this is rather uncommon and thus highly
welcome.

you are facing several problems:

first of all: the smpd.exe error results from not "installing" (or
registering) this service after installing the MPICH2 package.
i just updated the webpage with instructions for that. you won't be
able to use MPI without it.
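
for reference, on a typical MPICH2 installation the registration
amounts to something like the following, run from an administrator
command prompt (the exact steps may differ with your MPICH2 version):

smpd -install
mpiexec -register
mpiexec -validate

the -register step stores your windows account credentials for smpd,
and -validate just checks that the stored credentials actually work.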

second: the syntax for running across 4 processors would be:

mpiexec -localonly 4 lmp_mpi -in in.myfile

or

mpiexec -np 4 lmp_mpi -in in.myfile
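
to check that smpd and mpiexec are working at all, independent of
LAMMPS, you can first launch some trivial program in parallel, for
example:

mpiexec -localonly 4 hostname

if that prints the machine name four times, the MPI setup itself is
fine.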

third: as an alternative, you can try using multi-threading:

set OMP_NUM_THREADS=4
lmp_serial -in in.myfile -sf omp

fourth: i just posted to the mailing list that i found inconsistent
timings as well. indeed, in some cases MPI performed well or
comparably to multi-threading, in some cases it sucked, and in some
other cases multi-threading performance was awful. this is very
different from my experience on linux, where the performance is more
consistent. but since i don't have a lot of experience running on
windows (i use linux to compile the windows packages), i don't know
how common this is.

one final remark: parallel jobs are very sensitive to synchronization
issues. this is why the MPI version uses processor affinity to pin
each parallel process to a fixed CPU core. if there is another task
(like a web browser) eating up some time, it may slow down the
overall calculation, since all other processes have to wait. the same
applies to multi-threaded calculations. in the latter case, i noticed
that one can get more consistent performance using one thread less
than the number of CPU cores available. seemingly, that pushes all
non-computing processes onto just one of the cores, so all the others
can run the simulation exclusively and thus see fewer delays due to
synchronization.
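
on a quad-core machine that would look like this (using the same
placeholder input file name as above):

set OMP_NUM_THREADS=3
lmp_serial -in in.myfile -sf omp

i.e. leave one core free for the operating system and whatever else
is running in the background.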

please try it out and let us know.

thanks,
     axel.

ps: as a point of reference, on two CPU cores of an intel xeon
E5-1603 at 2.8GHz running CentOS 6.4, the first 500 steps take 20
seconds when running with MPI and 19 seconds when running with
multi-threading.