please see some comments following the respective paragraphs below.
It appears my situation may be somewhat unusual, as I'm in private industry and not able to access HPC other than via the cloud, since we don't have our own internal resources. Suffice it to say, I think I can get some ideas if I describe what I'm actually doing: an EC2 instance with 32 (actually, 36) vCPUs coupled with 7 other 36-vCPU instances, all running Ubuntu 14.04.4 LTS (kernel 3.13.0-36-generic). I am using the apt-get lammps-daily package for the LAMMPS build, which includes OMP. With that information (and I can provide more), I believe extrapolations to other similar clusters running Ubuntu 14.04.4 LTS can be made.
agreed. this is a fairly common setup. the major difference from a
dedicated HPC cluster is the network used by MPI.
The systems I am simulating are drug/polymer systems, each with anywhere from 90K to 200K atoms, using the MMFF94s all-atom force field. I was able to use class2 for bonds and angles, OPLS for dihedrals, and harmonic for impropers, which essentially reproduces MMFF exactly - but what LAMMPS doesn't have is the same LJ potential and coulombic functions as are used in MMFF. lj/cut/coul/long is doing well, though, and I'm not all that concerned with this (yet). I am also generating partial atomic charges for my systems with a third-party package.
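For illustration, the style combination I ended up with looks roughly like this (the cutoff and PPPM tolerance shown here are placeholders, not my exact values):

    units           real
    atom_style      full
    # class2 bond/angle terms, OPLS dihedrals, harmonic impropers
    bond_style      class2
    angle_style     class2
    dihedral_style  opls
    improper_style  harmonic
    # stand-in for the MMFF nonbonded terms
    pair_style      lj/cut/coul/long 10.0
    kspace_style    pppm 1.0e-4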
For my systems, an example drug molecule would be nifedipine. An example
polymer would be polyvinylpyrrolidone, and I am using a truncated strand
with 16 monomers (a point of discussion for another time and place).
yeah. this is basically similar to any atomic-scale molecular system
with long-range electrostatics.
Anyway, my setup uses the 8 instances (each with 32 vCPUs) in a placement group on AWS (so they are theoretically in close proximity in the physical data center). The network speed is 10 Gbps and I'm using the Intel 82599 VF interface for enhanced networking.
proximity matters more for bandwidth; where you are limited with
10GigE is latency.
How I run LAMMPS:
In my input file, I am using package omp 8.
I run through NVE Langevin initialization, NVE randomization, then NPT cooling, equilibration, and production. A production fix is as follows:
fix 1 all npt temp 300.0 300.0 100.0 aniso 1.0 1.0 1000.0 drag 0.2
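For illustration, the initialization stage is along these lines (the damping parameter, seed, and step count here are placeholders, not my exact values):

    # NVE + Langevin thermostat for initialization
    fix nve1 all nve
    fix lang1 all langevin 300.0 300.0 100.0 48279
    run 50000
    unfix lang1
    unfix nve1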
My command line is: mpirun -np 32 --hostfile hostfile lammps-daily -in
the hostfile is what you would expect; it just lists localhost and the seven
other instances where the tasks are to be run - example here:
localhost slots=4 max_slots=4
172.31.7.215 slots=4 max_slots=4
172.31.13.138 slots=4 max_slots=4
172.31.1.0 slots=4 max_slots=4
172.31.3.252 slots=4 max_slots=4
172.31.8.248 slots=4 max_slots=4
172.31.13.79 slots=4 max_slots=4
172.31.8.250 slots=4 max_slots=4
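For completeness, the full invocation with the OMP suffix styles enabled from the command line would look like this (-sf omp and -pk omp 8 are the command-line equivalents of the suffix and package settings in my input file; the input file name is a placeholder):

    # 32 MPI ranks x 8 OpenMP threads = 256 cores total
    mpirun -np 32 --hostfile hostfile -x OMP_NUM_THREADS=8 \
        lammps-daily -sf omp -pk omp 8 -in in.production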
With this setup, I am getting about 3.9-4.1 ns/day per 100K atoms.
I've tried different OMP, -np, and slots settings, and this seems to be the
best I can do.
I also tried using the GPU package, but found it to be slower (which I honestly attribute to me getting something wrong, either in how I compiled it or how I am using AWS GPU systems, and I have admittedly not done a lot of experimentation there so far).
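For reference, a typical GPU package invocation would be along these lines (this assumes a CUDA-enabled LAMMPS build and one GPU per node; the input file name is a placeholder):

    # one MPI rank per GPU is a common starting point; the GPU
    # package splits work between host and device, so more ranks
    # per GPU can also pay off
    mpirun -np 8 --hostfile hostfile lammps-daily -sf gpu -pk gpu 1 -in in.production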
Am I seeing reasonable times, or should a 256-processor cluster do better on
100K atoms with OMP?
i think that what you describe is quite decent for the environment you
are using. i would have expected worse.
Below is a sample of my output after a run of 100K timesteps:
Loop time of 3102.45 on 256 procs for 100000 steps with 145750 atoms
Performance: 2.785 ns/day, 8.618 hours/ns, 32.233 timesteps/s
609.8% CPU use with 32 MPI tasks x 8 OpenMP threads
MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
Pair | 1476.3 | 1525.5 | 1560.1 | 58.1 | 49.17
Bond | 129.1 | 132.55 | 136.89 | 16.2 | 4.27
Kspace | 887.21 | 927.4 | 971.73 | 76.0 | 29.89
Neigh | 77.959 | 78.574 | 78.87 | 3.4 | 2.53
Comm | 247 | 258.9 | 272.16 | 38.8 | 8.35
Output | 8.8807 | 8.8829 | 8.8897 | 0.1 | 0.29
Modify | 143.73 | 151.06 | 160.52 | 27.4 | 4.87
Other | | 19.59 | | | 0.63
Nlocal: 4554.69 ave 4657 max 4449 min
Histogram: 1 1 2 7 6 5 4 3 2 1
Nghost: 16507.5 ave 16640 max 16246 min
Histogram: 1 0 1 1 5 4 5 3 7 5
Neighs: 1.56464e+06 ave 1.61154e+06 max 1.5159e+06 min
Histogram: 2 5 3 2 1 5 3 4 5 2
Total # of neighbors = 50068548
Ave neighs/atom = 343.523
Ave special neighs/atom = 12
Neighbor list builds = 3582
Dangerous builds = 0
as you can see, the majority of the time is actually spent in the Pair and
Kspace computations, but Comm is not insignificant.
yes. i think your assessment of the situation and your strategy to
obtain good performance is right.
as you can see from the "600%" CPU use with 8 OpenMP threads, the
OpenMP parallel efficiency is not great (609.8% over 8 threads works
out to roughly 76% per-thread utilization; it is best with 2-4
threads), but using more MPI tasks and fewer OpenMP threads will load
your network interfaces more heavily and thus drive up "Comm"
significantly. this will also become a problem if you try to use more
instances at the same time. this is where the infiniband (or better)
network of dedicated HPC clusters shines: you can push much further to
using more nodes, and thus more MPI ranks and more CPU cores in total,
without losing much parallel efficiency.
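just to make the tradeoff concrete, a hypothetical alternative balance
would look like the line below (the input file name is a placeholder,
and the hostfile slots would have to be raised to 8 per instance):

    # 64 ranks x 4 threads: better per-thread OpenMP efficiency,
    # but more off-node MPI traffic, which drives up Comm
    mpirun -np 64 --hostfile hostfile -x OMP_NUM_THREADS=4 \
        lammps-daily -sf omp -pk omp 4 -in in.production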
i think there are two major options that you can play with in order to
optimize your performance (concrete input lines are sketched after this
list):
- coulomb cutoff. increasing the coulomb cutoff will increase the time
  in Pair but reduce the time in Kspace. Pair has worse algorithmic
  scaling, but it multi-threads very well and needs very little
  communication, while Kspace requires (much) more communication and
  doesn't benefit from multi-threading as much as Pair. since you need
  to use a larger number of threads to avoid network contention,
  tweaking this might help.
- neighbor list skin distance and overall neighbor list settings.
  neighbor list builds also require some communication, but the skin
  distance also controls the ghost atom cutoff and the communication
  cutoff (unless it is set explicitly). increasing the skin distance
  reduces the number of neighbor list builds, but yours are already
  fairly infrequent (every 28 steps on average). reducing the skin
  parameter will trigger more neighbor list builds, but will also speed
  up communication of ghosts and Pair (fewer pairs to check the
  distance of). same as for the coulomb cutoff, there is an optimum,
  but due to having a higher-latency network, your optimum may differ
  from the default settings, and some small tweaks may yield some
  improvement.
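here is a sketch of where those two knobs live in the input file. the
numbers are starting points to scan over, not recommendations (i am
assuming a 10 angstrom LJ cutoff here):

    # shift work from Kspace to Pair: keep the LJ cutoff, raise coulomb
    pair_style    lj/cut/coul/long 10.0 12.0
    # smaller skin -> fewer ghost atoms, faster Pair and Comm,
    # at the price of more frequent neighbor list builds
    neighbor      1.5 bin
    neigh_modify  every 1 delay 5 check yes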
there are some other minor tweaks that might help as well, but since
you already did the MPI vs. OpenMP balancing, these are the remaining
settings with the largest potential impact.