Running LAMMPS on AWS Cloud with HPC nodes

Hello LAMMPS Community:

I am running LAMMPS on AWS cloud Linux (Ubuntu 14.04) instances, and I am interested to know whether other members of the LAMMPS community are also using cloud computing for LAMMPS. Your experience need not be with AWS, but that is what I use, so it’s a plus if you are using AWS. I am hoping to learn how to better optimize performance, and my first thought was to check and see whether anyone out there is even using LAMMPS in the cloud.

Thanks,

Matt Wessel


matt,

before commenting on this, you should perhaps explain a bit more what
your expectations are:
specifically, how large the systems you are trying to simulate are, what
type of instances you are using, how many tasks (MPI, OpenMP, GPU, or a
mix of those), and how many concurrent simulations you expect to run.
and finally, what kind of performance you currently get, and what
kind of improvement you hope to get. LAMMPS is a very large and
complex piece of software and can be used in many different ways, so it
is not possible to make generic recommendations covering all cases.

thanks,
    axel.

Axel,

That’s a great suggestion, but I’d prefer to wait until I know if anyone else is even doing this, which was my initial question.

If you think it’s better for me to provide specifics, I can do that, and in great detail, but I thought it better to query for the existence of other cloud users first. Let me know if that’s an erroneous position.

Matt


matt,

i don't know, since people usually provide very little information
about where they are running calculations, but it seems that the
majority of LAMMPS users run either on desktops or on local or national
dedicated HPC resources. for researchers in academia, using cloud
computing resources is usually not financially very attractive unless
their computational needs are very irregular and more on the
high-throughput computing side. for most people doing MD simulations,
this is not the case, and thus my expectation is that there are
extremely few people that have tried or are using cloud resources.

the closest thing to cloud computing that i have seen used with some
efficiency in connection with LAMMPS are web frontends, sometimes also
referred to as "science gateways" (like nanohub.org). those usually
feed into dedicated HPC clusters or supercomputers run at national HPC
centers, though.

also, my experience on this mailing list is that very generic
questions rarely yield many answers. just consider that other people
may be cautious and hold back for the same reasons that you did.
:wink:

axel.

Axel,

It appears my situation may be unique - I’m in private industry and not able to access HPC other than via the cloud, as we don’t have our own internal resources. Suffice it to say, I think what I can do is get some ideas if I describe what I’m actually doing: an EC2 instance with 32 (actually, 36) vCPUs, coupled with 7 other 36-vCPU instances, all with Ubuntu 14.04.4 LTS (kernel 3.13.0-36-generic). I am using the apt-get lammps-daily package for the LAMMPS build, which includes OMP. With that information (and I can provide more), I believe extrapolations to other similar clusters running Ubuntu 14.04.4 LTS can be made.

The systems I am simulating are drug/polymer systems, each with anywhere from 90K to 200K atoms. I am using the MMFF94s all-atom force field. I was able to use class2 for bonds and angles, OPLS for dihedrals, and harmonic for impropers, which essentially reproduces MMFF exactly - but what LAMMPS doesn’t have is the same LJ potential and coulombic functions as are used in MMFF. Still, lj/cut/coul/long is doing well, and I’m not all that concerned with this (yet). I am also generating partial atomic charges for my systems with a third-party package.
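In input-script terms, that combination corresponds to style lines along the following (the cutoff and PPPM accuracy shown are placeholders, not my actual values):

```
units          real
atom_style     full
bond_style     class2
angle_style    class2
dihedral_style opls
improper_style harmonic
pair_style     lj/cut/coul/long 10.0   # placeholder cutoff
kspace_style   pppm 1.0e-4             # placeholder accuracy
```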

For my systems, an example drug molecule would be nifedipine. An example polymer would be poly-vinyl pyrrolidone, and I am using a truncated strand with 16 monomers (a point of discussion for another time and place).

Anyway, my setup is to use the 8 instances (each with 32 vCPUs) in a placement group on AWS (so they are theoretically in close physical proximity in the data center). The network speed is 10 Gbps and I’m using the Intel 82599 VF interface for enhanced networking.

How I run LAMMPS:

In my input file, I am using package omp 8

I run through NVE Langevin initialization, NVE randomization, then NPT cooling, equilibration and production. A production fix is as follows:

fix 1 all npt temp 300.0 300.0 100.0 aniso 1.0 1.0 1000.0 drag 0.2

My command line is: mpirun -np 32 --hostfile hostfile lammps-daily -in in.file &

the hostfile is what you would expect; it just lists localhost and the seven other instances where the tasks are to be run - example here:

localhost slots=4 max_slots=4
172.31.7.215 slots=4 max_slots=4
172.31.13.138 slots=4 max_slots=4
172.31.1.0 slots=4 max_slots=4
172.31.3.252 slots=4 max_slots=4
172.31.8.248 slots=4 max_slots=4
172.31.13.79 slots=4 max_slots=4
172.31.8.250 slots=4 max_slots=4

With this setup, I am getting about 3.9-4.1 ns/day/100k atoms

I’ve tried different OMP, -np and slots settings, and this seems to be the best I can do.
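For completeness, an equivalent launch with the thread count and the USER-OMP suffix made explicit on the command line would look roughly like this (OpenMPI syntax assumed; -sf/-pk are the standard LAMMPS switches):

```sh
# Hypothetical variant of the launch above: 32 ranks across the
# hostfile, 8 OpenMP threads each, matching "package omp 8".
export OMP_NUM_THREADS=8
mpirun -np 32 --hostfile hostfile -x OMP_NUM_THREADS \
       lammps-daily -sf omp -pk omp 8 -in in.file
```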

I also tried using the GPU package, but found it to be slower (which I honestly attribute to my getting something wrong, either in how I compiled or in how I am using AWS GPU systems; I have admittedly not done a lot of experimentation there so far).

Am I seeing reasonable times, or should a 256 processor cluster do better on 100K atoms with OMP?

Below is a sample of my output after a run of 100K timesteps:

Loop time of 3102.45 on 256 procs for 100000 steps with 145750 atoms
Performance: 2.785 ns/day, 8.618 hours/ns, 32.233 timesteps/s

609.8% CPU use with 32 MPI tasks x 8 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
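(For reference, the Performance line follows arithmetically from the loop time and step count; this sketch reproduces the quoted figures, assuming the 1 fs default timestep for real units:)

```python
# Reproduce LAMMPS's "Performance:" line from the "Loop time" line.
# Assumes a 1 fs timestep (the default for "units real"), which matches
# the figures quoted above.
def performance(loop_time_s, steps, timestep_fs=1.0):
    sim_ns = steps * timestep_fs * 1e-6          # simulated time in ns
    ns_per_day = sim_ns * 86400.0 / loop_time_s  # per wall-clock day
    hours_per_ns = loop_time_s / 3600.0 / sim_ns
    steps_per_s = steps / loop_time_s
    return ns_per_day, hours_per_ns, steps_per_s

ns_day, h_ns, sps = performance(3102.45, 100000)
print(f"{ns_day:.3f} ns/day, {h_ns:.3f} hours/ns, {sps:.3f} timesteps/s")
# -> 2.785 ns/day, 8.618 hours/ns, 32.233 timesteps/s
```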

You seem to be getting several million timesteps a day on 256 cores, which is alright (although some way from terrific) in my experience using HPC. What exactly is the timescale you’re interested in for your drug simulation? Your potential might be complicated, which, as you noted, occupies the most time.

Hi Adrian,

Thanks for the information - I’m not sure what to expect, so if the consensus is that I’m actually getting pretty good times, then I think I’m OK with that. Spot instances on AWS are actually pretty cheap. What I did find is that when I went to 512 CPUs, the timing did not improve linearly (I guess that’s to be expected) and Comm time went up. In fact, it started to become far less cost-effective given the non-linear speedup.

Matt

256 may be fine; it all depends on how many timesteps you think you’ll need to simulate your phenomenon. I wouldn’t wait too long if you really need those results and aren’t 100 percent sure about the setup.

matt,

please see some comments following the respective paragraphs and
statements below.

Axel,

It appears my situation may be unique - I'm in private industry and not
able to access HPC other than via the cloud, as we don't have our own
internal resources. Suffice it to say, I think what I can do is get some
ideas if I describe what I'm actually doing: an EC2 instance with 32
(actually, 36) vCPUs, coupled with 7 other 36-vCPU instances, all with
Ubuntu 14.04.4 LTS (kernel 3.13.0-36-generic). I am using the apt-get
lammps-daily package for the LAMMPS build, which includes OMP. With that
information (and I can provide more), I believe extrapolations to other
similar clusters running Ubuntu 14.04.4 LTS can be made.

agreed. this is a fairly common setup. the major difference from a
dedicated HPC cluster is the network used by MPI.

The systems I am simulating are drug/polymer systems, each with anywhere
from 90K to 200K atoms. I am using the MMFF94s all-atom force field. I
was able to use class2 for bonds and angles, OPLS for dihedrals, and
harmonic for impropers, which essentially reproduces MMFF exactly - but
what LAMMPS doesn't have is the same LJ potential and coulombic functions
as are used in MMFF. Still, lj/cut/coul/long is doing well, and I'm not
all that concerned with this (yet). I am also generating partial atomic
charges for my systems with a third-party package.

For my systems, an example drug molecule would be nifedipine. An example
polymer would be poly-vinyl pyrrolidone, and I am using a truncated strand
with 16 monomers (a point of discussion for another time and place).

yeah. this is basically similar to any atomic scale molecular system
with long-range electrostatics.

Anyway, my setup is to use the 8 instances (each with 32 vCPUs) in a
placement group on AWS (so they are theoretically in close physical
proximity in the data center). The network speed is 10 Gbps and I'm using
the Intel 82599 VF interface for enhanced networking.

proximity matters more for bandwidth; where you are limited with
10GigE is latency.

How I run LAMMPS:

In my input file, I am using package OMP 8

I run through NVE Langevin initialization, NVE randomization, then NPT
cooling, equilibration and production. A production fix is as follows:

fix 1 all npt temp 300.0 300.0 100.0 aniso 1.0 1.0 1000.0 drag 0.2

My command line is: mpirun -np 32 --hostfile hostfile lammps-daily -in
in.file &

the hostfile is what you would expect, it just lists the localhost and seven
other instances where the tasks are to be run - example here:

localhost slots=4 max_slots=4
172.31.7.215 slots=4 max_slots=4
172.31.13.138 slots=4 max_slots=4
172.31.1.0 slots=4 max_slots=4
172.31.3.252 slots=4 max_slots=4
172.31.8.248 slots=4 max_slots=4
172.31.13.79 slots=4 max_slots=4
172.31.8.250 slots=4 max_slots=4

With this setup, I am getting about 3.9-4.1 ns/day/100k atoms

I've tried different OMP, -np and slots settings, and this seems to be the
best I can do.

I also tried using the GPU package, but found it to be slower (which I
honestly attribute to my getting something wrong, either in how I
compiled or in how I am using AWS GPU systems; I have admittedly not done
a lot of experimentation there so far).

Am I seeing reasonable times, or should a 256 processor cluster do better on
100K atoms with OMP?

i think that what you describe is quite decent for the environment you
are using. i would have expected worse.

Below is a sample of my output after a run of 100K timesteps:

Loop time of 3102.45 on 256 procs for 100000 steps with 145750 atoms
Performance: 2.785 ns/day, 8.618 hours/ns, 32.233 timesteps/s

609.8% CPU use with 32 MPI tasks x 8 OpenMP threads

MPI task timing breakdown:
Section | min time | avg time | max time |%varavg| %total
---------------------------------------------------------------
Pair | 1476.3 | 1525.5 | 1560.1 | 58.1 | 49.17
Bond | 129.1 | 132.55 | 136.89 | 16.2 | 4.27
Kspace | 887.21 | 927.4 | 971.73 | 76.0 | 29.89
Neigh | 77.959 | 78.574 | 78.87 | 3.4 | 2.53
Comm | 247 | 258.9 | 272.16 | 38.8 | 8.35
Output | 8.8807 | 8.8829 | 8.8897 | 0.1 | 0.29
Modify | 143.73 | 151.06 | 160.52 | 27.4 | 4.87
Other | | 19.59 | | | 0.63

Nlocal: 4554.69 ave 4657 max 4449 min
Histogram: 1 1 2 7 6 5 4 3 2 1
Nghost: 16507.5 ave 16640 max 16246 min
Histogram: 1 0 1 1 5 4 5 3 7 5
Neighs: 1.56464e+06 ave 1.61154e+06 max 1.5159e+06 min
Histogram: 2 5 3 2 1 5 3 4 5 2

Total # of neighbors = 50068548
Ave neighs/atom = 343.523
Ave special neighs/atom = 12
Neighbor list builds = 3582
Dangerous builds = 0

as you can see, the majority of the time is actually spent in Pair and
Kspace computations, but Comm is not insignificant.

yes. i think your assessment of the situation and your strategy to
obtain good performance is right.

as you can see from the "600%" CPU use with 8 OpenMP threads, the
OpenMP parallel efficiency is not great (it is best with 2-4 threads),
but using more MPI tasks and fewer OpenMP threads will overload your
network interfaces more and thus drive up "Comm" significantly. this
will also become a problem if you try to use more instances at the
same time. this is where the infiniband (or better) network of
dedicated HPC clusters shines: you can push much further toward using
more nodes, and thus more MPI ranks and more CPU cores in total,
without losing much parallel efficiency.

i think there are two major options that you can play with, in order
to optimize your performance:

- coulomb cutoff. increasing the coulomb cutoff will increase the time
in Pair but reduce the time in Kspace. Pair has worse algorithmic
scaling, but multi-threads very well and needs very little
communication, while Kspace requires (much) more communication and
doesn't benefit from multi-threading as much as Pair. since you need
to use a larger number of threads to avoid network contention,
tweaking this might help.

- neighbor list skin distance and overall neighbor list settings.
neighbor list builds also require some communication, but the skin
distance also controls the ghost atom cutoff and the communication
cutoff (unless that is set explicitly). increasing the skin distance
reduces the number of neighbor list builds, but yours are already
fairly infrequent (every 28 steps on average). reducing the skin
parameter will trigger more neighbor list builds, but will also speed
up communication of ghosts and Pair (fewer pairs to check the distance
of). as with the coulomb cutoff, there is an optimum, but due to
having a higher-latency network, your optimum may differ from the
default settings, and some small tweaks may yield some improvement.
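in input-script terms, the two knobs above are (values here are
illustrative starting points, not recommendations):

```
# coulomb cutoff: a longer cutoff shifts work from Kspace
# (communication-heavy) into Pair (threads well, communicates little).
pair_style   lj/cut/coul/long 12.0
kspace_style pppm 1.0e-4               # keep accuracy fixed while tuning

# skin distance: a smaller skin shrinks the ghost/communication cutoff
# but triggers more frequent rebuilds ("units real" default is 2.0).
neighbor     1.5 bin
neigh_modify delay 0 every 1 check yes
```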

there are some other minor tweaks that might help as well, but since
you already did the MPI vs. OpenMP balancing, these are the remaining
obvious changes.

axel.

Hi Adrian,

Thanks for the information - I'm not sure what to expect, so if the
consensus is that I'm actually getting pretty good times, then I think
I'm OK with that. Spot instances on AWS are actually pretty cheap.

yeah, they become more expensive for academics when universities
collect overhead on them, as they usually do on "services".

What I did
find is that when I went to 512 cpus, the timing did not improve linearly (I
guess that's to be expected) and Comm time went up. In fact, that started to
become far less cost effective given the non-linear speedups.

yup. i think you are straddling the scaling limits of MPI with TCP/IP
networking. as mentioned before, i expected this to be worse, but it
seems that using OpenMP helps here.
if you want to push further, you may need to resort to more complex
tweaks, like using verlet/split to single out Kspace to run on a
separate partition with fewer processors (and thus less communication
overhead).
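as an illustration, a verlet/split launch might look like this (the
24:8 split is hypothetical; the ratio of the first partition to the
second must be an integer, here 3:1):

```sh
# input file needs: run_style verlet/split
# two partitions: 24 ranks for Pair/bonded work, 8 ranks for Kspace.
mpirun -np 32 --hostfile hostfile lammps-daily \
       -partition 24 8 -in in.file
```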

if your overall throughput is acceptable and your system sizes remain
within the same range as your current system, then there is little
else you can do, short of starting a formal collaboration with a
research group in academia. but that will likely cost you more (e.g.
funding a grad student or - more likely - a post doc or two) in
exchange for indirect access to (dedicated) academic HPC resources.

mind you, there are some national supercomputing centers (e.g. NCSA)
that have a long tradition of industry collaborations, but - again - i
don't think that will be as competitive as what you get out of
low-priority amazon instances. a quick google search also finds this:
https://www.xsede.org/industry-challenge-program

i've actually been part of a collaboration with P&G for a while and we
had a very large scale INCITE grant (side note: it was this
collaboration that motivated implementing the first version of the
USER-OMP package, as we needed to scale better to be competitive), but
- again - that also led to P&G funding a postdoc position here at
Temple.

axel.

Axel (and Adrian),

Thanks for the very helpful insights. I will look at implementing these ideas.

One other thought that I just had - I believe that the apt-get lammps-daily build is compiled with double precision set for the fftw3 library. I may be wrong about that - in any event, would you guys expect a large difference in kspace time between single and double precision? I actually attempted to find this out on my own, but got bogged down in “compilation space”, if you will, and decided I needed to get results for my projects, so I put it on the back burner for now.
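For reference, my understanding is that in a conventional make-based LAMMPS build, single-precision FFTs would be selected in the machine Makefile roughly as below - though I haven’t verified whether the packaged lammps-daily build can be reconfigured this way:

```
# src/MAKE/Makefile.<machine> - FFT section
FFT_INC  = -DFFT_SINGLE      # single-precision FFTs
FFT_PATH =
FFT_LIB  = -lfftw3f          # single-precision FFTW3 library
```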

Matt


No, I don’t think there will be a big difference from switching to single precision FFTs.

Axel

Hi all,

In addition to all the suggestions that Axel made to further increase the
performance, one other important point in tuning a hybrid MPI+OpenMP
simulation, as Axel once pointed out to me
(http://lammps.sandia.gov/threads/msg56491.html) and I benefited from it
big time, is how the MPI library that you're using handles processor
affinities; how you map your MPI ranks within each node can also affect
your performance. Basically, on the cluster that I'm running on I was
getting the architecture details of the CPU using 'lstopo' and then
employing mapping and affinity settings to get the best performance. For
example, I was running on nodes with 2 sockets of 10-core CPUs, and after
some experimentation I found that using 4 MPI ranks, each with 5 threads,
was giving the best performance. To set these you have to pass specific
flags for the MPI library that you're using, e.g. for MVAPICH2-2.0 it was
--map-by ppr:2:socket:pe=5 -bind-to core but for MVAPICH2-2.2b it is
-ppn 4 -bind-to core:5; there are also different flags for openmpi, mpich
and impi.

I just wanted to mention this as I had used it before and saw a positive
performance effect from setting them. Axel can elaborate more on that and
give better advice.

Best,
Kasra.
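For openmpi, I believe the equivalent of that mapping (2 ranks per
socket, 5 cores each, bound to cores) would be something like the
following; the binary and input file names are placeholders:

```sh
export OMP_NUM_THREADS=5
mpirun --map-by ppr:2:socket:pe=5 --bind-to core \
       -x OMP_NUM_THREADS lmp -sf omp -pk omp 5 -in in.file
```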

i deliberately didn't mention this, since the situation when running
on virtual machines vs. physical hardware is a bit different and much
more complicated. the main issue is: setting processor/memory affinity
wrong will massively degrade performance, while getting it right will
give some improvement. so this is a high-risk, medium-to-small-reward
item. how much improvement is possible depends on how the virtual
machines themselves are configured and how much of the hardware
architecture is exposed. that adds a layer of complexity to an already
quite complex matter.

axel.

for the sake of completeness: ubuntu uses openmpi as default MPI
library and so does anton gladky's PPA, if i remember correctly.

Yes it does, at least as per the LAMMPS download page:

http://lammps.sandia.gov/download.html#ubuntu

Contact Dr. Renier Dreyer ([email protected]…6273…). He has LAMMPS installed on CrunchYard and may be able to provide you with some information regarding your questions.

Jim

James Kress Ph.D., President

The KressWorks® Institute

An IRS-Approved 501(c)(3) Charitable, Nonprofit Corporation

“Engineering The Cure” ©

(248) 573-5499

Learn More and Donate At:

Website: http://www.kressworks.org

Facebook: https://www.facebook.com/KressWorks.Institute/

Twitter: @KressWorksFnd
