[lammps-users] gigabit ethernet and multi-core performance

Hi lammps users,

I'm building a cheap linux cluster that will be primarily for running MD
simulations with lammps. The system I'm looking at is a set of dual
quad-core machines connected with gigabit ethernet. Does anyone have any
experience with this kind of system and running lammps?

My impression is that the multi-core processors are great until you start to
choke the shared memory bus, but it doesn't seem like MD is really memory
intensive (as opposed to say, calculating vibrational frequencies ab initio).
So even with the multiple cores you'll still be limited by the number of
interconnects, is that right? Following that, does anybody know how many
nodes you can connect over gigabit ethernet and still get good scaling?

Thanks a lot,
Andrew Stack

Andrew,

I have built clusters in the past, but they were dual processor single and
dual core. The main problem is that most of your processes will want to
talk to their neighbors. This means that the more processes run on one
machine, the more data has to go over your ethernet connection. I think you
still will be OK. It might not scale as nicely as a single-processor,
single-core machine as far as external communication is concerned, but the
flip side is that you will have more processes communicating internally,
which is typically faster.
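
To make that concrete, here is a rough Python sketch (assuming a simple
row-major rank ordering, ranks packed onto nodes in consecutive blocks, and
no periodic boundaries) that counts how many nearest-neighbor exchanges in a
3D domain decomposition stay inside a node versus having to cross the network:

def rank(ix, iy, iz, px, py, pz):
    # row-major mapping of grid coordinates to an MPI-style rank id
    return (ix * py + iy) * pz + iz

def count_exchanges(px, py, pz, ranks_per_node):
    on_node = off_node = 0
    for ix in range(px):
        for iy in range(py):
            for iz in range(pz):
                me = rank(ix, iy, iz, px, py, pz)
                # look only in the +x/+y/+z directions so each pair counts once
                for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    jx, jy, jz = ix + dx, iy + dy, iz + dz
                    if jx < px and jy < py and jz < pz:
                        other = rank(jx, jy, jz, px, py, pz)
                        if me // ranks_per_node == other // ranks_per_node:
                            on_node += 1
                        else:
                            off_node += 1
    return on_node, off_node

# e.g. 64 ranks as a 4x4x4 grid on dual quad-core nodes (8 ranks per node)
print(count_exchanges(4, 4, 4, 8))

With more ranks per node a larger share of the exchanges stays on-node, but
everything that does go off-node has to share the single ethernet link.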

Pieter

I've benchmarked LAMMPS on a dual quad-core desktop (8 procs total),
a Dell 690, and it does fine. About a 6x speed-up on 8 procs for
the standard LAMMPS benchmarks. You take a bit of a hit when using
all 8 procs due to competing with the OS (Red Hat). I don't know
how it would perform on multiple such boxes connected via Ethernet,
as that combination is very fast procs with a slow link.
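
As a minimal sketch (the executable name lmp_mpi, the use of mpirun, and the
in.lj benchmark input are assumptions; adjust them to your build), such a
speed-up measurement can be scripted like this:

import re
import subprocess

def loop_time(nprocs, infile="in.lj"):
    # run one of the standard LAMMPS benchmark inputs on nprocs processes
    out = subprocess.run(
        ["mpirun", "-np", str(nprocs), "lmp_mpi", "-in", infile],
        capture_output=True, text=True, check=True).stdout
    # LAMMPS reports a line like "Loop time of 12.3 on 8 procs ..."
    return float(re.search(r"Loop time of ([0-9.eE+-]+)", out).group(1))

t1 = loop_time(1)
for n in (2, 4, 8):
    tn = loop_time(n)
    print("%d procs: %.2f s, speed-up %.2fx" % (n, tn, t1 / tn))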

Steve

Hi all.

FYI

Regarding multicore machines: I get much better results going from 1 core to 8 cores on an E53xx 2x4 Xeon system (only about a 10% performance hit going to all 8 cores) than on some 2x2 Opteron systems, where I get almost a 50% performance hit going from 1 to all 4 cores.

Regarding network: I've benchmarked in.flow.couette (with various system sizes) on a 2x2 GigE and infiniband grid. While the infiniband performance hit is tiny, the GigE performance hit is only in the ballpark of 20%, which didn't justify the cost of infiniband for us.

--Craig

hi, here's my take on LAMMPS performance vs. network and CPUs.

first off, findings are very system-specific and depend on
the typical problem size and the expected total performance.
general statements about which cpu and/or network is better
do not transfer easily.

generally, (classical) md needs a lot of communication of
small(er) data sets, so network latency has much more
impact than bandwidth. also, to zeroth order one can assume
that the impact of latency scales roughly as sqrt(n)
with the total number of nodes, i.e. when you quadruple your
nodes, the (average) "effective latency" doubles. so gigabit
ethernet, with its high latency due to the TCP overhead, will
run out of scaling very quickly. when using multiple cores per
node, the impact of latency rises further, due to the enforced
serialization at the network interface. this can be minimized
by latency hiding and dynamic load balancing in the way that,
e.g., NAMD does it.
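
to illustrate the argument, a toy strong-scaling model in python
(all numbers are made-up assumptions, only the trend matters):

from math import sqrt

def step_time(nodes, serial_compute=1.0, latency=50e-6, msgs_per_step=40):
    # perfectly parallel compute part plus a latency term that grows
    # roughly as sqrt(nodes), as argued above
    return serial_compute / nodes + msgs_per_step * latency * sqrt(nodes)

for name, lat in (("gigabit ethernet", 50e-6), ("infiniband", 5e-6)):
    t1 = step_time(1, latency=lat)
    speedups = [round(t1 / step_time(n, latency=lat), 1)
                for n in (1, 2, 4, 8, 16, 32, 64)]
    print(name, speedups)

with the same (made-up) compute cost, the high-latency network stops
gaining much beyond a few tens of nodes, while the low-latency one
keeps scaling.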

20% impact for two nodes is already a lot if you want to scale
to tens of nodes; if not, you are fine. the impact of high latency
is mitigated by larger system sizes (bandwidth then becomes more
important, and there gigabit is not as bad). but then again, for
larger systems you need longer trajectories, so you would actually
need to run on even more nodes to get the trajectory in a reasonable
time, which would make blowing up the system size to get better
efficiency a bit useless.

finally, if you want to scale to thousands of nodes, you may
need something even better than infiniband (at that level, there
are differences between different models of infiniband cards),
like the torus networks of cray xt3/4 or BG/L. for instance, i
can run a 30k "atom" coarse-grained MD system on up to 2048 cpus
on a BG/L and still see a speedup when using more cpus.
that is about 15 atoms/cpu! (ok. no long-range electrostatics...).

coming to cpus, one has to pay a lot of attention to
memory architecture and cache size. for few nodes, front side
bus speed and memory speed usually matter most; for more
nodes, when the memory use per node goes down, cache
size becomes more important. on opteron machines one
has to make sure that jobs use only memory that is physically
attached to a given cpu to get the best performance. you usually
have to use a NUMA and/or processor affinity library to make
the best use of multi-core cpus. on intel quad core, it can
be faster to not use all cores, as that effectively doubles
the cache. i found this especially true when trying to scale
to a very large number of nodes, i.e. trying to push a job
very fast. to give a ballpark figure, i can run the 30k CG
system from above with up to 128 nodes of dual intel quad
core (using only half the cores) and get a throughput of
about 290ns/day at a 5fs time step (BG/L is at 160ns/day,
due to the much slower cpu). on the cray xt3 probably even
faster runs might be possible (haven't had a chance to try),
due to the superior network and fast cpus. but that is at
the very extreme limit of what can be done.
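
as a minimal sketch of the affinity point (assuming linux and python >= 3.3;
the core numbering, the executable name lmp_mpi and the input file are
placeholders, and in practice you would rather pin each MPI rank via numactl,
taskset or your MPI's binding options):

import os
import subprocess

all_cores = sorted(os.sched_getaffinity(0))   # cores this process may use
every_other = all_cores[::2]                  # e.g. 0,2,4,6 on an 8-core box
os.sched_setaffinity(0, every_other)          # child processes inherit the mask

# hypothetical launch; which cores share a cache differs between machines,
# so check /proc/cpuinfo before picking the core list above
subprocess.run(["lmp_mpi", "-in", "in.lj"], check=True)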

...and to add insult to injury, in recent tests i also
noticed that there are significant differences in the performance
and reliability of the MPI libraries available for infiniband
and similar networks.

so what is the take-home message?
you have to know what you want to do and how fast before
selecting a network and/or cpu, and then run benchmarks
with typical setups to determine the "sweet spot".

hope this helps,
   axel.