hi here's my take on LAMMPS performance vs. network and CPUs.
first off, findings are very system specific and depend on
the typical problem set size and the expected total performance.
general statements about which cpu and/or network is better
are not easily transferable.
generally, (classical) md needs a lot of communication of
small(er) data sets, so latency of a network has much more
impact than bandwidth. also, to a zeroth-order approximation,
one can assume that the impact of latency scales roughly as sqrt(n)
with the total number of nodes, i.e. when you quadruple your
nodes, the (average) "effective latency" doubles. so gigabit
ethernet, with its high latency due to the TCP overhead, will
stop scaling very quickly. when using multiple cores per node, the
impact of latency will rise, due to the enforced serialization
at the network interface. this can be minimized by using latency
hiding and dynamic load balancing in the way that, e.g., NAMD does.
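as a toy illustration of that sqrt(n) rule of thumb, here's the arithmetic in python (the base latencies are made-up but order-of-magnitude plausible, nothing here is measured):

```python
import math

def effective_latency(base_latency_us, nodes):
    """rule of thumb from above: effective latency grows ~sqrt(nodes),
    so quadrupling the node count doubles the effective latency."""
    return base_latency_us * math.sqrt(nodes)

# made-up base latencies, roughly the right order of magnitude:
# gigabit ethernet over TCP ~50us, infiniband ~2us
for nodes in (4, 16, 64):
    gige = effective_latency(50.0, nodes)
    ib = effective_latency(2.0, nodes)
    print(f"{nodes:3d} nodes: gige ~{gige:.0f}us, ib ~{ib:.0f}us")
```

the absolute numbers don't matter, only the trend: the gap between a low- and high-latency network widens as you add nodes.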
a 20% impact for two nodes is already a lot if you want to scale
to tens of nodes. if not, you are fine. the impact of high latency
is reduced for larger systems (bandwidth then becomes more
important, and there gigabit is not as bad). but then again, for
larger systems you need longer trajectories, so you would actually
need to run on even more nodes to get the trajectory in a reasonable
time, which would make blowing up the system size to get better
efficiency a bit pointless.
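to make that tradeoff concrete, here is a toy estimate (all numbers are invented and communication cost is ignored, this is just the bookkeeping):

```python
# toy model: wall time ~ steps * atoms / nodes, communication ignored.
# the point: a bigger system improves atoms/cpu efficiency, but if you
# also need a longer trajectory for the same quality of statistics, you
# end up needing more nodes anyway to finish in the same wall time.

def wall_time_days(atoms, nodes, traj_ns, timestep_fs=2.0,
                   cost_per_atom_step_us=1.0):
    steps = traj_ns * 1e6 / timestep_fs               # ns -> fs -> steps
    seconds = steps * atoms * cost_per_atom_step_us * 1e-6 / nodes
    return seconds / 86400.0

base = wall_time_days(atoms=32_000, nodes=16, traj_ns=100)
# 8x the atoms and (say) 2x the trajectory, on the same 16 nodes:
big = wall_time_days(atoms=256_000, nodes=16, traj_ns=200)
print(f"small: {base:.1f} days, big: {big:.1f} days "
      f"-> need {big / base:.0f}x the nodes for the same turnaround")
```

so the efficiency you gain per cpu is eaten up by the extra nodes you need to get the longer trajectory done.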
finally, if you want to scale to thousands of nodes, you may
need something even better than infiniband (at that level, there
are differences between different models of infiniband cards),
like the torus networks of cray xt3/4 or BG/L. for instance, i
can run a 30k "atom" coarse-grained MD system on up to 2048 cpus
on a BG/L and still see a speedup when using more cpus.
that is about 15 atoms/cpu! (ok. no long-range electrostatics...).
coming to cpus, there one has to pay a lot of attention to
memory architecture and cache size. for a few nodes, front side
bus and memory speed usually matter most; for more nodes, when
the memory use per node goes down, the cache size becomes more
important. on opteron machines one
has to make sure that jobs use only memory that is physically
attached to a given cpu to get the best performance. you usually
have to use a NUMA and/or processor affinity library (or tools
like numactl and taskset) to make the best use of multi-core
cpus. on intel quad cores, it can be faster to not use all cores,
as that effectively doubles the cache available per mpi task. i
found this especially true when trying to scale
to a very large number of nodes, i.e. trying to push a job
very fast. to give a ballpark figure, i can run the 30k CG
system from above with up to 128 nodes of dual intel quad
core (using only half the cores) and get a throughput of
about 290ns/day at a 5fs time step (BG/L is at 160ns/day,
due to the much slower cpu). on the cray xt3 probably even
faster runs might be possible (haven't had a chance to try),
due to the superior network and fast cpus. but that is at
the very extreme limit of what can be done.
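just to put that 290ns/day number in perspective, here is the arithmetic (python only for convenience, nothing LAMMPS specific):

```python
# sanity check on the throughput figure above: how many MD steps per
# second does 290 ns/day at a 5 fs timestep correspond to?
ns_per_day = 290.0
timestep_fs = 5.0

steps_per_day = ns_per_day * 1e6 / timestep_fs    # 1 ns = 1e6 fs
steps_per_sec = steps_per_day / 86400.0
print(f"{steps_per_day:.2e} steps/day = {steps_per_sec:.0f} steps/s")
# roughly 670 timesteps per second across 128 nodes -- at that rate,
# the per-step communication latency completely dominates the wall time.
```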
...and to add insult to injury, in recent tests i also
noticed that there are significant differences in the performance
and reliability of the MPI libraries available for networks
like infiniband and the like.
so what is the take-home message?
you have to know what you want to do and how fast, before
selecting a network and/or cpu, and then run benchmarks
with typical setups to determine the "sweet spot".
hope this helps,