What hardware to get? Inexperienced questions!

Hello everyone,

If you needed patience to answer my questions before, this one is going to push you close to the edge of it.

We’ve just gotten ~$10k to renew the computer equipment for our everyday work. Of course, I’m not asking you to make me a list of things, but any guidelines on what would be enough, what would be too much, etc., will be welcome.

We work on molecular dynamics, but with no internal structure at all: just a bunch of point particles interacting with each other through rather complicated potentials (hence tabulated potentials are quite handy).

Lately our computations have been of ~20k atoms, for 4M Verlet steps, although we’ve been thinking about studying larger systems, up to ~100k.

My basic question is: would you recommend focusing on GPU or CPU compute power for this? If it’s the former, as I suspect, how important is the CPU configuration for exploiting the capabilities of the GPU?

Again, I’m not asking for anything too specific; I can look at benchmarks to decide between different CPUs and GPUs, but a general overview, or any literature you can recommend, will be welcome.

Thanks! This list has been really helpful :-)

Pablo

Pablo,

I’m sure Axel will dump a lot of tech specs on you. My only advice is on another front. Today you have $10K. That’s good. Yet the main question, at least from my point of view, is: what do you see you/your group doing in the future? If you are not the PI but just a student, pass this question on to the person in charge. The best strategy for investing money in any field (not only in computing) is to gauge your future needs and goals a bit. That is the key to really advancing your portfolio.

Carlos

Hello everyone,

If you needed patience to answer my questions before, this one is going
to push you close to the edge of it.

We've just gotten ~$10k to renew the computer equipment for our everyday
work. Of course, I'm not asking you to make me a list of things, but any
guidelines on what would be enough, what would be too much, etc., will be
welcome.

that is difficult to say. you also didn't say how the money is
supposed to be spent. to be blunt, 10k is not a lot, so it is really
not worth agonizing too much over it. you can get the equivalent of
much more money in compute power by applying to one of the national
supercomputing centers or organizations. it takes some paper pushing,
but that is usually time well spent. *and* you get to run on hardware
that you couldn't afford otherwise (-> run big jobs fast).

so there is advice #1

second, you have to decide how much time and effort you want to spend
on maintaining this equipment. if not a lot, the best option is to get
a 4-way 16- or 12-core AMD opteron machine. those are quite affordable.
the equivalent intel CPU based machines are unaffordable, and while
two-way intel based machines are much better per CPU core, you need a
high-speed network to make good use of them for big calculations. if
you don't need a lot of RAM, you may even be able to squeeze two of
those 4-way monsters into your budget; each of them is like a cluster
in itself, but you can install and run it like a desktop. low
maintenance effort is a priority for a lot of people...

so there is advice #2

third, if you are willing to go crazy and want the most bang for the
buck, then GPUs are likely the way to go, but since reasonably
future-proof Tesla equipment is so incredibly expensive, you'd be best
off with some desktop gear stuck into a single-processor desktop
machine. from the specs, it looks like the fastest desktop GPU for MD
is currently the GeForce TITAN; it is rumored to have good double
precision floating point performance (you'd need to do a real
benchmark to know for certain), so it might be worth it, but you'd
need a recent intel CPU with PCIe 3.0 to take good advantage of it.
GPUs need to generate a lot of threads to be efficient, so you need a
lot of atoms (like 20,000 - 100,000) to take full advantage of that,
and your budget really doesn't allow for a good high-speed network, so
you are limited by what you can run on a single node.
in short, this is the best option if you're after high throughput and
don't need a lot of flexibility.

so there is advice #3

finally, there are a bunch of other options, but it all boils down to
how much work you are willing to invest, how flexible the machine(s)
need to be, and whether there is any option to do resource sharing
(the larger the pile of money, the better the deal). several places i
worked at even offered matching deals for that, i.e. you get to use
hardware worth twice your budget (or more, if other shareholders don't
use their shares well) and don't have to bother with maintenance.
you're giving up your freedom, though, so that is not everybody's
preferred choice.

lastly, advice #4: don't trust any vendor. most have no clue. they
only know how to sell stuff; they mostly repeat what they are told by
their suppliers and at best know how to install it, but know very
little about operating and using their own hardware.
they often try to sell overly high-end hardware and stuff that isn't
even available yet but is supposed to be the greatest thing since
sliced bread. you quickly realize it isn't. ...but that is something
you don't get to work out with the sales person, rather with the
support staff, who usually have no idea what impossible things the
sales person promised.

We work on molecular dynamics, but with no internal structure at all: just
a bunch of point particles interacting with each other through rather
complicated potentials (hence tabulated potentials are quite handy).

Lately our computations have been of ~20k atoms, for 4M Verlet steps,
although we've been thinking about studying larger systems, up to ~100k.

there is nothing that one can infer from such general terms.
performance characteristics can be very different for different
systems. it depends massively on the potentials and the particle
densities, for example.

My basic question is: would you recommend focusing on GPU or CPU compute
power for this? If it's the former, as I suspect, how important is the CPU
configuration for exploiting the capabilities of the GPU?

if you get desktops, you get the GPU (almost) for free (just get a
beefier one, that will help with visualization, too),
but watch out for heat and power.

Again, I'm not asking for anything too specific; I can look at benchmarks
to decide between different CPUs and GPUs, but a general overview, or any
literature you can recommend, will be welcome.

never trust any benchmarks that you have not done yourself, and don't
trust what other people publish. some you can trust, but it takes a
lot of effort to find out which ones. especially around GPUs it has
become common practice to present a very skewed view, since there is
so much difference in the ways you can deploy them. oftentimes people
simply compare apples with oranges; they almost feel compelled to do
it, since the initial numbers years ago on selected compute kernels
promoted the expectation of having everything run 100x faster. 10-20x
per GPU vs. a single CPU core is much more realistic, which means
there is not so much left nowadays, as the number of cores per CPU
keeps growing.

axel.

Your $10K machine will deliver approximately 100K core hours of
computer time a year. You could spend about 10 hours writing a solid
proposal for a small supercomputing allocation and get yourself well
over 1M core hours a year for no cost whatsoever through various
programs that are open to researchers around the world (e.g. XSEDE,
INCITE, PRACE). The only reason you shouldn't do this is if your
science is not going to pass moderate peer review, in which case you
probably shouldn't do it anyways.

If you don't want to supercompute, you will probably get a better
value from a cloud service than your own machine because, unless
you're really good with scripting and don't take vacation, you'll not
get anywhere near 100% utilization. Additionally, you'll waste a week
installing everything. If you pay by the hour for cloud time, you'll
only pay for what you use and you won't have to deal with software
maintenance BS.

Jeff

If I had 10K, what would serve me best is a suite of
commercial compilers.
A Salute
Oscar G.

I have found this discussion very interesting since I am also in a position to make a decision on hardware.

I am actively exploring all options mentioned so far:

A. Build my own cluster (sooo much work for one lonely professor)

B. Buy into a cluster in a larger state university (less work but less flexibility)

C. Pay as you go with Amazon EC2

Does anyone have experience with cloud services like C? Direct experience with an alternative to Amazon? In choosing a cloud service I am looking at:

  1. Cost

  2. Capacity to expand and shrink the power of my cluster as my needs vary

  3. Ability to extract my data from the cloud (cost!)

  4. Ease of use

Can anyone else think of other considerations?

Oscar, I don’t know much about the advantages of commercial compilers vs. the open-source ones. Would one expect them to be much faster? Are they a worthwhile investment?

-s-

Your $10K machine will deliver approximately 100K core hours of
computer time a year. You could spend about 10 hours writing a solid
proposal for a small supercomputing allocation and get yourself well
over 1M core hours a year for no cost whatsoever through various
programs that are open to researchers around the world (e.g. XSEDE,
INCITE, PRACE). The only reason you shouldn't do this is if your
science is not going to pass moderate peer review, in which case you
probably shouldn't do it anyways.

If I had 10K, what would serve me best is a suite of
commercial compilers.

LLVM 3+ and GCC 4.8 both autovectorize for x86/SSE processors. I
seriously doubt Intel C++ beats them by much, if at all. People
should feel free to prove me wrong with data. I'll try it myself now.

If you want LAMMPS to run faster, take the force field kernel you use
the most, unroll the loops and insert some SSE/AVX/FMA intrinsics.
That should get you better performance than any optimizing compiler
and at the cost of only your time.
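To make that concrete, here is a minimal, hedged sketch of the idea: a toy
AVX kernel that evaluates the Lennard-Jones force factor for four pair
distances at once. This is illustrative code, not LAMMPS source, and it
leaves out the cutoff test, neighbor gathering and force accumulation.
Compile with something like g++ -std=c++11 -O2 -mavx.

#include <immintrin.h>
#include <cmath>
#include <cstdio>

int main()
{
  const double eps = 1.0, sig = 1.0;
  // LJ force factor: fpair = r6inv*(lj1*r6inv - lj2)*r2inv
  const double lj1 = 48.0 * eps * std::pow(sig, 12.0);
  const double lj2 = 24.0 * eps * std::pow(sig, 6.0);

  // squared distances for 4 neighbor pairs (toy data)
  alignas(32) double rsq[4] = {1.10, 1.20, 1.35, 1.50};
  alignas(32) double fpair[4];

  __m256d vrsq   = _mm256_load_pd(rsq);
  __m256d vr2inv = _mm256_div_pd(_mm256_set1_pd(1.0), vrsq);
  __m256d vr6inv = _mm256_mul_pd(vr2inv, _mm256_mul_pd(vr2inv, vr2inv));
  __m256d vflj   = _mm256_mul_pd(vr6inv,
                     _mm256_sub_pd(_mm256_mul_pd(_mm256_set1_pd(lj1), vr6inv),
                                   _mm256_set1_pd(lj2)));
  _mm256_store_pd(fpair, _mm256_mul_pd(vflj, vr2inv));

  for (int i = 0; i < 4; ++i)
    std::printf("pair %d: rsq = %.3f  fpair = %.6f\n", i, rsq[i], fpair[i]);
  return 0;
}

On FMA hardware the multiply and subtract fold into a single
_mm256_fmsub_pd; in a real kernel the hard part is gathering the positions
of scattered neighbor indices, which is where the data layout matters a
great deal.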

Jeff

[...]

LLVM 3+ and GCC 4.8 both autovectorize for x86/SSE processors. I
seriously doubt Intel C++ beats them by much, if at all. People
should feel free to prove me wrong with data. I'll try it myself now.

there is indeed not much of a difference between the gnu and intel C++
compilers. i haven't tried a recent clang/LLVM version; the older one
was a tiny bit slower. gnu enforces aliasing rules by default, intel
does not, so gnu has an advantage there unless you turn them on with
intel. the basic force loops in many of the LAMMPS force kernels are
effectively vectorization-proof due to the data layout (coordinates
and forces are stored as arrays of arrays; the /omp styles gain some
extra performance by casting that into an array-of-structures
construct, through some rather ugly type casting to a const reference).
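for illustration, a tiny sketch of that layout issue (illustrative only,
this is *not* the actual LAMMPS data structure or the actual cast the
/omp styles use): a double** "array of arrays" makes the compiler chase
one pointer per atom, but when the underlying storage is one contiguous
block, it can be reinterpreted as a flat array of small structs with
unit-stride access.

#include <cstdio>
#include <cstdlib>

struct dbl3 { double x, y, z; };   // made-up 3-vector struct

int main()
{
  const int nlocal = 4;

  // contiguous block plus a row-pointer table: the "array of arrays" layout
  double *data = (double *) std::malloc(3 * nlocal * sizeof(double));
  double **x   = (double **) std::malloc(nlocal * sizeof(double *));
  for (int i = 0; i < nlocal; ++i) x[i] = data + 3 * i;

  for (int i = 0; i < nlocal; ++i) {
    x[i][0] = 1.0 * i; x[i][1] = 2.0 * i; x[i][2] = 3.0 * i;
  }

  // the "ugly cast": reinterpret the same block as a flat array of structs,
  // so the inner loop sees one base pointer and a fixed stride
  const dbl3 *xs = (const dbl3 *) data;
  for (int i = 0; i < nlocal; ++i)
    std::printf("atom %d: %g %g %g\n", i, xs[i].x, xs[i].y, xs[i].z);

  std::free(x);
  std::free(data);
  return 0;
}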

the part where you *do* see a noticeable difference between intel and
gnu is in potentials that make heavy use of functions from libm. this
is because the intel compiler silently replaces calls to libm with
code from an internal math library. the functions in glibc are tuned
for accuracy, and for several releases the exponential functions in
particular (including pow()) were extremely expensive due to excessive
internal calls to enforce correct rounding. in any case, work is
underway to make this a non-issue:

https://svnweb.cern.ch/trac/vdt and (a bit more aggressive and in
plain C not C++) https://github.com/akohlmey/fastermath
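if you want to see what this costs with your own compiler and libc, a
quick-and-dirty timing sketch like the one below (purely illustrative,
single-threaded, no claim of rigor), compiled once with g++ -std=c++11 -O2
and once with the intel compiler, already makes the libm effect visible:

#include <chrono>
#include <cmath>
#include <cstdio>

int main()
{
  const int n = 10000000;
  double sum = 0.0;

  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) sum += std::exp(-1.0e-7 * i);
  auto t1 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) sum += std::pow(1.0 + 1.0e-7 * i, -6.0);
  auto t2 = std::chrono::steady_clock::now();

  auto ms = [](std::chrono::steady_clock::time_point a,
               std::chrono::steady_clock::time_point b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
  };
  // the checksum keeps the compiler from optimizing the loops away
  std::printf("exp(): %.1f ms   pow(): %.1f ms   (checksum %g)\n",
              ms(t0, t1), ms(t1, t2), sum);
  return 0;
}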

If you want LAMMPS to run faster, take the force field kernel you use
the most, unroll the loops and insert some SSE/AVX/FMA intrinsics.
That should get you better performance than any optimizing compiler
and at the cost of only your time.

several people have tried this; there are even some publications on
the subject. to make this work you need to change the data structures
and ideally also the neighbor list, so that you avoid pointer chasing
(which inhibits vectorization), unaligned accesses, and gather
operations.
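to illustrate the neighbor list part (again a toy sketch, not how LAMMPS
actually stores its lists): keeping each atom's neighbors in one flat,
padded array gives the inner loop contiguous, predictable memory access
instead of a pointer dereference per atom.

#include <cstdio>
#include <vector>

int main()
{
  const int natoms   = 3;
  const int maxneigh = 8;   // pad every atom's list to a fixed length

  // neighbors of atom i live in neigh[i*maxneigh .. i*maxneigh+numneigh[i]-1]
  std::vector<int> neigh(natoms * maxneigh, -1);
  std::vector<int> numneigh(natoms, 0);

  auto add = [&](int i, int j) { neigh[i * maxneigh + numneigh[i]++] = j; };
  add(0, 1); add(0, 2); add(1, 0); add(1, 2); add(2, 0); add(2, 1);

  for (int i = 0; i < natoms; ++i) {
    const int *jlist = &neigh[i * maxneigh];   // contiguous, fixed stride
    for (int jj = 0; jj < numneigh[i]; ++jj)
      std::printf("pair (%d,%d)\n", i, jlist[jj]);
  }
  return 0;
}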

ciao,
    axel.

I have found this discussion very interesting since I am also in a position
to make a decision on hardware.

I am actively exploring all options mentioned so far:

A. Build my own cluster (sooo much work for one lonely professor)

you need a sufficient supply of geeky grad students. that was a
well-working method when i was one: you could throw a stone at a group
of theory students and hit one. it is getting more difficult these
days, though.

B. Buy into a cluster in a larger state university (less work but less
flexibility)
C. Pay as you go with Amazon EC2

Does anyone have experience with cloud services like C? Direct experience
with an alternative to Amazon? In choosing a cloud service I am looking at:

1. Cost

relative to running your own machine, it will likely be more expensive.
at universities you usually don't have to pay (in full) for power and
cooling (after all, you have overhead subtracted from your grant money,
right?). also, if you set something up with desktop hardware, you beat
the prices of "server style" equipment significantly. but it only works
if you also have cheap labor (i.e. a sufficient supply of geeky grad
students).

cloud computing companies are in it to make money, so your savings
would mostly come from the "no effort" and "pay as you go" parts. in
other words, you have to examine how well you will utilize your local
hardware. this is why i like those 48/64-core AMD machines so much:
they are just a cluster disguised as a workstation, for a fair price.
they are a major pain in a larger cluster, though, because it is much
harder to keep them well utilized and extract good performance from
them.

2. Capacity to expand and shrink the power of my cluster as my needs vary

the question always is: how much? chances are your needs are small
compared to what is offered. the question is how specific the hardware
you need is. the majority of cloud computing provisioning is tailored
for embarrassingly parallel work, i.e. there is no high-speed
multi-node computing; the latter comes at a premium. this is what
makes buying into a local HPC resource attractive: those usually
specialize in hardware geared for HPC.

3. Ability to extract my data from the cloud (cost!)

well, that requires some monitoring and smart budgeting. remember that
you are making a contract with a commercial entity that is selling a
service and makes more money if it can find ways to charge you for
more. the pay-as-you-go concept is what makes it work. for a
comparison, just look at the cell phone market: you can get a fixed
contract or a pay-as-you-go phone. you have to examine your usage
patterns to find out which is the better option.

4. Ease of use

everything has a learning curve. i'm currently in the process of
teaching myself some of that virtual machine stuff. bottom line, once
you get the hang of it, it is as easy or as difficult as using any
other remote resource.

Can anyone else think of other considerations?

Oscar, I don't know much about the advantages of commercial compilers vs.
the open-source ones. Would one expect them to be much faster? Are they a
worthwhile investment?

only for very specific applications. compilers are in many ways like
voodoo: if you believe in it, it works for you. ;-)

axel.

LLVM 3+ and GCC 4.8 both autovectorize for x86/SSE processors. I
seriously doubt Intel C++ beats them by much, if at all. People
should feel free to prove me wrong with data. I'll try it myself now.

The results of my preliminary study are provided here:
https://wiki.alcf.anl.gov/parts/index.php/LAMMPS#Compiler_Comparison

I'm not claiming that I did a perfect job or that LJ is a useful model
to benchmark so please take the results with the appropriate number of
grains of salt.

there is indeed not much of a difference between the gnu and intel C++
compilers. i haven't tried a recent clang/LLVM version; the older one
was a tiny bit slower. gnu enforces aliasing rules by default, intel
does not, so gnu has an advantage there unless you turn them on with
intel. the basic force loops in many of the LAMMPS force kernels are
effectively vectorization-proof due to the data layout (coordinates
and forces are stored as arrays of arrays; the /omp styles gain some
extra performance by casting that into an array-of-structures
construct, through some rather ugly type casting to a const reference).

Yeah, I saw a talk by someone on the Intel compiler team that
criticized codes that do array-of-structs instead of struct-of-arrays
in great detail :-)

the part where you *do* see a noticeable difference between intel and
gnu is in potentials that make heavy use of functions from libm. this
is because the intel compiler silently replaces calls to libm with
code from an internal math library. the functions in glibc are tuned
for accuracy, and for several releases the exponential functions in
particular (including pow()) were extremely expensive due to excessive
internal calls to enforce correct rounding. in any case, work is
underway to make this a non-issue:

Indeed, this is quite common. We've observed this for quantum
chemistry codes as well since math.h functions are used extensively in
atomic integral codes.

https://svnweb.cern.ch/trac/vdt and (a bit more aggressive and in
plain C not C++) https://github.com/akohlmey/fastermath

Cool. I'll check it out.

If you want LAMMPS to run faster, take the force field kernel you use
the most, unroll the loops and insert some SSE/AVX/FMA intrinsics.
That should get you better performance than any optimizing compiler
and at the cost of only your time.

several people have tried this; there are even some publications on
the subject. to make this work you need to change the data structures
and ideally also the neighbor list, so that you avoid pointer chasing
(which inhibits vectorization), unaligned accesses, and gather
operations.

I was planning to hit the addresses with assembly intrinsics that do
no checking, hence the compiler's trepidation about pointers would not
be an impediment, but I'll verify that things are as they should be
w.r.t. data access.

Best,

Jeff

Based on Axel’s suggestion (and since I will never have a small army of grad students, maybe a couple of undergrads) I configured and priced a 64-core workstation (see below). If I understand correctly, the idea is to use these as standalone systems. That is, for the typical size of my problem, I would never need to run on two of these machines and can therefore forgo the fastest networking options. This one was priced at approximately $9,500.00.

A lammps specific question:

  1. How does the memory partitioning work? In lammps output I see:

Memory usage per processor = 3.96686 Mbytes

Is this per running core? Or per physical processor? Say I am running on a 2-processor quad-core machine with 8 MPI processes. Does this translate to 4 MBytes per core? Or 1 MByte per core? My gut tells me the former. The reason I ask is that I am trying to extrapolate to my biggest system of interest, and depending on the answer 64 GB may be overkill.

And other questions not so LAMMPS-related, but if anyone has input it would be welcome.

  2. Since I am not going to be running on a network… would it be wise to forgo the add-on network card and just use the integrated controller on the motherboard? The main use for the network adapter would be to download occasional data for analysis and to connect to the machine via ssh.

  3. I am forgoing a RAID array in this configuration to stay within the $10,000.00 budget. One big concern I have is backing up the system, but that is a conversation for another day.

  4. I have included a GPU in this configuration. In a cluster I can have a login node and run several jobs simultaneously. Would an inherent disadvantage of the single-workstation strategy be that this is not possible? Could I in principle simultaneously run a job with 32 cores, a job with 16 cores + GPU, and leave the other 16 cores for other activities (data processing, etc.)?

Thanks for any input!

Selection Summary

Processor 4 x Sixteen-Core AMD Opteron™ Model 6378 - 2.4GHz 32MB Cache (115W TDP)
Motherboard AMD® SR5690+SR5670 Chipset - Dual Intel® Gigabit Ethernet - 8x LSI SAS2 Controller - IPMI 2.0 with LAN
Memory 8 x 8GB PC3-12800 1600Mhz DDR3 ECC Registered DIMM
Chassis Thinkmate® TWX-748TQ - 4U/Tower - 5 x 3.5" SAS/SATA - 1400W Redundant
Hard Drive 3.0TB SAS 2.0 6.0Gb/s 7200RPM - 3.5" - Seagate Constellation™ ES.3
5.25" Bay LG 14x Blu-Ray Disc Rewriter and DVD/CD Rewriter with M-Disc (SATA)
Video Card NVIDIA® Quadro® K4000 3.0GB GDDR5 (1xDVI-I DL, 2x DP)
Network Card Intel® 10-Gigabit Ethernet Converged Network Adapter X540-T2 (Copper) (2x RJ-45)
Peripherals Microsoft Wired Desktop 400 Keyboard and Mouse (USB)
Operating System Ubuntu Linux 12.04 LTS Server Edition (No Media) (Community Support) (32-bit/64-bit)
Operating System Installation Please install my selected operating system in 64-bit mode where applicable. (Pre-Installed)
Warranty Thinkmate® Three Year Warranty with Advanced Parts Replacement and RSL

Configured Tech Specs
Processors

Product Line Opteron 6300
Socket Socket G34
Clock Speed 2.40 GHz
HyperTransport 6.4 GT/s
L3 Cache 16 MB
L2 Cache 8x 2MB
Cores/Threads 16C / 16T
AMD Turbo Core Technology Yes
Wattage 115W

Memory

Technology DDR3
Type 240-pin RDIMM
Speed 1600 MHz
Error Checking ECC
Signal Processing Registered

Motherboards

North Bridge AMD SR5690+SR5670
Memory Technology DDR3 ECC Registered
Memory Slots 32 x 240-pin DIMMs
Expansion Slots 2x PCI Express 2.0 x16,
2x PCI Express 2.0 x8,
1x UIO
Graphics Controller Matrox G200 16MB DDR2 graphics
Network Controller Intel® 82576 Gigabit (2-port)
Back-panel Interfaces PS/2 keyboard and mouse ports,
7x USB 2.0 ports (2x rear, 4x header, 1x Type A),
2x RJ-45 LAN Ports,
1x RJ-45 Dedicated LAN for IPMI,
1x VGA port,
1x Fast UART 16550 Serial port,
1x Serial port header
On-Board Interfaces 6 x SATA,
8 x SAS,
1 x USB
USB 2.0 Ports 7 (2 rear ports, 1 onboard, 4 optional via header)
LAN Ports 3 (2 LAN, 1 IPMI)
SAS 6Gbps Ports 8
SATA 3Gbps Ports 6
VGA Ports 1

Video Cards

Memory Capacity 3 GB
Processor NVIDIA Quadro K4000
DisplayPort Output x2
DVI Output x1

Chassis

Product Type 4U or Tower
Color Black
Watts 1400W
External Drive Bays 5x 3.5" Hot-swap (SAS / SATA) Drive Bays
2x 5.25" Peripheral Drive Bay
1x 5.25" Bay for Floppy
Front Ports 2x USB Ports
Cooling Fans 3x 5000 RPM Hot-swap Cooling Fans,
3x 5000 RPM Hot-swap Rear Exhaust Fans

Optical Drives

Product Type BD-RE + DVDRW
Read Speed 12x BD-ROM, 16x DVD-ROM, 48x CD-ROM
Write Speed 14x BD-R, 16x DVD+/-R, 48x CD-R
Rewrite Speed 2x BD-RE, 6x DVD-RW, 8x DVD+RW, 24x CD-RW

Hard Drives

Rotational Speed 7200RPM
Cache 128MB

Network Cards

Transmission Speed 10Gbps Ethernet
Host Interface PCI Express 2.1 x8
Cable Medium Copper
Port Interface 2x RJ-45
VT for Connectivity (VT-c) VMDq
VT for Directed I/O (VT-d) Yes

STC > Based on Axel's suggestion (and since I will never have a small army of
STC > grad students, maybe a couple undergrads) I configured and priced a 64 core
STC > workstation (see below). If I understand correctly, the idea is to use
STC > these as stand alone systems. That is, for the typical size of my problem,
STC > I would never need to run in two of these machines and can therefore forgo
STC > the fastest networking options. This one was priced at approximately
STC > $9500.00.

that seems a bit expensive.

STC >
STC > A lammps specific question:
STC >
STC > 1. How does the memory partitioning work? In lammps output I see:
STC >
STC > Memory usage per processor = 3.96686 Mbytes

that is per MPI process. OpenMP threads share memory, MPI processes do not. please note that this is a lower limit for memory consumption. the biggest memory consumers are usually neighbor lists and complex analysis operations that need to store intermediate data; the first strongly depends on cutoff and particle density, the second on how convoluted an analysis you program.
so the number of atoms alone may not be a good number to go on.
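a rough, purely illustrative estimate (your numbers will differ): even if the per-process figure grew from ~4 MB to ~20 MB for a five times bigger system, 64 MPI processes would still total only about 64 x 20 MB ~ 1.3 GB, so it is the neighbor lists and analysis buffers mentioned above, not the per-atom arrays, that you actually have to budget RAM for.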

STC > Is this per core running? Or per physical processor. Say I am running in a

neither. MPI was mostly designed to be hardware agnostic (at least from the point of view of the programmer).

STC > 2 processor quad-core machine with 8 mpi processes. Does this translate to
STC > 4 MBytes per core? Or 1 MByte per core? My gut tells me the former. The

yes.

STC > reason I ask is because I am trying to extrapolate to my biggest system of
STC > interest and depending on the answer 64 GB may be overkill.

yes, but you won't really save money by installing less. 1GB/core is pretty much the lower limit of what you can get.

STC > And others not so lammps related but if anyone has input it would be
STC > welcome.
STC >
STC > 2. Since I am not going to be running on a network... would it be wise to
STC > forgo the network controller and just use the integrated one on the
STC > motherboard? The main use for the network adapter would be to download
STC > occasional data for analysis and to connect to the machine via ssh.

where would you plug in a dual-port 10GigE network card? do you even have a 1GigE port available where you're going to place the machine?

STC > 3. I am forgoing a RAID array in this configuration to stay within the
STC > $10,000.00 budget. One big concern I have is backing up the system but that
STC > is a conversation for another day.

hard drives are cheap. stick a bunch of them into the machine and configure a software raid. works very well. in fact, given the size and failure rate of hard drives, i configure a software raid-1 or raid-5 in *any* desktop i use.

make sure you have a USB 3.0 port and you'll have no problem backing up. external hard drives with up to 1TB (in 2.5" form factor) are quite cheap these days and pretty fast. i use them like floppy disks.

STC > 4. I have included a GPU in this configuration. In a cluster I can have a
STC > login node and can run several jobs simultaneously. Would an inherent
STC > disadvantage of the single workstation strategy be that this is not
STC > possible. Could I in principle run simultaneously a job with 32 cores, a
STC > job with 16 cores + gpu, and leave the other 16 cores for other activities
STC > (data processing, etc...)

<sigh>
if you want to do GPU computing, build a machine that does GPU computing well. this is not the platform for it. best get a small intel CPU based desktop machine and - since you are on a budget - don't waste your money on a hyper-expensive Quadro GPU that you may only use sparingly.

in general, it is almost always a bad idea to cram too many options into one machine. better get something that does what you do the most extremely well. the money you save by junking the quadro will buy you a nice little desktop with a powerful GeForce GPU, and you'll enjoy lightning fast graphics with your favorite OpenGL based viz tool. it'll do GPU computing as well.

...and after you have found out that GPU computing works for you, write a proposal to get time on one of the available machines with GPUs. the people running those are often quite *desperate* to find users that can use their machines (well). there are not that many applications around that make good use of GPUs, and even with those that can, only a small number of people are willing to experiment; most rather stick to what they know. as jeff pointed out, the effort to get external time is moderate, especially if you can show some experience, and there you get access to all kinds of things you'd like to experiment with. i would not spend serious money on anything local, especially on a rather limited budget, unless you really know what you're doing.
</sigh>

STC > Thanks for any input!

well, you probably got more than you asked for.

axel.

STC > [...]

Regarding the use of a public supercomputer (XSEDE, INCITE, PRACE):

  These computers are intended for simulations of big systems (with >
100000 atoms). If I recall correctly, I tried to use an XSEDE
computer to run a large number of simulations of small systems (with <
1000 atoms) as separate jobs, and ended up waiting in the queue. It
was very inconvenient. There were workarounds, but it ended up being
much easier just to run them on multiple (low-end) desktop machines we
had lying around at the time. (This is something which really annoys
me. I replied to your post mostly to warn you about this issue. You
don't have much freedom if you use big clusters like these.) The type
of simulations I currently run are not simple LAMMPS simulations.
(They are bash scripts which invoke multiple LAMMPS jobs at the same
time, alternating between LAMMPS and a Python script. Sometimes I
schedule them using cron.) Getting these to run on a big cluster
like XSEDE is possible but frustrating. So it depends on the nature
of the simulations you are running.

   Either way, you or your students/postdocs will definitely want to
have a machine to test LAMMPS simulations on before submitting to the
cluster (just to debug problems with their input scripts). A cheap
Dell (laptop or desktop) has no trouble running Ubuntu, but make
sure it has a reasonable graphics card for visualization.

   Getting LAMMPS to work with GPUs might require more effort on the
part of the system administrator (it took me about 3 days to figure it
out on my hardware). As far as cost goes, my impression is that a
reasonably high-end desktop computer with 12 or 16 cores, a single
NVIDIA Kepler GPU, and 64 GB of RAM should be about $3000 or
less. Something lower-end should cost < $1000.

   Instructions for compiling LAMMPS with GPU support on ubuntu/debian
are available here:

http://lammps.sandia.gov/threads/msg36437.html

Cheers

Andrew

RAM requirements:

The script I use to prepare LAMMPS input files requires between 3 GB
and 12 GB of RAM per million atoms. (However, this is an upper bound,
because I was not very careful about memory when I wrote this script.)

Regarding the use of a public supercomputer (XSEDE, INCITE, PRACE):

  These computers are intended for simulations of big systems (with >
100000 atoms). If I recall correctly, I tried to use an XSEDE
computer to run a large number of simulations of small systems (with <
1000 atoms) as separate jobs, and ended up waiting in the queue. It
was very inconvenient. There were workarounds, but it ended up being
much easier just to run them on multiple (low-end) desktop machines we
had lying around at the time. (This is something which really annoys
me. I replied to your post mostly to warn you about this issue. You
don't have much freedom if you use big clusters like these.) The type
of simulations I currently run are not simple LAMMPS simulations.
(They are bash scripts which invoke multiple LAMMPS jobs at the same
time, alternating between LAMMPS and a Python script. Sometimes I
schedule them using cron.) Getting these to run on a big cluster
like XSEDE is possible but frustrating. So it depends on the nature
of the simulations you are running.

well, this is a very specific need you are describing, and part of
your problem is the choice of tools that you are using. you cannot
blame the machine for that. lammps supports running multiple
concurrent partitions natively.

as it so happens, i am currently working on making a similar kind of
solution work on a blue gene machine, where you simply cannot do it
that way. it is not impossible to do these kinds of things from inside
a single MPI program: just run multiple partitions and replace the
scripts with suitable c/c++ code. i even have a fortran 90 code
talking to lammps while each of the two codes runs in parallel on
different processors.
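a hedged sketch of that pattern (this is *not* the code i am working
on; the LAMMPS C library interface in library.h provides lammps_open(),
lammps_file() and lammps_close(), but check the header of your LAMMPS
version since the exact signatures have changed over time). each
partition gets its own sub-communicator and runs its own input script,
here assumed to be called in.partition0, in.partition1, ... (made-up
names):

// split MPI_COMM_WORLD into independent partitions and run one LAMMPS
// instance per partition through the library interface. assumes LAMMPS
// was built as a library and linked into this program.

#include <mpi.h>
#include <cstdio>
#include "library.h"                      // LAMMPS C-style library API

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int npart = 4;                    // illustrative number of partitions
  int color = rank % npart;               // which partition this rank joins

  MPI_Comm subcomm;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &subcomm);

  // every partition gets its own, fully independent LAMMPS instance
  char *lmpargs[] = {(char *)"liblammps", (char *)"-log", (char *)"none"};
  void *lmp = NULL;
  lammps_open(3, lmpargs, subcomm, &lmp);

  char infile[64];
  std::snprintf(infile, sizeof(infile), "in.partition%d", color);
  lammps_file(lmp, infile);               // run that partition's input script

  lammps_close(lmp);
  MPI_Comm_free(&subcomm);
  MPI_Finalize();
  return 0;
}

for the simpler case of several independent runs of the same kind, the
stock -partition command line switch (e.g. mpirun -np 16 lmp -partition
4x4 -in in.script) already does this without writing any code.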

   Either way, you or your students/postdocs will definitely want to
have a machine to test LAMMPS simulations on before submitting to the
cluster (just to debug problems with their input scripts). A cheap
Dell (laptop or desktop) has no trouble running Ubuntu, but make
sure it has a reasonable graphics card for visualization.

   Getting LAMMPS to work with GPUs might require more effort on the
part of the system administrator (it took me about 3 days to figure it
out on my hardware). As far as cost goes, my impression is that a
reasonably high-end desktop computer with 12 or 16 cores, a single
NVIDIA Kepler GPU, and 64 GB of RAM should be about $3000 or
less. Something lower-end should cost < $1000.

   Instructions for compiling LAMMPS with GPU support on ubuntu/debian
are available here:

http://lammps.sandia.gov/threads/msg36437.html

buying a machine with GPUs when you have no experience with them and
no knowledge of how well they support the features you need is a bad
idea. especially at the high end, taking good advantage of GPUs takes
some effort and ingenuity.

practically none of the existing larger scale accelerator deployments
are good examples. they all have problems with having to deal with the
hardware limitations. one has to make choices and compromises, and how
much those affect somebody depends on many little details.

axel.