I was asked today about benchmarking between high performance computing (HPC) and big data systems. In context, the discussion was about different software architectures deployed across commodity hardware, not the specialized systems you see on the top500 list. To clarify: supercomputers are something you buy; high performance computing (HPC) is something you do. Don't get them mixed up. Not very many people play in both areas, and the few that do don't seem to think very empirically about both, so there's not much concrete information to go by. That makes the question a difficult one, but this was my best shot at answering it at the moment.
My immediate response was that the lighter-weight, less fault-tolerant HPC architectures will always be faster given the same resources, but they are of limited use for big data.
Benchmarks comparing big data and HPC architectures to each other don't exist in a meaningful way because the benchmark algorithms used are different. Big data's most prominent benchmarks are sorting problems that operate on commonly available data. HPC's most prominent benchmarks are algorithmic; Linpack TPP, for example, solves a dense system of linear equations (i.e. math stuff with no data).
In my own experience, heterogeneous compute environments (aka GPUs) have typically used High Performance Computing protocols, like Message Passing Interface (MPI) based applications. This is because the jobs they are used for don't take a lot of time (right now, at least). HPC protocols like MPI are notoriously bad at fault tolerance, so software or hardware failures cause the entire job to be restarted. We use HPC protocols because there's lots of legacy code written for them and they keep the jobs fast; starting over isn't much of a big deal because the jobs typically run so fast anyway.
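To see why starting over is cheap for short jobs but ruinous for long ones, here's a back-of-the-envelope sketch using the standard expected-runtime formula for restart-from-scratch under randomly arriving (Poisson) failures. The failure rate below is an assumed, illustrative number, not a measured one:

```python
import math

def expected_runtime(job_hours, failure_rate_per_hour):
    """Expected wall-clock hours to finish a job that needs `job_hours`
    of uninterrupted runtime, when any failure forces a restart from
    scratch.  Standard result: E[T] = (e^(lambda*L) - 1) / lambda."""
    lam = failure_rate_per_hour
    return (math.exp(lam * job_hours) - 1) / lam

# Assume one failure every 1,000 hours on average (illustrative).
rate = 1 / 1000

print(expected_runtime(1, rate))     # ~1.0005 hours: restarts barely matter
print(expected_runtime(100, rate))   # ~105 hours: ~5% overhead
print(expected_runtime(5000, rate))  # ~147,000 hours: restarts dominate
```

The overhead grows exponentially with job length, which is exactly why restart-from-scratch is tolerable for fast HPC jobs and intolerable for long-running big data jobs.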
Restarts on the big data jobs are more of a problem because the jobs
often run for much longer, meaning lots of wasted time if you have to go
back to the beginning. A compute job running over 10,000 compute hours
is almost certain to have a hardware failure.
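A quick way to sanity-check that claim is the usual exponential failure model: the chance of at least one hardware failure across N node-hours is 1 - e^(-N/MTBF). The MTBF figures below are assumptions for illustration, not measurements:

```python
import math

def p_any_failure(compute_hours, mtbf_hours):
    """Probability of at least one hardware failure over a total of
    `compute_hours` node-hours, assuming independent exponential
    failures with the given per-node mean time between failures."""
    return 1 - math.exp(-compute_hours / mtbf_hours)

# Illustrative MTBF assumptions, not vendor figures.
print(p_any_failure(10_000, 50_000))  # ~0.18 with very reliable hardware
print(p_any_failure(10_000, 5_000))   # ~0.86 with commodity hardware
```

On the commodity hardware these architectures target, a failure somewhere during a 10,000 compute-hour job is the expected case, not the exception.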
Software failures are much more common (deadlocks, out of memory, segmentation faults, bad data, and a million other ways to fail). Because of this, big data software is built to be more fault tolerant. That makes it slower, but able to finish those long-running jobs on big data. Big data software has been getting drastically faster, but at the cost of using more computers simultaneously to attain that speed.
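The core fault-tolerance trick is checkpointing: pay a little extra work on every step so a crash only loses the work since the last checkpoint, instead of the whole job. Here's a minimal sketch (the checkpoint file name and the toy "work" of summing numbers are hypothetical, just to show the pattern):

```python
import json
import os

CKPT = "job.ckpt.json"  # hypothetical checkpoint file name

def run_job(items, ckpt_path=CKPT):
    """Sum `items`, checkpointing progress after each unit of work.
    A restarted run reloads the checkpoint and resumes where it left
    off, rather than going back to the beginning."""
    done, partial = 0, 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
        done, partial = state["done"], state["partial"]  # resume point
    for i in range(done, len(items)):
        partial += items[i]  # the unit of "work"
        # Persist progress so a crash here costs at most one item.
        with open(ckpt_path, "w") as f:
            json.dump({"done": i + 1, "partial": partial}, f)
    return partial
```

Real systems checkpoint far less often than once per item (the extra I/O is the speed cost the paragraph above describes), but the trade is the same: slower steady-state progress in exchange for cheap recovery.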
Lots of people have attempted to benchmark performance across the different systems (which are myriad in both of the categories I've mentioned here), but none of the benchmark standards have seen very wide adoption. This is because they are like the systems they're trying to benchmark: new, with plenty of room to get better. The fundamentals keep changing. In my mind this speaks to how new all of this stuff really still is. Everyone is still trying to find the best way to do things and making massive progress. It's an exciting time to be working with these technologies.
Some good benchmarks and graphics comparing the various platforms: