Tuesday, February 3, 2015

Big data benchmarks and High Performance Computing protocols

I was asked today about benchmarking between high performance computing (HPC) and big data systems. In this case, the discussion was about different software architectures deployed across commodity hardware, not the specialized systems you see on the top500 list. To clarify: supercomputers are something you buy, high performance computing (HPC) is something you do. Don't get them mixed up. Not very many people play in both areas, and the few who do don't seem to think very empirically about both, so there's not much concrete information to go by. That makes the question a difficult one, but this was my best shot at answering it at the moment.

My immediate response was that the lighter-weight, less fault-tolerant HPC architectures will always be faster given the same resources, but they are of limited use for big data. Benchmarks comparing big data and HPC architectures to each other don't exist in a meaningful way because the benchmark algorithms used are different. Big data's most prominent benchmarks are sorting problems that operate on commonly available data. HPC's most prominent benchmarks are algorithmic; for example Linpack TPP, which solves a dense system of linear equations (i.e. math stuff with essentially no input data).
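To make the contrast concrete, here's a toy sketch of the two workload styles (Python with NumPy, purely illustrative; the real benchmarks run at far larger scale and measure throughput carefully, these are just stand-ins for the kernels involved):

    import numpy as np

    rng = np.random.default_rng(0)

    # Big data flavor: sort a pile of records (dominated by moving data around).
    records = rng.integers(0, 2**32, size=1_000_000)
    sorted_records = np.sort(records)

    # HPC flavor: solve a dense system of linear equations Ax = b
    # (dominated by floating-point math; an LU solve is the heart of Linpack).
    n = 1_000
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    x = np.linalg.solve(A, b)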

In my own experience, heterogeneous compute environments (aka GPUs) have typically used High Performance Computing protocols, like Message Passing Interface (MPI) based applications. This is because the jobs they are used for don't take a lot of time (right now at least). HPC protocols like MPI are notoriously bad at fault tolerance, so software or hardware failures cause the entire job to be restarted. We use HPC protocols because there's lots of legacy code written for them and it keeps the jobs fast, and starting over isn't much of a big deal because the jobs typically run so fast anyway.
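Here's what that looks like in practice, a minimal sketch using mpi4py (illustrative only, not any particular production code): each rank computes a partial result and a collective combines them. With MPI's default error handling, a crash on any one rank typically takes the whole job down, and you rerun it from the start.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank works on its own slice of the problem.
    local_sum = sum(range(rank * 1000, (rank + 1) * 1000))

    # Collective operation: every rank must participate. If one rank has died,
    # the others hang or the job aborts; there is no built-in per-task retry.
    total = comm.allreduce(local_sum, op=MPI.SUM)

    if rank == 0:
        print("total:", total)

You'd launch it with something like mpirun -n 4 python partial_sums.py (the script name is made up).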

Restarts on the big data jobs are more of a problem because the jobs often run for much longer, meaning lots of wasted time if you have to go back to the beginning. A compute job running over 10,000 compute hours is almost certain to hit a hardware failure. Software failures are much more common (deadlocks, out of memory, segmentation faults, bad data, a million other ways to fail). Because of this, big data software is more fault tolerant. That makes it slower, but able to do those long-running jobs on big data. Big data software has been getting drastically faster, but at the cost of using more computers simultaneously to attain that speed.
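Big data frameworks get that fault tolerance by breaking work into small tasks and retrying or recomputing only the failed pieces. Here's a hedged sketch of the pattern (the function and directory names are invented for illustration; frameworks like Hadoop and Spark do this bookkeeping for you):

    import os
    import pickle

    CHECKPOINT_DIR = "checkpoints"   # hypothetical location for task outputs
    MAX_RETRIES = 3

    def run_with_retries(task_id, task_fn, *args):
        """Run one task, reusing a saved result if it already succeeded."""
        path = os.path.join(CHECKPOINT_DIR, "task_{}.pkl".format(task_id))
        if os.path.exists(path):              # finished on a previous run, skip it
            with open(path, "rb") as f:
                return pickle.load(f)
        for attempt in range(MAX_RETRIES):
            try:
                result = task_fn(*args)
                os.makedirs(CHECKPOINT_DIR, exist_ok=True)
                with open(path, "wb") as f:   # checkpoint the result so a later
                    pickle.dump(result, f)    # restart doesn't redo this work
                return result
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise                     # give up after a few tries

That extra bookkeeping is exactly the overhead that makes these systems slower than the lean HPC approach, and exactly what lets a long-running job survive a dead node.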

Lots of people have attempted to benchmark performance across the different systems (which are myriad in both of the categories I've mentioned here), but none of the benchmark standards has seen very wide adoption. This is because they are like the systems they're trying to benchmark: new, with plenty of room to get better. The fundamentals keep changing. In my mind this speaks to how new all of this stuff really still is. Everyone is still trying to find the best way to do things and making massive progress. It's an exciting time to be working with these technologies.

Some good benchmarks and graphics comparing the various platforms:
https://amplab.cs.berkeley.edu/benchmark/
http://www.bigdatatop100.org/
http://icl.cs.utk.edu/hpcc/