Friday, April 26, 2013

The Mythical Big Data Supercomputer


by Eric Whyne
Originally posted on Data Tactics Blog.

A friend recently made the conjecture that "supercomputers have been doing big data research for decades." I think about, read about, or work with data systems every day, and his comment caught me by surprise because I had never made the connection between supercomputers and big data systems. They are so different in my mind that I don't even see them as similar things. I'll try to explain why here.

Blue Gene/P Supercomputer
First, my definition of a supercomputer is any architecture listed on http://www.top500.org/; smaller clusters that mirror these architectures fit the definition too. Large clusters of computers make this list by being good solutions for the Linpack Benchmark, which involves solving a dense system of linear equations and is basically a measure of the system's floating-point rate of execution. That's a good way to think of a supercomputer: the focus is on doing fast computation. Supercomputers dominate when a problem can be described in a relatively small amount of data and the processors can go to work on it. If you look at the specifications for the computers on top500, you'll notice that something is missing. Typically when we quantify computers we include three numbers: number of cores, amount of memory, and amount of raw disk space. I think the supercomputers don't include raw disk space in the top500 specs because it's usually embarrassingly small. It might not be tactful to call it embarrassing, but it's smaller than you'd expect: sometimes only a few gigs per core.
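
To make that concrete, here is a minimal sketch, in Python with NumPy, of roughly what Linpack measures: solve a dense system of linear equations and time it. The problem size and the conventional operation count (about 2/3*n^3 + 2*n^2 floating-point operations for an LU-based solve) are assumptions for illustration, not a real benchmark run.

import time
import numpy as np

n = 4096                                  # tiny compared to real top500 runs
np.random.seed(0)
A = np.random.rand(n, n)                  # dense random matrix, roughly 130 MB
b = np.random.rand(n)

start = time.time()
x = np.linalg.solve(A, b)                 # LU factorization plus triangular solves
elapsed = time.time() - start

flops = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2   # conventional operation count for the solve
print("n=%d: %.2f seconds, ~%.1f GFLOP/s" % (n, elapsed, flops / elapsed / 1e9))

Notice what isn't being measured: no disk is involved at all, and the entire working set fits comfortably in memory. That's the shape of problem these machines are ranked on.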

Yes, supercomputers can be great at big data problems, if you consider math or raw combinatorial problems (like cracking passwords and crypto) to be big data problems. I do. However, don't expect to go exploring massive amounts of data or doing data correlation activities on these systems. The focus on memory and math makes them great for running simulations: quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling, wind tunnels, and analyzing the detonation of nuclear weapons.
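
As a toy illustration of that kind of problem, here is a sketch of a brute-force hash search where the input is a single hash and nearly all the effort is computation. The target hash and the four-character lowercase keyspace are made up for the example.

import hashlib
import itertools
import string
from multiprocessing import Pool

# Pretend all we know is the hash; the "data" is one 64-character string.
TARGET = hashlib.sha256(b"zzzz").hexdigest()

def check_prefix(first_char):
    # Exhaust every four-character candidate starting with first_char.
    for rest in itertools.product(string.ascii_lowercase, repeat=3):
        candidate = first_char + "".join(rest)
        if hashlib.sha256(candidate.encode()).hexdigest() == TARGET:
            return candidate
    return None

if __name__ == "__main__":
    # The work divides cleanly across cores (or nodes) with almost no data
    # movement, which is exactly where compute-focused architectures shine.
    pool = Pool()
    for result in pool.imap_unordered(check_prefix, string.ascii_lowercase):
        if result:
            print("found: " + result)
            break
    pool.terminate()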

Supercomputer interconnect
Now enter the typical big data systems. These systems are designed to avoid reliance on networks. Supercomputers get around that limitation of disk space by trying to increase network efficiency; they began pushing data around via custom interconnects over fiber cables. It's not enough: the network is still a tight bottleneck. Even with algorithms that reduce network overhead, pushing data over the network is still inefficient for most problems.
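
A rough back-of-the-envelope calculation shows the difference in how the two approaches scale. The bandwidth figures below are assumptions for illustration, not measurements of any particular cluster.

TB = 1e12                       # bytes

dataset_bytes  = 100 * TB       # hypothetical dataset
nodes          = 200            # hypothetical cluster size
disk_bw_node   = 500e6          # assume ~500 MB/s of aggregate local disk bandwidth per node
shared_link_bw = 40e9 / 8       # assume a 40 Gb/s link that all shipped data must cross

local_scan_s = dataset_bytes / (nodes * disk_bw_node)   # every node reads only its own slice
network_ship_s = dataset_bytes / shared_link_bw         # all data crosses the shared link

print("scan locally on %d nodes: about %.1f hours" % (nodes, local_scan_s / 3600.0))
print("ship it all over the network: about %.1f hours" % (network_ship_s / 3600.0))

The local scan gets faster every time you add a node with its own disks; the shared link only gets faster when you physically upgrade it.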

Here's the deep insight for this blog post, and I'm quoting somebody smarter than me: "The reason we tolerate Hadoop and HDFS is because that's where the data is." Big data systems preposition the data on compute nodes and organize it. It's a way of moving the processing power closer to the data to reduce the number of times data has to be pushed over the network. This approach scales with the size of the data: networks can only get linearly more efficient, while data sizes can grow much faster. As a long-term solution, making the bottlenecks faster doesn't fix the problem; they need to be bypassed completely for the system to scale horizontally. Eventually, depending on the algorithm you are trying to execute, the amount of data being exchanged outgrows any improvement to the network, and the supercomputer's network chokes on too much data. The processors begin spinning with no traction. When the processors are generating their own solutions to evaluate (e.g. simulations or combinatorial problems), this doesn't happen. Another way to phrase this is that supercomputers have good capability but mediocre capacity; sometimes you'll hear the terms capacity computing and capability computing thrown about.
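
For a feel of what that looks like in practice, here is a minimal word-count mapper and reducer in the Hadoop Streaming style, written in Python as a sketch rather than a production job. Hadoop schedules copies of the mapper on (or near) the nodes that already hold each HDFS block, so the bulk of the input never crosses the network; only the small per-word counts from the map phase do.

import sys

def mapper():
    # Runs against a local block of the input; emits one "word<TAB>1" per word.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer():
    # Receives lines sorted by key; only these small counts traveled over the network.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

Under Hadoop Streaming the same script would be passed as the -mapper and -reducer commands of a streaming job; the heavy reading happens where the data already lives.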

Real big data technology is facilitated by distributed, scalable, and portable data stores. The Hadoop Distributed File System (HDFS) is the big dog in this realm right now: rack awareness, auto-balancing, three-way replication of data. It has a ton of great features. The MapR File System has duplicated most of that functionality but taken it in another direction by attempting to make it POSIX compliant, which means normal Unix operating systems can mount it like any other file system. Other technologies ride on these file systems or have their own methods of being distributed, scalable, and portable. As another general rule, big data systems stick with commodity hardware. Rather than optimizing the network and moving away from commodity hardware, serialization technologies such as Apache Thrift and Google Protocol Buffers compactly encode data before it's sent over the network.
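
As a standard-library-only sketch of why compact wire formats matter (rather than actual Thrift or Protocol Buffers usage), here is a made-up record encoded as JSON text versus a packed binary layout. Schema-driven encodings get their savings in a broadly similar way: field names and types live in the schema, not in every message.

import json
import struct

record = {"user_id": 1234567, "timestamp": 1366934400, "score": 0.87}

as_json = json.dumps(record)             # field names repeated in every record

# Pack the same three fields as two unsigned 32-bit ints and a double.
as_binary = struct.pack("!IId", record["user_id"], record["timestamp"], record["score"])

print("%d bytes as JSON text" % len(as_json))        # about 60 bytes
print("%d bytes as packed binary" % len(as_binary))  # 16 bytes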

The differences between supercomputers and big data systems may seem nuanced at first but become substantial as you evaluate the types of problems each can be used for. In a future post I'll discuss the differences between vertical and horizontal scalability as a more in-depth look at these different approaches and how each can have its place in various missions.