by Eric Whyne
Originally posted on Data Tactics Blog.
A friend recently made the conjecture that "supercomputers have been doing big data research for decades". I think about, read about, or work with data systems every day, and his comment caught me by surprise: I had never made the connection between supercomputers and big data systems. They are so different in my mind that I don't even think of them as similar things. I'll try to explain why here.
[Image: Blue Gene/P supercomputer]
Yes, supercomputers can be great at big data problems, if you consider math or raw combinatorial problems (a.k.a. cracking passwords and crypto) to be big data problems. I do. However, don't expect to go exploring massive amounts of data or doing data correlation activities on these systems. The focus on memory and math makes them great for running simulations: quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling, wind tunnels, and analyzing the detonation of nuclear weapons.
Now enter the typical big data systems. These systems are designed to avoid moving data over the network whenever possible; instead, they move the computation to wherever the data already lives.
Here's the deep insight for this blog post, and I'm quoting somebody smarter than me: "The reason we tolerate Hadoop and HDFS is because that's where the data is." Big data systems pre-position the data on compute nodes and organize it there. It's a way of moving the processing power closer to the data, reducing the number of times data has to be pushed over the network. This approach scales with the size of the data: networks only get linearly more efficient, while data sizes can grow much faster. As a long-term solution, making the bottleneck faster doesn't fix the problem; the bottleneck needs to be bypassed completely for the system to scale horizontally. Eventually, depending on the algorithm you're running, the growth in data exchanged over the network outpaces improvements to the network itself, and a supercomputer's interconnect chokes on too much data. The processors spin with no traction. When the processors generate their own solutions to evaluate (e.g. simulations or combinatorial problems), this doesn't happen. Another way to phrase this is that supercomputers have good capability but mediocre capacity; sometimes you'll hear the terms "capability computing" and "capacity computing" thrown about.
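To make that concrete, here's a minimal back-of-envelope sketch in Python comparing the time to ship a dataset over a shared network link against the time to scan it in parallel on the nodes that already hold it. The dataset size, bandwidth, node count, and disk scan rate are made-up but plausible assumptions for illustration, not measurements of any real cluster.

```python
# Back-of-envelope: shipping data to the compute vs. scanning it where it lives.
# All numbers below are illustrative assumptions, not benchmarks.

DATA_TB = 100           # total dataset size in terabytes (assumed)
NETWORK_GBPS = 10       # shared link feeding a central cluster, gigabits/s (assumed)
NODES = 100             # nodes that already store the data locally (assumed)
LOCAL_SCAN_MBPS = 200   # sequential disk scan rate per node, megabytes/s (assumed)

data_bytes = DATA_TB * 1e12

# Option 1: move the data across the network to the processors.
network_bytes_per_s = NETWORK_GBPS * 1e9 / 8
ship_hours = data_bytes / network_bytes_per_s / 3600

# Option 2: move the computation to the data and scan each node's share locally.
per_node_bytes = data_bytes / NODES
scan_hours = per_node_bytes / (LOCAL_SCAN_MBPS * 1e6) / 3600

print(f"Ship {DATA_TB} TB over a {NETWORK_GBPS} Gb/s link: ~{ship_hours:.0f} hours")
print(f"Scan locally across {NODES} nodes: ~{scan_hours:.1f} hours")
```

With these assumptions the shared link takes roughly 22 hours while the local scans finish in under two. Doubling the data doubles both numbers, but the local-scan side speeds up again by adding nodes, while the shared link does not; that's the horizontal-scaling argument in miniature.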
Real big data technology is facilitated by distributed, scalable, and portable data stores. The Hadoop Distributed File System (HDFS) is the big dog in this realm right now: rack awareness, auto-balancing, and triplicate data replication, among a ton of great features. The MapR File System has duplicated most of that functionality but taken it in another direction by making it POSIX-compliant, which means a normal Unix operating system can mount it like any other file system. Other technologies ride on these file systems or have their own methods of being distributed, scalable, and portable. As another general rule, big data systems stick with commodity hardware. Rather than optimizing the network and moving away from commodity hardware, serialization technology such as Apache Thrift and Google Protocol Buffers encodes data compactly before it's sent over the network.
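As a rough illustration of that compact-encoding idea (using Python's built-in struct and json modules rather than Thrift or Protocol Buffers themselves, and a hypothetical record layout), compare how many bytes the same record takes on the wire:

```python
import json
import struct

# A hypothetical log record: (user_id, timestamp, latitude, longitude).
record = {"user_id": 1234567, "ts": 1385000000, "lat": 38.8951, "lon": -77.0364}

# Text encoding: self-describing but verbose; field names repeat in every record.
text_bytes = json.dumps(record).encode("utf-8")

# Compact binary encoding: a fixed layout agreed on ahead of time (the role a
# schema plays in Thrift or Protocol Buffers): two unsigned 64-bit ints, two doubles.
binary_bytes = struct.pack(
    "<QQdd", record["user_id"], record["ts"], record["lat"], record["lon"]
)

print(len(text_bytes), "bytes as JSON")    # roughly 70 bytes
print(len(binary_bytes), "bytes packed")   # 32 bytes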
The differences between supercomputers and big data systems may seem nuanced at first, but they become substantial as you evaluate the kinds of problems each can be used for. In a future post I'll take a more in-depth look at vertical versus horizontal scalability and how each approach has its place in various missions.