Friday, April 26, 2013

The Mythical Big Data Supercomputer


by Eric Whyne
Originally posted on Data Tactics Blog.

A friend recently made the conjecture that "supercomputers have been doing big data research for decades". I think about, read about, or work with data systems every day, and his comment caught me by surprise because I had never made the connection between supercomputers and big data systems before. They are so different in my mind that I don't even see them as similar things. I'll try to explain why here.

Blue Gene/P Supercomputer
First, my definition of a supercomputer is any architecture listed on http://www.top500.org/. Smaller clusters that mirror these architectures also fit the definition. Large clusters of computers make this list by being a good solution for the Linpack Benchmark, which involves solving a dense system of linear equations and is basically a measure of the system's floating-point rate of execution. That's a good way to think of a supercomputer: the focus is on doing fast computation. Supercomputers dominate when a problem can be described in a relatively small amount of data and the processors can go to work on it. If you look at the specifications for the computers on top500, you'll notice something missing. Typically when we quantify computers we include three numbers: number of cores, amount of memory, and amount of raw disk space. I suspect the top500 specs leave out raw disk space because it's usually embarrassingly small. It might not be tactful to call it embarrassing, but it's smaller than you'd expect, sometimes only a few gigs per core.
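To put "floating-point rate of execution" in concrete terms, here's a rough sketch in Python of the kind of work Linpack times: solve a dense system Ax = b and estimate the rate. This is a toy illustration on a single machine, not the real HPL benchmark with its tuned kernels, and the matrix size is just something that fits comfortably in a desktop's RAM.

```python
# Toy illustration of the kind of work the Linpack benchmark measures:
# solve a dense random system Ax = b and estimate the floating-point rate.
# Not the real HPL benchmark; just a single-machine sketch.
import time
import numpy as np

n = 2000                              # matrix size chosen to fit easily in RAM
A = np.random.rand(n, n)
b = np.random.rand(n)

start = time.time()
x = np.linalg.solve(A, b)             # LU factorization plus triangular solves
elapsed = time.time() - start

flops = (2.0 / 3.0) * n ** 3          # standard flop estimate for a dense LU solve
print("%.2f GFLOP/s" % (flops / elapsed / 1e9))
print("residual: %e" % np.linalg.norm(A.dot(x) - b))
```

Notice that the whole problem is a few tens of megabytes of matrix; the machine spends its time computing, not moving data around. That's the regime supercomputers are built for.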

Yes, supercomputers can be great at big data problems, if you consider math or raw combinatorial problems (a.k.a. cracking passwords and crypto) to be big data problems. I do. However, don't expect to go exploring massive amounts of data or doing data correlation activities on these systems. The focus on memory and math makes them great for running simulations: quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modelling, virtual wind tunnels, and analysing the detonation of nuclear weapons.

Supercomputer interconnect
Now enter the typical big data systems. These systems are designed to avoid relying on the network. Supercomputers work around their limited local disk space by trying to make the network more efficient; they began pushing data around via custom interconnects over fiber cables. It's not enough; the network is still a tight bottleneck. Even with algorithms that reduce network overhead, pushing data over the network is inefficient for most problems.

Here's the deep insight for this blog post, and I'm quoting somebody smarter than me: "The reason we tolerate Hadoop and HDFS is because that's where the data is". Big data systems pre-position the data on the compute nodes and organize it there. It's a way of moving the processing power closer to the data, reducing the number of times data has to be pushed over the network. This approach scales with the size of the data. Networks can only get linearly more efficient, while data size can grow much faster. As a long-term solution, making the bottleneck faster doesn't fix the problem; it has to be bypassed completely for the system to scale horizontally. Eventually, depending on the algorithm you are trying to execute, the amount of data being exchanged outgrows any improvement to the network, and a supercomputer's network chokes on too much data. The processors begin spinning with no traction. When the processors generate their own solutions to evaluate (e.g. simulations or combinatorial problems) this doesn't happen. Another way to phrase this is that supercomputers have good capability but mediocre capacity; sometimes you'll hear the terms capability computing and capacity computing thrown about.
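To make "moving the code to the data" concrete, here's a minimal word count written for Hadoop Streaming, which lets plain scripts act as the map and reduce steps. Hadoop tries to schedule each map task on a node that already holds that block of the input in HDFS, so only the small intermediate key/value pairs cross the network. Keeping both roles in one file and switching on a "map" argument is just my own convention for the sketch.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming word count. Map tasks run on the nodes that already
# hold the HDFS blocks, so the raw input never has to cross the network.
import sys

def mapper():
    # Reads raw text lines from the local block, emits "word<TAB>1".
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Streaming sorts by key between phases, so counts for a word arrive contiguously.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You'd hand this to the Hadoop Streaming jar with the usual -input, -output, -mapper, and -reducer options; the exact jar location varies by distribution.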

Real big data technology is facilitated by distributed, scalable, and portable data stores. The Hadoop Distributed File System (HDFS) is the big dog in this realm right now: rack awareness, auto-balancing, replication of data in triplicate, and a ton of other great features. The MapR File System has duplicated most of that functionality but taken it in another direction by attempting to make it POSIX-compliant, which means normal Unix operating systems can mount it like any other file system. Other technologies ride on these file systems or have their own methods of being distributed, scalable, and portable. As another general rule, big data systems stick with commodity hardware. Rather than optimizing the network and moving away from commodity hardware, serialization technology such as Apache Thrift and Google Protocol Buffers keeps the data compact before it's sent over the network.
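As a sketch of what that compact encoding looks like in practice, here's roughly how a record gets serialized with Protocol Buffers in Python. The LogRecord message and its record_pb2 module are hypothetical; you'd define them in a .proto file and generate the module with protoc.

```python
# Hypothetical Protocol Buffers usage: LogRecord and record_pb2 are stand-ins
# for a message you'd define in a .proto file and compile with protoc.
from record_pb2 import LogRecord

rec = LogRecord()
rec.host = "node-17"
rec.timestamp = 1366934400
rec.bytes_sent = 8192

wire_bytes = rec.SerializeToString()   # compact binary encoding for the wire
print("encoded size: %d bytes" % len(wire_bytes))

copy = LogRecord()
copy.ParseFromString(wire_bytes)       # the receiving node rebuilds the record
```

The point isn't the particular library; it's that the bytes crossing the network are a terse binary form rather than verbose text, which matters when the cluster is built from commodity switches.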

The differences between supercomputers and big data systems may seem nuanced at first but become substantial as you evaluate the types of problems each can be used for. In a future post I'll discuss the differences between vertical and horizontal scalability as a more in-depth look at these approaches and how each has its place in various missions.

Wednesday, April 24, 2013

Dial indicator, heat deflection, and upgrades

I just posted a video on YouTube: "Dial indicator measuring metal expansion and subsequent deflection caused by application of heat". I couldn't sleep last night, and it was too early to go for a run, so I ended up messing around in the garage with the new dial indicator I bought off Amazon for $20. After testing it with some wafers of known thickness, I made this setup to measure metal deflection.



It's common knowledge that heat causes things to expand. That's why running warm water over a jar lid can make it easier to loosen. Heating stuck nuts on bolts can also make them easier to remove.
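For a rough sense of scale, linear expansion follows ΔL ≈ αLΔT. Here's a quick back-of-the-envelope in Python; the rod length and temperature rise are guesses for illustration, not measurements from my setup.

```python
# Back-of-the-envelope linear thermal expansion: dL = alpha * L * dT.
alpha = 12e-6        # per degree C, an approximate handbook value for mild steel
length_mm = 300.0    # assumed length of the heated rod
delta_t = 150.0      # assumed temperature rise, degrees C

delta_l_mm = alpha * length_mm * delta_t
print("expansion: %.2f mm" % delta_l_mm)   # about 0.54 mm, easy to see on a dial indicator
```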

Non-uniform heating of a material causes deflection: the spots where heat is applied expand while the cooler parts stay the same. The most interesting application of this that I've personally seen is how some car turn signals work. There is a little metal wafer with a heating coil running down one side. When you turn on the turn signal, the coil heats up and within a little less than a second it warps the wafer (just like the rod does in my video, but much more dramatically) so that it closes an electrical connection and lights up the turn signals on the outside of the car. Since the heating coil is bypassed once the contact closes, it rapidly cools down and the wafer returns to its original shape. This breaks the circuit, the coil heats up again, and the cycle starts over. When I first took apart one of these relays I was astounded that they hadn't used solid-state electronics; it just seemed like electronics would be a better, far more predictable way of doing this. But after some contemplation, the wafer-and-coil setup is probably cheaper to make, and it's not like car turn signals need clock-like accuracy anyway. If it works, it works; why change it?
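To make that heat-warp-cool cycle concrete, here's a toy simulation of the flasher as an on/off oscillator with hysteresis. Every constant here (heating rate, cooling rate, snap temperatures) is invented for illustration, not measured from a real relay.

```python
# Toy model of a thermal flasher: the coil heats the wafer until it snaps closed,
# the closed contact bypasses the coil so the wafer cools and snaps back open, repeat.
# All constants are made-up illustration values.
heat_rate = 90.0    # degrees C per second while the coil is energized
cool_rate = 60.0    # degrees C per second while the coil is bypassed
snap_hot = 80.0     # wafer warps closed above this temperature
snap_cold = 40.0    # wafer relaxes open below this temperature
temp, closed, dt = 20.0, False, 0.01

for step in range(600):                         # simulate six seconds
    temp += (-cool_rate if closed else heat_rate) * dt
    if not closed and temp >= snap_hot:
        closed = True
        print("%.2fs  click: lamps ON" % (step * dt))
    elif closed and temp <= snap_cold:
        closed = False
        print("%.2fs  click: lamps OFF" % (step * dt))
```

Run it and you get a steady on/off pattern roughly once a second, the same sort of cycle the relay produces.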

Sometimes I think we succumb to the desire to change things because we have identified a better solution. But without looking at the original requirements and weighing the cost, this can turn into a waste of time and effort. I'm not saying we shouldn't look for better solutions, just that it's a good idea to make sure that, in the context of the application and its mission, they actually are better. The most recent examples of this are people moving to cloud analytic frameworks and NoSQL. Most business intelligence data I've interacted with isn't that big, maybe a few hundred thousand records. Certainly some is larger, and everyone wants to crunch social media data, which is genuinely gigantic. But if your BI data can fit on a single hard drive, or even in the RAM of a single computer, then by all means stick with a simple setup! That's pretty much anything less than a terabyte of data... which is pretty much everybody. If you really need to figure out what people on Twitter or Facebook are saying about you, subscribe to services that analyse social media for you, or... use the native functionality of the social media services themselves. Strategy trumps technology in every BI scenario I can think of. Sometimes yesterday's technology can support today's mission just fine.

Forge update


I've had a few more chances to use and refine my forge. There have been two major changes since the last time I blogged about it:

1. I shortened the air inlet pipe. It turns out that the pipe a few inches below the fire pot never gets warm even with extended use, so there's no risk of melting the hair dryer where it's located now. The shorter pipe gets in the way less and improves the overall stability.

2. I added a shield around the fire pot. It's just a piece of sheet metal attached with rods that I pounded into the brake rotor cooling fins. I attached it in two pieces and left some of it unattached at the back joint so that I can bend open the back part of it in case I need to heat the middle of something longer. If you look carefully in the picture you can see the opening.



Here is a picture of the anvil I've pulled together. It took me a few weeks of searching before I found a good log to mount it on. This anvil will be left outside most of the time, so I cut the angles on the log with a chain saw and put a coat of boiled linseed oil on it for longevity. Occasional boiled linseed oil on the anvil has kept the rust away as well. Note: if you use rags to apply linseed oil, be careful how you dispose of them. Linseed oil creates heat as it dries and has been known to cause fires when the rags are left in piles or thrown in the trash.

The forge lives outside as well. One concern I have is that the oxidation seems to be accelerated at the points where the most heat is concentrated. I suspect this is because the heat burns off any protective oils that may have been on the metal. I've taken care to try to clean it up better when I stop using it. Whereas I used to just leave the coals and coke in the forge when I was done, caring for it properly means waiting for the fire pit to cool down, cleaning out the coke and dust and then wiping it with an oily rag.

The first thing I attempted to create with the forge was a pair of tongs from some rebar I had been using in my garden. They aren't pretty, but they were made completely on the new forge and work great. I drifted the holes for the rivet and locking bar. Now I won't have to use my good pliers in the fire. If these burn up I can always fix them or make a new (better) set.

One of my future projects will probably be a propane forge. It was fun to learn how to use coal, but the logistics of getting it are too difficult. I've had to import what little I have from Pennsylvania. Charcoal is an option but the briquettes don't work well. They seem to burn fine but when you disturb them the air from the forge causes them to explode into a shower of glowing hot sparks. I ruined a shirt and had tiny burn marks on my face and arms for a while. Safety glasses and gloves are a must when getting anywhere near this project. It will be easier to use common grill propane once I figure out how to do that.