Wednesday, September 4, 2013

Three t-shirts this week

Woke up this Sunday morning (today) and had a choice of a few new t-shirts to wear while running errands and playing with my kid.

Last Saturday I ran and finished a Super Spartan in Virginia. I received two t-shirts there: one was a team shirt that the 8 guys I ran with had made, the other was a shirt for finishing. I was far from dominating, but I did finish. I was not ready for the 8 miles of obstacle race to be up and down ski slopes. Next time somebody talks me into one of these, I'm going to train harder.

The following Wednesday I flew to UC Berkeley to attend the AMP Lab's big data bootcamp, AMP Camp. I didn't have to climb any obstacles to get a t-shirt there, but the course material was excellent. Although I've been reading about and playing with components of their big data stack, BDAS, for a few months, I learned a great deal and was introduced to a couple of new pieces of software: BlinkDB and MLBase. I was more excited about the latter; making ML easier is more exciting to me than tuning query response time/accuracy at the DB level, but they both have potential. Spark/Shark I already consider successful, and I'm even more of a fan now. A lot of the code for all the software covered was written in the last few months. I expect the newer projects to change a lot, even more than the rest of the big data software has over the last year or so. They outlined a road-map, but it's a long way from here to there. I thought the best presentation was on Mesos. I'm excited to explore Chronos more, which is written in Scala on top of Mesos. But since Chronos isn't an AMP Lab piece of software, it was only mentioned and not covered in detail.

They provided a 5-node EC2 cluster for each attendee in the class, for which they emailed us private keys. I thought this was a great way to step through the exercises. In previous training I've taken on this topic, I've had to bring a beefy laptop and spin up enough virtual machines to emulate a real cluster. It does require considerably more capital, though. On EC2, five large nodes cost about $2.88 per hour, which is more expensive than free but well worth the money to crunch on 20 GB+ of data as part of the training exercises.
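
To put that hourly rate in perspective, here is a minimal Python sketch of the arithmetic. The $2.88 per hour figure for the whole cluster is the one quoted above; the session lengths and function names are just illustrative assumptions, and the sketch ignores storage, data transfer, and however partial hours get billed.

```python
# Rough cost estimate for a short-lived EC2 training cluster.
# The $2.88/hour figure for five large nodes comes from the post;
# the session lengths below are made-up examples.

CLUSTER_RATE_PER_HOUR = 2.88  # USD per hour for the whole 5-node cluster
NODES = 5


def per_node_rate(cluster_rate: float = CLUSTER_RATE_PER_HOUR,
                  nodes: int = NODES) -> float:
    """Hourly cost of a single node, assuming all nodes cost the same."""
    return cluster_rate / nodes


def session_cost(hours: float,
                 cluster_rate: float = CLUSTER_RATE_PER_HOUR) -> float:
    """Cost of keeping the whole cluster up for `hours` hours, in USD."""
    return round(hours * cluster_rate, 2)


if __name__ == "__main__":
    print(f"per node: ${per_node_rate():.3f}/hour")
    for hours in (1, 4, 8):  # hypothetical session lengths
        print(f"{hours} h: ${session_cost(hours):.2f}")
```

So even a full working day on the cluster comes to roughly the price of a couple of lunches, as long as I remember to shut it down afterwards.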

In the airport on the way back home I managed to duplicate most of the cluster provided during the class by stepping through this article on Amazon: http://aws.amazon.com/articles/4926593393724923
One hiccup was that the elastic-mapreduce program is not compatible with the latest version of Ruby. To make it more challenging, installing the Ruby Version Manager (rvm) with apt-get doesn't work on Ubuntu 13.04. So I had to install rvm from binary, then run rvm use 1.8.7 or something like it, and then the Amazon map-reduce script worked great. Kind of. The first cluster I tried bringing up failed when Hadoop didn't start. On my next try, it worked. This is all still alpha software... Impressive nonetheless, since I was able to get the functionality we covered in class on my own with only a bit of mucking around. That's kind of like a Spartan race obstacle...

I'm wearing my AMP Camp t-shirt today...