Sunday, March 17, 2013

Bulldozer or a bus?

Is your system a Bulldozer or a Bus? It can't be both at the same time.

This is a quick introduction to a metaphor. I made it up last week; maybe it will catch on. Big data systems fall into one of two categories: they are either a bulldozer or a bus.

I know this is a picture of an excavator. "Bulldozer" makes for better alliteration, and both machines fit our metaphor.

First: Bulldozers. Bulldozers have relatively few users (the drivers), and they need to move a lot of dirt (data). All of the mechanics and resources of the system are aligned for this one purpose: moving (computing) lots of stuff. Bulldozers are often used to draw inferences or to build things that serve the purposes of buses. I hate to risk stretching the metaphor too far, but you can compare the creation of pre-computed indexes to bulldozers creating "roads" for buses to drive on. It's a great example of how these different types of systems support each other.
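
To make the "roads" idea concrete, here is a minimal sketch in Python. The event data and the function names (build_index, visitors) are made up for illustration and do not come from any particular product. The bulldozer step makes one heavy pass over all the data; the bus step is a cheap lookup that many users could call at once.

```python
from collections import defaultdict

# Hypothetical raw event log of (user, page) pairs. Illustrative data only.
EVENTS = [
    ("alice", "home"),
    ("bob", "home"),
    ("alice", "pricing"),
    ("carol", "docs"),
    ("bob", "docs"),
]


def build_index(events):
    """Bulldozer step: one heavy pass over all the data, run by few users."""
    index = defaultdict(set)
    for user, page in events:
        index[page].add(user)
    # Freeze into plain sorted lists so later lookups are read-only and cheap.
    return {page: sorted(users) for page, users in index.items()}


def visitors(index, page):
    """Bus step: a cheap lookup that many simultaneous users can call."""
    return index.get(page, [])


# Build the "road" once, then let the buses drive on it.
INDEX = build_index(EVENTS)
print(visitors(INDEX, "home"))
```

The point of the split is that the expensive scan happens once, away from the passengers, so each request touches only the finished index.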

Buses, unlike bulldozers, are designed to handle lots of users (passengers) on the system. Computing resources are allocated, optimized, and balanced to handle the processing required when dealing with many simultaneous users.

The "data physics" of bulldozers and buses are different. Mixing these roles leads to poor performance and upset stakeholders.
Some specific examples:
Buses:
  • Cloud Foundry
  • Heroku
  • CloudFront
  • App Engine
  • OpenShift
  • Cloudify
  • AppFog
Bulldozers:
  • Most of the Hadoop ecosystem
  • Most of the Greenplum products
  • Datameer
  • InfoSphere
Software that can support either type of system, depending on context (and so could easily be mistaken for one or the other):
  • Any of the Infrastructure as a Service systems (AWS, Azure, Rackspace, VMware)
  • Much of the database technology out there (MySQL, SQL Server, Oracle DB, SAS)
  • Most visualization software
Even programming languages can loosely be categorized with these metaphors. Any general-purpose programming language can serve as a functioning component on a bus, but languages such as R and MATLAB are designed to be used as parts on bulldozers.

One of the defining themes of the big data era has been an awareness that bulldozer tasks are not ephemeral, and more organizations are investing in bulldozer systems specifically to support them. In my opinion, the biggest danger comes when folks spend money to load bulldozer features onto a bus to try to make it perform better. It often doesn't work, but it plausibly could in some circumstances, which is why people keep trying to do it. Another common occurrence is trying to allow too many users on a bulldozer. Bulldozers make poor buses. Trying to use bulldozers as buses not only destroys bulldozer performance, but usually assumes a level of subject matter expertise that isn't present. Not everyone is qualified to be a driver of a powerful and flexible system.

A good metaphor will help explain the differences and help guide expectations for both yourself and your customers as you work with these very different systems in your enterprise.