Wednesday, May 22, 2013

Data Science Dreams

Recently, I was asked in an email what my dream data science environment would contain. This is what I ended up writing back. I wrote it quickly and immediately, so I didn't have much time to pontificate or doubt myself; there's an extra level of candor to my response.  

Things I've seen work well:

1. Ability to support rapid micro-acquisitions in order to support research pivots. A reliable admin clerk / very junior analyst to track the micro-acquisitions so they can be accounted for and re-used when appropriate. (Think how often we've all seen wasted software licenses lying around...)
2. After a goal has been identified, a set-aside time (even just a day) to think outside the box, read about potentially useful technologies, and decide on the course of action most likely to lead to success before just trying to deliver something or use the same old hammer.
3. A compelling and clear mission statement, and regular re-iteration of leadership intent. 
4. Low-stress leadership. Nobody ever wants to be bullied, especially people who think for a living.
5. Rotational personnel assignments. Give everybody a chance to leave mental baggage behind and work on a new problem area at least every 6 months or a year.
6. Everyone, even senior leadership, should have a basic understanding of computer science "physics". This is our profession; if you're going to sit in meetings and make decisions on this stuff, you need to put in the due diligence to know your profession. E.g.: "why indexes are important", "what does normalized and denormalized mean", "structures of common file formats", "speed differences between cache, RAM, disk, and network", "why computers use binary", "basic understanding of fundamental algorithms: hashing, sorting, combinatorics, correlation", and everything else in computer science 101 books.
7. Recognize contributions to the team rather than individual esteem. This is a tough nut to crack, but if individuals are put on pedestals it begins a long, slow rot of overall organizational morale. Folks will start hiding results from each other, the environment becomes uncomfortably competitive, and self-advancement and self-focus become the primary goal rather than the advancement of the science. It might be that all organizations tend this way; rotational assignments may help keep it from taking root.
8. Just like digging for gold or diamonds, expect to move a lot of dirt first. When you find diamonds, understand there's probably a lot of dirt to be moved to find the next one. Avoid over-using the past successes to perpetuate the prestige of the organization. If there hasn't been a major success in the last 6 months, your people have given up looking and are resting on their laurels.
9. ETL pipeline... bah, whatever. Maintain low-performance authoritative sources instead. This allows you to accept data very easily without the pain of trying to figure out ETL every time a new data source comes in or becomes available. Get the data in a sharded flat file format on a SAN, as close to the raw format as possible. Record the provenance and origin. Call this your authoritative data source. As use cases arise, move subsections of the data into appropriate technologies or data structures for analysis. Keep the extract, transform, and load code snippets in an open and well-labeled repository to avoid re-doing this work and to provide a path for duplication/verification of success.
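
To make item 9 a little more concrete, here is a rough Python sketch of what "accept the data raw, shard it, and record where it came from" might look like. It is only an illustration under my own assumptions: the function name `land_raw_file`, the `shard-NN` directory layout, the hash-based sharding, and the fields in the provenance record are all made up for this example, and a real store on a SAN would need more care (permissions, duplicates, large files).

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def land_raw_file(source_path, store_root, origin, num_shards=16):
    """Copy a raw file, byte for byte, into a sharded flat-file store.

    Hypothetical helper: the layout and record fields are assumptions
    for illustration, not a standard.
    """
    source_path = Path(source_path)
    raw = source_path.read_bytes()  # no parsing, no transformation
    digest = hashlib.sha256(raw).hexdigest()

    # Pick a shard from the content hash so files spread out evenly.
    shard = int(digest[:8], 16) % num_shards
    shard_dir = Path(store_root) / f"shard-{shard:02d}"
    shard_dir.mkdir(parents=True, exist_ok=True)

    # Store under the hash, keeping the original extension.
    stored = shard_dir / f"{digest}{source_path.suffix}"
    stored.write_bytes(raw)

    # Sidecar record: where the data came from and when it arrived.
    record = {
        "origin": origin,
        "original_name": source_path.name,
        "sha256": digest,
        "size_bytes": len(raw),
        "received_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = Path(str(stored) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return stored, record
```

Calling something like `land_raw_file("feed.csv", "/mnt/san/authoritative", origin="vendor email, May 2013")` (path and origin text are placeholders) leaves the original bytes untouched next to a small JSON file describing them. Any use-case-specific extract, transform, and load step then reads from this store and lives in its own snippet in the shared repository.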


It may be surprising that most of my dream criteria deal with how people interact with each other. You'll probably notice from my other blog posts that I focus a lot on that. People are the most important thing every time. If they aren't working well together, it doesn't matter how good your technology or physical environment is.