Wednesday, May 22, 2013

Data Science Dreams

Recently, I was asked in an email what my dream data science environment would contain. This is what I ended up writing back. I wrote it quickly and all in one sitting, so I didn't have much time to pontificate or doubt myself; there's an extra level of candor to my response.

Things I've seen work well:

1. Ability to make rapid micro-acquisitions in order to support research pivots, plus a reliable admin clerk / very junior analyst to track those micro-acquisitions so they can be accounted for and re-used when appropriate. (Think how often we've all seen wasted software licenses lying around...)
2. After a goal has been identified, set-aside time (even just a day) to think outside the box, read about potentially useful technologies, and decide on the course of action most likely to succeed, rather than just trying to deliver something or reaching for the same old hammer.
3. A compelling and clear mission statement, and regular re-iteration of leadership intent. 
4. Low-stress leadership. Nobody ever wants to be bullied, especially people who think for a living.
5. Rotational personnel assignments. Give everybody a chance to leave mental baggage behind and work on a new problem area at least every six months to a year.
6. Everyone, even senior leadership, should have a basic understanding of computer science "physics". This is our profession; if you're going to sit in meetings and make decisions about this stuff, you need to put in the due diligence to know your profession. E.g.: why indexes are important, what normalized and denormalized mean, the structures of common file formats, the speed differences between cache, RAM, disk, and network, why computers use binary, a basic understanding of fundamental algorithms (hashes, sorters, combinatorial, correlation), and everything else in computer science 101 books. (A toy illustration of the index point follows this list.)
7. Recognize contributions to the team rather than individual esteem. This is a tough nut to crack, but if individuals are put on pedestals it begins a long, slow rot of overall organizational morale. Folks start hiding results from each other, the environment becomes uncomfortably competitive, and self-advancement and self-focus become the primary goal rather than the advancement of the science. It may be that all organizations tend this way; rotational assignments may help keep it from taking root.
8. Just like digging for gold or diamonds, expect to move a lot of dirt first. When you find a diamond, understand there's probably a lot more dirt to move before you find the next one. Avoid over-using past successes to perpetuate the prestige of the organization. If there hasn't been a major success in the last six months, your people have given up looking and are resting on their laurels.
9. ETL pipelines... bah, whatever. Maintain low-performance authoritative sources instead. This allows you to accept data very easily, without the pain of trying to figure out ETL every time a new data source comes in or becomes available. Get the data into a sharded flat-file format on a SAN, as close to the raw format as possible. Record the provenance and origin. Call this your authoritative data source. As use cases are required, move subsections of the data into appropriate technologies or data structures for analysis. Keep the extract, transform, and load code snippets in an open and well-labeled repository to avoid re-doing this work and to provide a path for duplication/verification of success. (A sketch of this layout follows the list as well.)
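
On the "why indexes are important" point in item 6, here is a toy Python comparison, entirely my own illustration rather than anything from the original email, of a linear scan against a hash lookup:

    import time

    records = list(range(5_000_000))       # pretend this is an un-indexed table
    indexed = set(records)                 # pretend this is the index (a hash table)
    target = 4_999_999

    start = time.perf_counter()
    _ = target in records                  # linear scan: checks every element
    scan_seconds = time.perf_counter() - start

    start = time.perf_counter()
    _ = target in indexed                  # hash lookup: roughly constant time
    lookup_seconds = time.perf_counter() - start

    print(f"list scan:   {scan_seconds:.4f} s")
    print(f"hash lookup: {lookup_seconds:.6f} s")

The scan has to touch every element while the lookup does not, and that gap only grows with data size; that is the whole argument for indexes, and it's the kind of intuition I want everyone in the meeting to have.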
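
And for item 9, a minimal Python sketch of what I mean by a low-performance authoritative source. The paths, sharding rule, and field names are invented for illustration, not a prescribed layout; the point is only that raw data lands untouched and provenance is recorded beside it:

    import hashlib
    import json
    import shutil
    import time
    from pathlib import Path

    AUTHORITATIVE_ROOT = Path("/san/authoritative")   # hypothetical SAN mount point

    def land_raw_file(source_path, origin_note):
        """Copy a raw file into the authoritative store and write a provenance record beside it."""
        src = Path(source_path)
        shard = src.name[:2] or "misc"                 # crude sharding by filename prefix
        dest_dir = AUTHORITATIVE_ROOT / shard
        dest_dir.mkdir(parents=True, exist_ok=True)

        dest = dest_dir / src.name
        shutil.copy2(src, dest)                        # keep the bytes as raw as possible

        provenance = {
            "original_path": str(src),
            "origin": origin_note,                     # who provided it and under what terms
            "received_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
            "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
        }
        dest.with_name(dest.name + ".provenance.json").write_text(
            json.dumps(provenance, indent=2))
        return dest

The use-case-specific extract/transform/load snippets that pull from a store like this would then live in their own labeled repository, so the next person can find, reuse, and verify them.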


It may be surprising that most of my dream criteria deal with how people interact with each other. You'll probably notice from my other blog posts that I focus a lot on that. People are the most important thing, every time. If they aren't working well together, it doesn't matter how good your technology or physical environment is.

Wednesday, May 15, 2013

Secret sauce algorithms

I think too often we take algorithmic success at the face value of the claims being made. Algorithms are discussed as though they should be kept secret to protect them, but in my opinion it's at the very least irresponsible, and in some rarer circumstances intentionally misleading and unethical, to keep algorithms locked up in secret. "The Importance of Not Being Different" is an essay Bruce Schneier wrote in 1999. That's 14 years ago now, but I still find myself referencing the points he made when talking about data systems and algorithms, and not just cryptography algorithms. Imagine that we applied the same level of diligence to medicine that we do to our information systems and missions. Suppose your doctor said: "I realize antibiotics are proven through decades of research to cure your condition with no harmful side effects, but I've developed my own better cure that I've only tested on a few of my patients, and if you pay me I'll give it to you." Would you take the pill? If you wouldn't bet your life on it, why would you bet your organization's mission on it?

Algorithms, like medical research, need peer review and open, objective study before the claims can be validated and the results trusted. Don Knuth's series of books titled "The Art of Computer Programming" started chronicling programming algorithms and their history and analysis in 1962. The latest volume was published in 2011, and there are more on the way. When fishing for the right way to approach a data problem, they are the most comprehensive resource I know of for finding good algorithms. If a method is too new to be included in a resource like TAOCP, you'll probably find it addressed in research papers or on blogs (like this one).

Once you know the name of the approach you need, a quick Google search will almost always turn up implementations of the algorithm in whatever programming language you want. Thanks to open-source software and people sharing solutions online, the availability of good implementations has never been better, and all indicators point to this trend continuing. What does that say about the future of software to you? To me it speaks to a more capable future where real value is shared freely and not locked up.

The next time somebody approaches you and claims their software does X better because of their "secret sauce" algorithm, be as skeptical as you would be of a doctor offering you a "secret ingredient" pill. Actually, there's a time-tested term for fraudulent health products: "snake oil". Maybe it's time for an algorithm-specific version of that term. If you spend money on closed-source software, buy it because the interface works and the implementation is good, not because the proprietary algorithms are advertised as "better".