Recently, I was asked in an email what my dream data science environment would contain. This is what I ended up writing back. I wrote it quickly and immediately, so I didn't have much time to pontificate or doubt myself; there's an extra level of candor to my response.
Things I've seen work well:
1. Ability to support rapid micro-acquisitions in
order to support research pivots. A reliable admin clerk / very junior
analyst to track the micro-acquisitions so they can be accounted for and
re-used when appropriate. (Think how often we've
all seen wasted software licenses lying around...)
2. After a goal has been identified, a set-aside time (even just a day)
to think outside the box, read about potentially useful technologies,
and decide on the course of action that most probably leads to success
before just trying to deliver something or use the
same old hammer.
3. A compelling and clear mission statement, and regular re-iteration of it.
4. Low-stress leadership. Nobody ever wants to be bullied, especially people who think for a living.
5. Rotational personnel assignments. Give everybody a chance to leave
mental baggage behind and work on a new problem area at least every 6
months or a year.
6. Everyone, even senior leadership, should have a basic understanding of
computer science "physics". This is our profession; if you're going to sit in
meetings and make decisions on this stuff you need
to put in the due diligence to know your profession. E.g.: "why indexes
are important", "what do normalized and denormalized mean", "structures
of common file formats", "speed differences between cache, RAM, disk, and
network", "why computers use binary", "basic
understanding of fundamental algorithms: hashes, sorts, combinatorial,
correlation" and everything else in computer science 101 books.
7. Recognize contributions to the team rather than individual esteem.
This is a tough nut to crack, but if individuals are put on pedestals it
begins a long slow rot of overall organizational morale. Folks will start
hiding results from each other, the environment
becomes uncomfortably competitive, and self-advancement and self-focus
become the primary goal rather than the advancement of the science. It
might be that all organizations tend this way; rotational assignments
may help keep it from taking root.
8. Just like digging for gold or diamonds, expect to move a lot of dirt
first. When you find a diamond, understand there's probably a lot of dirt
to be moved to find the next one. Avoid over-using past successes
to perpetuate the prestige of the organization.
If there hasn't been a major success in the last 6 months, your people
have given up looking and are resting on their laurels.
9. ETL pipeline... bah, whatever. Maintain low-performance authoritative
sources instead. This allows you to accept data very easily without the
pain of trying to figure out ETL every time a new data source comes in
or becomes available. Get the data in a sharded
flat file format on a SAN as close to the raw format as possible. Record
the provenance and origin. Call this your authoritative data source. As
use cases are required, move subsections of the data into appropriate
technologies or data structures for analysis.
Keep the extract, transform, and load code snippets in an open and well-labeled repository to avoid re-doing this work and to provide a path for
duplication/verification of success.
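
To make item 9 a little more concrete, here is a rough sketch in Python of what "accept the raw file, shard it, and record where it came from" could look like. Everything specific in it is my own assumption for illustration, not a prescription: the function names `shard_for` and `register_raw_file`, the `manifest.jsonl` file, the 16-shard default, and the fields in the record. The store could just as well be a mount point on a SAN.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def shard_for(name: str, n_shards: int = 16) -> str:
    """Pick a stable shard directory from a hash of the file name (hypothetical scheme)."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return f"shard_{int(digest, 16) % n_shards:02d}"


def register_raw_file(src: Path, store: Path, origin: str, n_shards: int = 16) -> dict:
    """Copy a raw file into the store unchanged and append a provenance record."""
    shard_dir = store / shard_for(src.name, n_shards)
    shard_dir.mkdir(parents=True, exist_ok=True)
    dest = shard_dir / src.name
    shutil.copy2(src, dest)  # bytes are copied as-is: no transform on intake

    # Assumed record layout: where the file lives, who supplied it,
    # a checksum for later verification, and when it arrived.
    record = {
        "file": dest.relative_to(store).as_posix(),
        "origin": origin,
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
        "received_utc": datetime.now(timezone.utc).isoformat(),
    }
    with (store / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

Calling `register_raw_file(Path("incoming.csv"), Path("/mnt/store"), origin="vendor email")` (both paths are placeholders) copies the file untouched and adds one line to the manifest. Any extract/transform/load code for a particular use case then reads from the store and lives in the shared repository, so the raw copy stays the single source of truth.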
It may be surprising that most of my dream criteria deal with how people interact with each other. You'll probably notice from my other blog posts that I focus a lot on that. People are the most important thing every time. If they aren't working well together it doesn't matter how good your technology or physical environment is.