Monday, July 7, 2014

curation is not preparation

I was on an email thread where somebody mentioned spending 80% of their time on data curation. I wrote this in response.

I think there's a semantic error we're drifting into. I think the original thread below mixes up the activity of curation and data preparation. Data prep does take a lot of time, but curation is a different thing. I'll try to describe what I'm thinking here. Curation is more of an act of protecting integrity for later use, analysis, or reproducibility. Like a museum curates. The tools for data preparation are awesome and very useful, but move beyond what is just a curation activity which I'd describe as something more fundamental.

data curation != data preparation

Provenance is always paramount when making decisions off of data. That's why trustworthy persistent storage of the raw data is most important. Any transform should not only be documented but should be reproducible.  Canonical elements of a curation system would be the raw data storage (original formats and as attained), transform code, and documentation.

Enabling discovery (locating, standardized access protocols, etc...) starts getting into something beyond curation. The connotation of curation implies preserving the scientific integrity of the data. Like dinosaur bones or artifacts being curated in a museum. Some of them are on display (available via "standardized protocols" on display), but the rest are tucked away safely in a manner that doesn't taint later analysis of them. More often than not, the bones on display are actually semi-faithful reproductions of the artifacts rather than the original artifact. Same thing with data. The graph visualization (or whatever) of the data might not be technically still the same data (different format, projection, normalized, transforms, indexes, geocoding, etc...) but it's a faithful reproduction of it that we put in the display case to educate others about the data. Like a fiberglass T-rex skull tells us a lot about the real thing, it's not meant for nuanced scientific analysis. All transforms of data, especially big data, contain an element of risk and loss of fidelity which would taint later analysis. We're all so bad at transforms that we avoid using them in cases where a life is at risk (court proceedings or military intelligence analysis) processes require citation of raw data sets. A geocoding API rarely assigns location with 100% accuracy (it's usually an exception when they do), sometimes when we do things like normalize phone numbers there's an edge case that the regular expressions don't account for, things can go wrong in an unlimited number of ways (and have... I've caught citogenesis happening within military intelligence analysis several times). The only way to spot these problems later and preserve integrity of the data is to store it in it's most raw form. If we wish to provide access for others to a projection they want to build off, the best way to do it would be to share the raw, the transform code, and a document showing the steps to get to the projection. In the email below this behavior of later analysts/scientists is noted (with disdain?). It shouldn't take long to look at previous transforms and reproduce the results, if it does then those transforms weren't that reliable anyway. If those receiving the data just want to look at a plastic dinosaur skull to get an idea of it's size and gape in wonder, then sharing data projections (raw data that has undergone a transform) is fine.

When providing curated data for research or analysis, I even make it a point to keep a "control copy" of the data in an inaccessible location. That way if there is a significant finding there is a reliable way to determine that it's not because the data became tainted during an inadvertent write operation on it.

On the other end of the spectrum (which I see all the time) is the unmarked hard drive with random duplicative junk dumped on it as an afterthought to whatever project the data originated from. Although "no" is never an answer when handed something like this and certainly useful analysis can be achieved, this is below the minimum. I imagine it's like being handed evidence for a murder case in a plastic shopping bag with the previous investigator's half-eaten burrito smeared all over it.  You can sometimes make the case for a decision with it, but it's not easy and it's a dirty job trying to make sense of the mess. This is probably more the norm when it comes to "sharing" curated data in government and industry. It's ugly.

The minimum set of things needed for useful data curation:
1. raw data store (just the facts, stored in a high integrity manner)
2. revision control for transform code (transform code, applications, etc...)
3. documentation (how to use transforms, why, provenance information)

Everything beyond this could certainly be useful (enabling easier transforms, discovery APIs, web services, file transfer protocol interfaces), but is beyond the minimum for curation. Without these three basic things, useful curation breaks.