Monday, November 26, 2012

Sharing Data for Analysis - appropriate standards and metadata

This was originally posted on the Data Tactics blog.

Sharing Data in 2013 and beyond
Selecting appropriate levels of standardization and the challenges of metadata

Eric Whyne
Technical Manager
Data Tactics Corporation
November 2012

Efforts at sharing data within the US Department of Defense and Intelligence Community, especially on the classified networks, have historically focused on rigid edicts to standardize how data is stored and accessed.  The logic is easy to follow: if data is stored in a standardized manner, then we can all write software to the standards.  If there are only a few ways to access the data, we should spend less time writing code to access it.  This approach was more reasonable when sources of data were fewer and there was time to bring interfaces into compliance before the data was published to a wider audience.  This way of thinking, unfortunately, persisted into the fast-paced, mission-focused explosion of data created by the post-9/11 wars in the Middle East, OEF and OIF.  Almost overnight, timeliness became far more important than standards compliance, and rightly so; lives were on the line in real time.  Some national organizations stuck to the old way at the cost of efficiency, and some organizations adapted well to the new way of thinking.  Some other national organizations, due partly to opposing internal views on this, found their way to the middle ground.  Names are withheld to protect the guilty.  In the meantime, important data continued to be generated, stored, and shared by individuals and organizations both inside and outside of the intended audiences of the standards.  It is my intent to provide a thoughtful discussion of the more important aspects of data sharing and to provide useful information for engineering decision makers who are thinking about undertaking a data sharing architecture, caring for and feeding one, or otherwise involved in planning a data sharing system.

When it was expected that data standards would be adhered to and kept stable, a data ingestion approach known as Extract, Transform, and Load (ETL) was widely adopted.  ETL required knowledge of data formats before the data could be brought into the system and made accessible.  Staging areas required awareness of the data in order to maintain its fidelity.  This created a problem when data formats changed (as they always do) or when data was published that differed from the standard.  Something as simple as changing an integer value to a decimal (float) could throw a wrench into data sharing, causing days' worth of lost data until the problem was identified and fixed.  Try to address this for hundreds of different data sources and it’s easy to see how O&M costs for data sharing architectures can skyrocket and quality can greatly suffer.
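To make the integer-to-decimal failure concrete, here is a minimal sketch of a strictly typed staging step.  The schema, field names, and feed below are hypothetical and invented for illustration; they are not taken from any real system.

```python
# Hypothetical staging schema: field name -> the exact Python type required.
STAGING_SCHEMA = {"sensor_id": str, "height": int}


def stage_record(record, schema=STAGING_SCHEMA):
    """Return a staged row, or raise if a field is missing or wrongly typed."""
    row = {}
    for field, expected in schema.items():
        value = record[field]  # KeyError if the field is missing
        if type(value) is not expected:
            raise TypeError(
                f"{field}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        row[field] = value
    return row


def run_etl(records):
    """Stage every record that matches the schema and reject the rest."""
    staged, rejected = [], []
    for record in records:
        try:
            staged.append(stage_record(record))
        except (KeyError, TypeError) as err:
            rejected.append((record, str(err)))
    return staged, rejected


# The source starts publishing more precise decimal heights mid-feed.
feed = [
    {"sensor_id": "a1", "height": 12},
    {"sensor_id": "a1", "height": 12.4},
    {"sensor_id": "a1", "height": 12.7},
]

staged, rejected = run_etl(feed)
print(len(staged), "staged;", len(rejected), "rejected")
```

Every record published after the format change is rejected until somebody notices and updates the schema, which is the lost-data window described above.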

An ETL cycle typically has the following steps:
1. Initiation (establish requirement, need to know, and memorandums of understanding)
2. Evaluate and Reference
3. Extract (from sources in native format)
4. Validate
5. Transform (clean, apply mission logic, check for data integrity, create aggregates or disaggregates, normalize or denormalize)
6. Stage (in typed value tabular staging tables)
7. Quality Control
8. Load (to target systems)
9. Archive (staged data, to ensure provenance and for quality control baseline)
10. Clean up

Rigid standardization of data publication not only increases cost, it reduces the quality of data, prevents it from being shared by raising overhead costs, creates barriers to entry for promising new data collection, and is antithetical to the core tenets of “big data”.  Big data architectures have been designed from the ground up to deal with poorly formatted data from heterogeneous systems.  There were two main enablers for this.  The first enabler was the mountain-moving power that distributed computing provided.  Moving computation to the data stores meant that we could spend more processor cycles evaluating each record and still keep systems scalable.  The second enabler was how widely accepted generic untyped data storage formats became.  I know, one moment I’m talking about how standards can be bad and the next I’m extolling the virtues of standardization; bear with me.  Some standardization is good.  XML and JSON provided a way to store data that could be sharded, was not strongly typed, and was completely extensible.  By sharding, I mean that rather than requiring all records to have the same number of fields of a certain data type, we could publish records with only the fields that were applicable and in whatever data type made sense.  In this manner we can generically store data without knowing much about it.  We could figure out how to use it later, and we had the processor cycles to do it in a scalable manner thanks to distributed processing.  Have floats or integers?  Not a problem!  Put whatever in the “height” attribute and we can figure it out later.  This “store now” and “process later” approach is how you deal with big data, and it lets us sidestep some nasty formatting surprises always lurking in large sets of data.  There is not yet a widely accepted term for this “store first then use” method.  Some papers and articles have referenced it as ELT (Extract, Load, Transform); others have described it as a SMaQ method (Store, Map, and Query).
Regardless of what it’s called, the concepts remain the same.
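
As a rough illustration of “store now” and “process later”, the sketch below appends records as JSON lines without inspecting them and only interprets the “height” attribute at read time.  An in-memory buffer stands in for a distributed file system, and the record contents are made up for the example.

```python
import io
import json


def store(sink, record):
    """Append one record as a JSON line without checking fields or types."""
    sink.write(json.dumps(record) + "\n")


def read_heights(source):
    """Interpret 'height' at read time, skipping values that cannot be parsed."""
    heights = []
    for line in source:
        record = json.loads(line)
        try:
            heights.append(float(record["height"]))
        except (KeyError, TypeError, ValueError):
            continue  # the record stays in the store; it is just not usable here
    return heights


# Stand-in for persistent storage (an assumption made for this sketch).
sink = io.StringIO()

incoming = [
    {"id": 1, "height": 12},               # integer
    {"id": 2, "height": 12.4, "unit": "m"},  # decimal, plus an extra field
    {"id": 3, "height": "13"},             # number published as text
    {"id": 4, "note": "no height reported"},  # field absent
    {"id": 5, "height": "tall"},           # not a number at all
]
for record in incoming:
    store(sink, record)

sink.seek(0)
heights = read_heights(sink)
print(heights)
```

All five records are stored, whatever their shape.  Only the reader decides what counts as a usable height, and that decision can change later without touching what was stored.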

An ELT or SMaQ cycle typically has the following steps:
1. Initiation (establish requirement, need to know, and memorandums of understanding)
2. Extract (from sources in native format)
3. Store persistently (on a distributed file system in a denormalized fashion)
4. Transform (index, apply logic, check data integrity, create aggregates or disaggregates)
5. Repeat step 4 ad infinitum as you find out new ways to get value from the data.
6. Quality Control

I’ve listed four fewer steps with this new way of doing things.  Let’s talk about why some of the steps are absent.  “Evaluate and Reference” is not needed because we don’t need to configure tabular relational data stores to accept strongly typed values.  We don’t care.  Because we get rid of that configuration overhead, there’s no need to conduct validation on the data.  Poorly formatted data from the other system won’t break us.  If they accidentally publish text where a number should be, we’re fine; our architecture will still store it in whatever form it was published in, and we’ll find it during our later checks on data integrity and then decide on a course of action.  Having the system break on or ignore poorly formatted data is unacceptable.  Oftentimes poorly formatted data turns out to be an update to the previous publishing standard that provides more or better information.  A simple example would be changing an integer to a decimal to increase the fidelity of a measurement.  If they start publishing decimal values that are more accurate, we want those!  Since we go right to persistent storage, we can skip the staging requirement.  Our persistent storage is distributed, and as a consequence we gain the level of fault tolerance built into most distributed systems.  This, combined with our ability to crunch massive amounts of data, means that archiving gets dropped as a needed step as well.  Since we avoid having to validate and stage, there is no need to have a separate cleanup or garbage collection step.  I’m sure that in certain cases there are good reasons to add or remove steps, but in general the “store now” and “figure it out later” way of doing things reduces the amount of up-front work for sharing data and allows us to focus on the most important work of extracting value from the data and applying it to our mission.
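
A later integrity check of this kind might look like the sketch below.  The stored records and the “speed” field are hypothetical; the point is only that badly formatted records are flagged for a decision rather than dropped or allowed to break ingestion.

```python
# Records already sitting in persistent storage, exactly as published.
RAW_STORE = [
    {"id": 1, "speed": 40},
    {"id": 2, "speed": 41.5},    # source upgraded to decimals
    {"id": 3, "speed": "fast"},  # text where a number should be
    {"id": 4},                   # field missing entirely
]


def integrity_pass(raw, field):
    """Split stored records into usable values and records needing review."""
    usable, flagged = {}, []
    for record in raw:
        value = record.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            usable[record["id"]] = float(value)
        else:
            flagged.append(record)
    return usable, flagged


usable, flagged = integrity_pass(RAW_STORE, "speed")
print("usable:", usable)
print("flagged for review:", [record["id"] for record in flagged])
```

The raw store is never modified, so the pass can be rerun with different logic whenever the team decides how the flagged records should be handled.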

Back to the DoD.  The Department of Defense Discovery Metadata Specification (DDMS) was created in support of the DoD Net-Centric Data Strategy.  The first standard was released in 2003, with multiple changes per year ever since (and everybody pretended standards would be stable).  It has lofty goals and specifies the attributes (fields) that should be used to describe any data or service that is made known to the DoD Enterprise.  I’ve heard rumors of it working great, but I have yet to experience a useful and working implementation of it when organizations’ claims of adoption or increased efficiency are put to the test.  The inability of a strict metadata standard to fix everything is not a statement about the quality of the standard or the competence of the authors.  The metadata approach itself is intrinsically fraught with difficult and sometimes insurmountable problems.  The requirement to standardize metadata amplifies those problems and creates new ones.  Although the frequent changes to the standard have certainly caused problems as time is wasted “catching up” in system implementation, locking the standard or working very hard to fix problems by modifying the standard again is not the right answer.  Any approach to comprehensively standardize metadata across heterogeneous systems can’t work.  This statement alone can cause heart palpitations and weeping when said in front of the wrong people, so be careful when repeating these ideas.  It’s common sense that “standardizing” is a good thing; to go against that wisdom you need to approach its disciples with patience and build your arguments for common sense engineering over time.  Two years prior to the DDMS standard being published, Cory Doctorow popularized the term Metacrap with his essay titled “Metacrap: Putting the torch to seven straw-men of the meta-utopia”.  What the article lacks in tact it makes up for in eloquence.  The essay discusses the following obstacles to reliable metadata.
I’ve summarized here and provided some notes in the context of the US Department of Defense and Intelligence Community technology community.

People are lazy: Populating metadata isn’t their core mission, so they leave it blank.
People are stupid: People will still misspell classification markings even when there are compelling reasons to get it right.
Mission Impossible, know thyself:  Ever watch a program fill out a survey about itself?  It’s always glowingly positive.  Weird.  Think people are going to be accurate about the quality of their own data?  Think again.
Schemas are not neutral: People have dedicated entire careers to specifying taxonomies, ontologies, and schemas for various aspects of intelligence and warfare.  When they retire, somebody else starts from scratch with their own idea of how it should be done with their own biases derived from their perfectly valid experiences in a different part of the industry.
Metrics influence results: Moving decimal points around or changing units of notation destroys any hope of automating queries across data sets without manual intervention to correct these mistakes.
There's more than one way to describe something: Reasonable people can disagree forever on how to describe something. Was this document the result of a questioning, an interview, or an interrogation?
People lie: Or exaggerate their claims of accuracy or success.  This happens mostly through unintentional bias, but sometimes intentionally.
Data may become irrelevant in time: The language and the things important to analysis evolve as the approaches to the problems change.
Data may not be updated with new insights: Modifying shared data records means that the original information is lost.  Oftentimes pushing updates to the authoritative or original source is not possible.

These obstacles compound with the baseline overhead of learning and addressing the standard before publishing or sharing data, and together they create insurmountable requirements for sharing data in accordance with the standard.  This is, of course, a serious problem for smaller programs, but even many larger programs have not been able to get close to functioning implementations.
It’s easily understood that our unique programs produce data that is different in quantity, format, quality, and comprehensiveness.  Perhaps less intuitively obvious is that we also all use data differently.  This means that somebody else’s standard really just exposes an implementation of that data that nobody else really cares about.  A tactical user populating a situational awareness tool has a drastically different use case than an analyst constructing a network diagram.  Each of these uses requires the data to be stored and indexed in a different manner, not just for performance but simply to make access to it computationally feasible.
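
The two use cases above can be sketched as two different views built from the same raw reports.  The report fields, entity names, and grid labels here are invented placeholders, not a real reporting format.

```python
from collections import defaultdict

# The same raw reports feed both consumers.
REPORTS = [
    {"time": 1, "entity": "A", "grid": "grid-1", "contact": "B"},
    {"time": 2, "entity": "B", "grid": "grid-2", "contact": "C"},
    {"time": 3, "entity": "A", "grid": "grid-3", "contact": "C"},
]


def latest_positions(reports):
    """Situational-awareness view: the most recent grid for each entity."""
    latest = {}
    for report in sorted(reports, key=lambda r: r["time"]):
        latest[report["entity"]] = report["grid"]  # later reports overwrite
    return latest


def contact_graph(reports):
    """Network-analysis view: undirected adjacency sets between entities."""
    graph = defaultdict(set)
    for report in reports:
        graph[report["entity"]].add(report["contact"])
        graph[report["contact"]].add(report["entity"])
    return dict(graph)


positions = latest_positions(REPORTS)
graph = contact_graph(REPORTS)
print(positions)
print({name: sorted(neighbors) for name, neighbors in graph.items()})
```

Neither view is a useful standard for the other consumer, but both can be derived on demand as long as the reports are kept in their original form.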

Does all of this mean that standardization and metadata should be abandoned?  Of course not!  But a cautious level of prudence needs to be exercised.  When planning your next information sharing architecture, try loosening the standardization reins a little up front.  It will reduce the time you waste on O&M, improve the provenance and quality of your data, and free you up for success later as innovative folks find new ways to derive value from the data in its natural state.