Monday, November 26, 2012

Sharing Data for Analysis - appropriate standards and metadata



This was originally posted on the Data Tactics blog.

Sharing Data in 2013 and beyond
Selecting appropriate levels of standardization and the challenges of metadata

Eric Whyne
Technical Manager
Data Tactics Corporation
November 2012

Efforts at sharing data within the US Department of Defense and Intelligence Community, especially on the classified networks, have historically focused on rigid edicts to standardize how data is stored and accessed.  The logic is easy to follow: if data is stored in a standardized manner, then we can all write software to the standards.  If there are only a few ways to access the data, it should reduce the amount of time we spend writing code to access that data.  This approach was more reasonable when sources of data were fewer and there was time to bring interfaces into compliance before the data was published to a wider audience.  This way of thinking, unfortunately, persisted into the fast-paced, mission-focused explosion of data created by the post-9/11 wars in the Middle East, OEF and OIF.  Almost overnight, timeliness became far more important than standards compliance, and rightly so; lives were on the line in real time.  Some national organizations stuck to the old ways at the cost of efficiency, and some adapted well to the new way of thinking.  Other national organizations, due partly to internal opposing views, found their way to a middle ground.  Names are withheld to protect the guilty.  In the meantime, important data continued to be generated, stored, and shared by individuals and organizations both inside and outside of the intended audiences of the standards.  My intent is to provide a thoughtful discussion of the more important aspects of data sharing and useful information for engineering decision makers who are planning a data sharing architecture or caring for and feeding an existing one.

When it was expected that data standards would be adhered to and kept stable, a data ingestion approach known as Extract, Transform, and Load (ETL) was widely adopted.  ETL required knowledge of data formats before data could be brought into the system and made accessible.  Staging areas required awareness of the data in order to maintain its fidelity.  This created a problem when data formats changed (as they always do) or data was published that differed from the standard.  Something as simple as changing an integer value to a decimal (float) could throw a wrench into data sharing, causing days' worth of lost data until the problem was identified and fixed (a small code sketch of this failure mode follows the step list below).  Try to address this for hundreds of different data sources and it's easy to see how O&M costs for data sharing architectures can skyrocket and quality can greatly suffer.

An ETL cycle typically has the following steps:
1. Initiation (establish requirement, need to know, and memorandums of understanding)
2. Evaluate and Reference
3. Extract (from sources in native format)
4. Validate
5. Transform (clean, apply mission logic, check for data integrity, create aggregates or disaggregates, normalize or denormalize)
6. Stage (in typed-value tabular staging tables)
7. Quality Control
8. Load (to target systems)
9. Archive (staged data, to ensure provenance and provide a quality control baseline)
10. Clean up
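To make the brittleness concrete, here is a minimal, hypothetical sketch (in Python) of the strongly typed validation that sits between the extract and load steps.  The schema and field names are invented for illustration; the point is only that a record drifting from the agreed standard, such as a decimal where an integer was expected, gets rejected rather than stored.

```python
# Hypothetical strict staging check, as found in a traditional ETL flow.
EXPECTED_SCHEMA = {"id": int, "height": int, "label": str}

def stage_record(record: dict) -> dict:
    """Validate a record against the staging schema before it is loaded."""
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(
                f"field {field!r} is {type(record[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return record

for rec in ({"id": 1, "height": 72, "label": "ok"},
            {"id": 2, "height": 72.5, "label": "drifted"}):
    try:
        stage_record(rec)
        print("staged:", rec)
    except ValueError as err:
        print("rejected:", err)  # in a rigid pipeline, the whole feed stalls here
```

In a real architecture the rejection would surface as a broken feed or silently dropped records, which is exactly the failure mode described above.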

Rigid standardization of data publication not only increases cost, it reduces the quality of data, prevents it from being shared by raising overhead costs, creates barriers to entry for promising new data collection, and is antithetical to the core tenets of "big data".  Big data architectures have been designed from the ground up to deal with poorly formatted data from heterogeneous systems.  There were two main enablers for this.  The first was the mountain-moving power that distributed computing provided.  Moving computation to the data stores meant that we could spend more processor cycles evaluating each record and still keep systems scalable.  The second was how widely accepted generic, untyped data storage formats became.  I know, one moment I'm talking about how standards can be bad and the next I'm extolling the virtues of standardization; bear with me.  Some standardization is good.  XML and JSON provided a way to store data that could be sharded, was not strongly typed, and was completely extensible.  By sharding, I mean that rather than requiring all records to have the same number of fields of a certain data type, we could publish records with only the fields that were applicable and in whatever data type made sense.  In this manner we can generically store data without knowing much about it.  We can figure out how to use it later, and we have the processor cycles to do so in a scalable manner thanks to distributed processing.  Have floats or integers?  Not a problem!  Put whatever you want in the "height" attribute and we can figure it out later.  "Store now, process later" is how you deal with big data, and it lets us sidestep some of the nasty formatting surprises always lurking in large sets of data.  There is no widely accepted term for this "store first, then use" method yet.  Some papers and articles have referenced it as ELT (Extract, Load, Transform); others have described it as SMaQ (Store, Map, and Query).  Regardless of what it's called, the concepts remain the same.
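As a concrete, entirely hypothetical illustration of that flexibility, the snippet below stores a handful of heterogeneous JSON records exactly as published: one has an integer height, one a decimal, and one no height at all, and nothing is rejected at write time.  The file name and field names are made up for the example.

```python
import json

# Store heterogeneous records as-is: sparse fields, mixed types, no schema
# enforcement at write time.  Interpretation is deferred until read time.
records = [
    {"name": "sensor-7", "height": 120},     # integer height
    {"name": "sensor-9", "height": 120.5},   # decimal height, equally welcome
    {"name": "report-3", "text": "free-form observation, no height field at all"},
]

with open("raw_records.jsonl", "w") as out:
    for rec in records:
        out.write(json.dumps(rec) + "\n")    # one self-describing record per line
```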

An ELT or SMaQ cycle typically has the following steps (a rough code sketch follows the list):
1. Initiation (establish requirement, need to know, and memorandums of understanding)
2. Extract (from sources in native format)
3. Store persistently (on a distributed file system in a denormalized fashion)
4. Transform (index, apply logic, check data integrity, create aggregates or disaggregates)
5. Repeat step 4 ad infinitum as you find new ways to get value from the data.
6. Quality Control
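Here is a rough skeleton of that cycle, again in Python and purely illustrative.  The local file stands in for a distributed file system, the transform step would in practice be a distributed job (MapReduce or similar), and every name below is invented.

```python
import json

RAW_STORE = "raw_store.jsonl"   # stand-in for HDFS or another distributed store

def extract_and_load(source_records):
    """Steps 2-3: pull records in their native form and persist them untouched."""
    with open(RAW_STORE, "a") as out:
        for rec in source_records:
            out.write(json.dumps(rec) + "\n")

def transform(transform_fn):
    """Step 4: derive a new view of the raw data without ever modifying it."""
    with open(RAW_STORE) as raw:
        return [transform_fn(json.loads(line)) for line in raw]

# Step 5 is simply calling transform() again with new logic as new uses appear.
extract_and_load([{"id": 1, "height": 72}, {"id": 2, "height": "72.5"}])
heights = transform(lambda rec: rec.get("height"))
```

Because the raw store is only ever appended to, rerunning the transform with new logic never puts earlier data at risk.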

I've listed four fewer steps with this new way of doing things.  Let's talk about why some of the steps are absent.  "Evaluate and Reference" is not needed because we don't need to configure tabular relational data stores to accept strongly typed values.  We don't care.  Because we get rid of that configuration overhead, there's no need to conduct validation on the data.  Poorly formatted data from another system won't break us.  If they accidentally publish text where a number should be, we're fine; our architecture will still store it in whatever form it was published in, and we'll find it during our later checks on data integrity and then decide on a course of action.  Having the system break or ignore poorly formatted data is unacceptable.  Oftentimes poorly formatted data turns out to be an update to the previous publishing standard that provides more or better information.  A simple example would be changing an integer to a decimal to increase the fidelity of a measurement.  If they start publishing decimal values that are more accurate, we want those!  Since we go right to persistent storage, we can skip the staging requirement.  Our persistent storage is distributed, and as a consequence we gain the fault tolerance built into most distributed systems.  This, combined with our ability to crunch massive amounts of data, means that archiving drops off as a needed step as well.  Since we avoid having to validate and stage, there is no need for a separate cleanup or garbage collection step.  I'm sure that in certain cases there are good reasons to add or remove a step, but in general the "store now" and "figure it out later" way of doing things reduces the amount of up-front work for sharing data and allows us to focus on the most important work: extracting value from the data and applying it to our mission.
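A toy example of that deferred integrity check, with invented values: instead of rejecting odd records at ingest, we scan what was actually stored and coerce where we can, flagging only what we genuinely cannot interpret.

```python
def coerce_height(value):
    """Best-effort conversion of a stored 'height' value to a float.
    Returns None when the value cannot be interpreted as a number."""
    try:
        return float(value)   # handles ints, floats, and numeric strings alike
    except (TypeError, ValueError):
        return None           # free text gets flagged for review, not silently dropped

stored_values = [72, 72.5, "72.5", "about six feet"]
for value in stored_values:
    print(repr(value), "->", coerce_height(value))
```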

Back to the DoD.  The Department of Defense Discovery Metadata Specification (DDMS) was created in support of the DoD Net-Centric Data Strategy.  The first standard was released in 2003, with multiple changes per year ever since (and everybody pretended standards would be stable).  It has lofty goals and specifies the attributes (fields) that should be used to describe any data or service made known to the DoD Enterprise.  I've heard rumors of it working great, but I have yet to experience a useful and working implementation when organizations' claims of adoption or increased efficiency are put to the test.  The inability of a strict metadata standard to fix everything is not a statement about the quality of the standard or the competence of the authors.  The metadata approach itself is intrinsically fraught with difficult and sometimes insurmountable problems.  The requirement to standardize metadata amplifies those problems and creates new ones.  Although the frequent changes to the standard have certainly caused problems as time is wasted "catching up" in system implementations, locking the standard or working very hard to fix problems by modifying the standard yet again is not the right answer.  Any approach to comprehensively standardize metadata across heterogeneous systems can't work.  This statement alone can cause heart palpitations and weeping when said in front of the wrong people, so be careful when repeating these ideas.  It's common sense that "standardizing" is a good thing; to go against that wisdom you need to approach its disciples with patience and build your arguments for common-sense engineering over time.  Two years prior to the DDMS standard being published, Cory Doctorow popularized the term "metacrap" with his essay "Metacrap: Putting the torch to seven straw-men of the meta-utopia" (http://www.well.com/~doctorow/metacrap.htm).  What the article lacks in tact it makes up for in eloquence.  The essay discusses the following obstacles to reliable metadata.  I've summarized them here and provided some notes in the context of the US Department of Defense and Intelligence Community.

People are lazy: Populating metadata isn't their core mission, so they leave it blank.
People are stupid: People will still misspell classification markings even when there are compelling reasons to get it right.
Mission Impossible, know thyself:  Ever watch a program fill out a survey about itself?  It’s always glowingly positive.  Weird.  Think people are going to be accurate about the quality of their own data?  Think again.
Schemas are not neutral: People have dedicated entire careers to specifying taxonomies, ontologies, and schemas for various aspects of intelligence and warfare.  When they retire, somebody else starts from scratch with their own idea of how it should be done with their own biases derived from their perfectly valid experiences in a different part of the industry.
Metrics influence results: Moving decimal points around or changing units of notation destroys any hope of automating queries across data sets without manual intervention to correct these mistakes.
There's more than one way to describe something: Reasonable people can disagree forever on how to describe something. Was this document the result of a questioning, an interview, or an interrogation?
People lie: Or exaggerate their claims of accuracy or success.  This happens mostly through unintentional bias, but sometimes intentionally.
Data may become irrelevant in time: The language and the things important to analysis evolve as approaches to the problems change.
Data may not be updated with new insights: Modifying shared data records means that the original information is lost.  Oftentimes pushing updates to the authoritative or original source is not possible.

These obstacles compound with the baseline overhead of learning and addressing the standard before publishing or sharing data, creating insurmountable requirements for sharing data in accordance with it.  This is, of course, a serious problem for smaller programs, but even many larger programs have not been able to get close to functioning implementations.
It's easily understood that our unique programs produce data that differs in quantity, format, quality, and comprehensiveness.  Perhaps less intuitively obvious is that we also all use data differently.  This means that somebody else's standard really just exposes an implementation of that data that nobody else cares about.  A tactical user populating a situational awareness tool has a drastically different use case than an analyst constructing a network diagram.  Each of these uses requires the data to be stored and indexed in a different manner, not just for performance but to make access to it computationally feasible at all.
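A toy illustration of that point, with invented records: the same raw data has to be indexed in two different ways before it is useful to both of those users.

```python
from collections import defaultdict

# Invented records: each report carries a location and the actors it mentions.
raw = [
    {"id": "r1", "grid": "38S MB 456 789", "actors": ["a", "b"]},
    {"id": "r2", "grid": "38S MB 456 790", "actors": ["b", "c"]},
]

# Use 1: situational awareness -- fast lookup of reports by location.
reports_by_location = defaultdict(list)
for rec in raw:
    reports_by_location[rec["grid"]].append(rec["id"])

# Use 2: link analysis -- who appears alongside whom, regardless of location.
links = defaultdict(set)
for rec in raw:
    for actor in rec["actors"]:
        links[actor].update(a for a in rec["actors"] if a != actor)

print(dict(reports_by_location))
print({actor: sorted(others) for actor, others in links.items()})
```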

Does all of this mean that standardization and metadata should be abandoned?  Of course not!  But a cautious level of prudence needs to be exercised.  When planning your next information sharing architecture, try loosening the standardization reins a little up front.  It will reduce wasted O&M time, increase the provenance and quality of your data, and free you up for success later as innovative folks find new ways to derive value from the data in its natural state.

Monday, November 12, 2012

Stories of survival in tough circumstances


Six years ago I read the book "Life of Pi" on the recommendation of my wife's cousin.  It turned out to be a pretty good read.  With the movie coming out on the 21st, I'd recommend picking it up.  I can't say that I'm eager to see the movie, but it's always good to read the book before too many of the advertisements hit or people start talking about it.
http://en.wikipedia.org/wiki/Life_of_Pi

The book got me on a "stranded at sea" or "survival in extreme circumstances" reading kick.  If you're into that kind of thing here's a list of titles I'd recommend.

"Endurance"
http://en.wikipedia.org/wiki/Endurance:_Shackleton%27s_Incredible_Voyage
If you haven't heard the story of Shackleton and his attempted crossing of Antarctica, this book will keep you on edge the whole way through.  There is another book on the topic called "South" that is not as well written (in my opinion) as "Endurance".  This is an epic story that makes the little daily trials of life seem silly in comparison.
http://en.wikipedia.org/wiki/Ernest_Shackleton

"In the Heart of the Sea"
http://en.wikipedia.org/wiki/In_the_Heart_of_the_Sea:_The_Tragedy_of_the_Whaleship_Essex
I read this book just after I read "Endurance" by Alfred Lansing.  Shackleton and his crew were so competent that it made me feel sorry for this crew, which seemed to have poor leadership.  The book has good reviews and it deserves them.  The discussions of their plights while adrift were engrossing.

"Moby Dick"
The book isn't as much about survival as the others here, but the details of whale hunting were fun to read.  It's a literary classic, and even if you had to read it in school I'd still pick it up again if you have the time.  Expect to take a few weeks to get through it, and get the digital version because the paper version is heavy.  The sea adventure ends on a chapter written more as poetry than action, which I thought was disappointing.

"Into the Wild"
http://en.wikipedia.org/wiki/Into_the_Wild_%28book%29
I originally read this back in 2003, but I re-read it during this kick.  There was a movie which came out in 2007.  The book is still worth reading even if you did see the movie because of how well it documents other "lost in the wild" tales while telling McCandless's story. 

Just last year I read "Unbroken" which after reading I'd put into this group of books as well.
http://en.wikipedia.org/wiki/Unbroken:_A_World_War_II_Story_of_Survival,_Resilience,_and_Redemption
Another amazing WW2 story.  Follow Zamperini as he competes in the Olympics in Germany before WW2 breaks out and then undergoes flight training in the South Pacific.  The real meat of the book is when his plane malfunctions and he ends up as a Japanese POW for the remainder of the war.

Speaking of WW2 survival books... "With the Old Breed"
http://en.wikipedia.org/wiki/With_the_Old_Breed
I'm glad I read this book way after I got done with my combat tours.  I thought we had it rough, but WW2 Marines in the Pacific had it way tougher.  War was hell in the truest sense of the word in this book.  Follow Eugene Sledge, who turned the scribbles he wrote in the margins of his Bible during the Pacific campaign into an amazing book.  I later found out that "The Pacific" miniseries used this as one of its reference memoirs, so I watched it.  It wasn't nearly as emotionally stirring as the book.



Thursday, November 8, 2012

The sun revolves around the earth

Sometimes it's useful to remember the mistakes of the past in the context of today, and exactly how tough it is to learn and know something.  Those who thought the sun revolved around the earth were just as smart as we are today.  It was only after careful study of the nuanced motion of the lights dotting the nighttime sky that we were able to work out that some of them were planets and that some exhibited apparent retrograde motion.  (The image to the right is an illustration of apparent retrograde motion.)  Then, having made those observations, we had to create repeatable descriptions of the experiments to make the case to others and spread the knowledge.

The powers of data analysis brought to us by modern computing technology open up a whole new class of problems to new levels of scrutiny.  (The image to the left is an illustration of the four color theorem, proved by computer in 1976.)  Interestingly, there are still philosophical objections to mathematical proofs that depend partially on exhaustive computer-powered computation.  The main argument is that humans can't verify such a proof logically, so we can't trust it as proven; it requires replacing logical deduction with faith in the computer's ability to operate accurately.  I think the concept of black-box advancement of science is interesting.  If you approach it with a willingness to engage in the process, the risk of computer error in computer-assisted proofs can be mitigated by using heterogeneous systems to replicate the proof.  Varieties of architectures and software are something we don't have a shortage of.

Distributed processing and the promise of cloud-based computation (buzzwords aside) bring even more capability and variety to computational proofs.  Even being involved in this advancing field, I haven't seen anybody really engaging with this.  We are barely scratching the surface of what's possible when making computers work together.  The next few years promise to be an exciting time.

Sunday, November 4, 2012

Ruins

This is a follow-up to my previous post in which I mentioned the Orson Scott Card book "Pathfinder".  Last week, on October 31st, the next book in that series came out, called "Ruins".  I'm a few chapters into it and it's even better than the last one.  If you are a sci-fi fan, it's definitely worth picking up.

9 Nov 2012 Update: Finished the book.  Great read.  Annoyed at the ending.  There are going to be more books in the series.  Card is probably typing away at his keyboard on the next one as I'm writing this.