Wednesday, July 10, 2013

How good schemas go bad

I've advocated in the past on this blog for extensible formats such as JSON, XML, and even CSV when exchanging data for analysis. In contrast, I know it's sometimes nice to have data loaded into a database management system. Databases are a convenient way to run generic queries and interact with the data in a responsive way. Because they're convenient and functionally flexible, it seems like a great idea to use that same database management system to handle a cross-organization exchange of data. However, using exported databases for data exchange can cause some unexpected problems. They aren't necessarily problems with the technology alone; most of them arise when human nature and project time constraints collide with exchanging data in those formats.

Here's a scenario: you are a data analyst at Organization A, which has invested money in Oracle or MSSQL or SAS or whatever. There's investment in licenses and in training people to be good with that particular database technology; it has probably even driven hiring practices for a few years. You're required to receive a data set from Organization B, provided as an export from a different database technology. The first thing you'll have to do is import that data, and you'll be doing it with personnel who are not intimately familiar with the native data types or the functionality of the original database. That alone can lead to a quagmire of problems, but here's the real kicker: different database technologies have similar but not identical data types. You'll have to adjust schemas to create the tables that ingest the data. This is usually a smooth process the first time it happens with a data set, but the adjustments always seem to trend toward larger and larger data types. By the second or third sharing of the same information, the slight adjustments made to schemas between databases add up to less storage and access efficiency.
To describe how and why this happens, first a high-level sidebar on how data is stored in a database. To keep access efficient, database management systems will typically ask you to specify how much space each piece of data can potentially take up. That way everything can be stored in-line and accessed quickly, since the data on disk can be traversed in known increments instead of having to determine the size of every single record along the way. It also enables fast calculation of arbitrary locations. Want to get to the middle of a table (the first probe of a binary search)? Take the number of rows, divide by two, multiply by the known cell sizes (thanks to our defined schema), and you know where to start; there's a small sketch of that arithmetic below. There are circumstances in which databases can store arbitrarily large data that isn't confined to a limit defined in the table schema: the CLOB and BLOB data types. Because that data is stored outside the tables, separate from the indexes, it can be as big as it needs to be. But operating on arbitrarily large data in a database is much slower, inefficient when returning multiple rows, and best avoided whenever possible.
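Here's the offset arithmetic as a toy Python sketch. The column sizes and row count are made up, and real engines work in pages with headers, but the idea that a fixed row size turns "find row N" into a single multiplication is the same:

# Toy sketch: why fixed-size rows make random access cheap.
# The layout below is invented for illustration.
ROW_SIZE_BYTES = 4 + 64 + 8   # e.g. an INT, a CHAR(64), and a BIGINT column
NUM_ROWS = 1_000_000

def offset_of_row(row_index):
    """Byte offset of a row when every row occupies a known, fixed size."""
    return row_index * ROW_SIZE_BYTES

# Jumping to the middle of the table (the first probe of a binary search)
# needs no scan over earlier records.
middle = NUM_ROWS // 2
print(f"row {middle} starts at byte {offset_of_row(middle)}")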

Back to the problems associated with sharing data. Let's say the sending database has a varchar data type (short for variable-length character data), and the table being shared declares a varchar cell size larger than the maximum supported by the database you're importing into. The easy and lazy thing to do is to find whatever data type on the new database will fit the largest possible value as defined by the schema. If it's a varchar(4000), the next size up on most databases has you into CLOBs, and that transition is the most problematic one for efficiency. You can see how, through compromises like these, schemas become skewed as data is shared between different database technologies and data types are changed to accommodate someone else's schema.
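To make the "lazy" choice concrete, here's a hedged Python sketch. The type names and size limits are invented for illustration, not any vendor's real catalog; the point is that the decision is driven entirely by the declared size, never by the data itself:

# Hypothetical "lazy" mapping: pick the first target type whose capacity
# is at least the declared size from the incoming schema.
TARGET_TYPES = [
    ("VARCHAR(255)", 255),
    ("VARCHAR(1000)", 1000),
    ("VARCHAR(2000)", 2000),     # pretend this is the target's varchar ceiling
    ("CLOB", float("inf")),      # anything bigger falls off into LOB storage
]

def lazy_map(declared_size):
    """Smallest target type that fits the declared size, without looking at the data."""
    for name, capacity in TARGET_TYPES:
        if declared_size <= capacity:
            return name

print(lazy_map(80))      # VARCHAR(255) -- already wider than the data needs
print(lazy_map(4000))    # CLOB -- the declared size alone forces the worst-case type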

The way to avoid this is to use schema-neutral, extensible data sharing formats. Aside from being schema neutral during transfer, what they really do is require the humans doing the ingest to study the data before loading it into their own database technology. There's no 4,000-character limit assigned by a schema that may or may not reflect the data. If you want to know the maximum record size, you have to look at the records, find the biggest one, and create your tables with that in mind. You're forced to find the optimal storage schema for the data you actually received. In a lot of cases, the schema for the full population data set is dramatically different from the data being shared, since organizations typically don't share entire data stores, just the results of queries that return subsets of the data.
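A minimal sketch of that "study the data first" step in Python, assuming the delivery is a CSV with a header row. The file name and the roughly 20% headroom are my own illustrative choices:

import csv

def observed_max_lengths(path):
    """Longest value actually seen in each column of a delivered CSV."""
    maxima = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for column, value in row.items():
                length = len(value or "")
                if length > maxima.get(column, 0):
                    maxima[column] = length
    return maxima

# Size the target table from observation rather than from the sender's schema.
for column, longest in observed_max_lengths("shared_extract.csv").items():
    # Leave a little headroom, but stay close to what the data actually needs.
    print(f"{column}: longest value is {longest} chars -> try VARCHAR({longest + longest // 5})")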

Other side effects avoided by sharing in extensible formats include the following. Ingest routines for database export files are often brittle processes; it doesn't take much of an error to cause a catastrophic failure that leaves you with no access to any of the data the other organization is trying to share. With extensible formats, if somebody cuts off the transfer halfway through or random storage errors sneak in, it's not a problem: you lose a few records but still have something to work with (there's a small sketch of this below). I can say from first-hand experience, and after reaching out to the 30 or so DBAs I know, that there are no good forensics tools for some proprietary database export files. If the export is corrupt, you're done, unless you want to spend the money reverse engineering the binary file structure. In some cases going back to the originating organization is prohibitively difficult or impossible, and even in the best case your project schedule slips by a few days, which is still a heavy cost to pay. Another problem avoided is the transfer of application logic. With an extensible format it has to be communicated explicitly outside the data store, whether as a human-readable export or an actual human typing it into a document, so there's no assumption that relationships in the data are represented in the data store itself. Database exports rarely carry that logic either, yet people always seem to assume that when you export a database the application logic just goes with it. Again, human error.
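On that first point, here's a rough Python sketch of why a record-per-line format degrades gracefully. It assumes the delivery is newline-delimited JSON, and the file name is a stand-in for whatever actually arrived:

import json

def load_salvageable(path):
    """Keep every record that parses; count the ones that don't."""
    records, bad = [], 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad += 1   # lose this record, keep everything else
    return records, bad

records, bad = load_salvageable("shared_extract.ndjson")
print(f"recovered {len(records)} records, skipped {bad} damaged lines")

A truncated transfer or a few corrupted bytes costs you some rows, not the whole delivery.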


Extensible formats are always the best means of sharing data across organizations with heterogeneous information systems. But if you do receive data in a non-extensible format, don't assume the schema describes the data before you start adapting it to your preferred database technology. Take the time to analyze the data and ingest it in a sensible manner. If you skip this step, you're pushing the problem to the next person and dooming yourself to sub-optimal performance on your own systems.

The Criminal