Sunday, February 10, 2013

Categories of numbers from data


I think some programmers approach numbers in the wrong way when solving data problems. 

Wikipedia hosts a comprehensive list of types of numbers. They are all important, although when implementing code we rarely get to think in those terms. The list of ways computers represent numbers is much shorter, and it's where most programmers spend the majority of their time thinking and working. Here are the big three:
  • Integers
  • Floats
  • Bignum or Arbitrary precision
The distinction between them is in how these numbers are treated in the memory of the computer, and different languages have different implementations. Know that integers and floats have fixed sizes and that using a bignum is less efficient. If you put too big a number into a fixed-size integer it overflows: the extra bits are dropped and the value typically wraps around, causing subtle bugs. Bignum or arbitrary-precision numbers avoid that, but they will slow down how fast your program runs.
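Here's a minimal Python sketch of all three behaviors. Python's built-in int is itself a bignum, so I'm using ctypes to simulate a fixed-size 32-bit integer:

```python
import ctypes

# A 32-bit integer has a fixed size: one past its maximum wraps around.
max_int32 = 2**31 - 1
overflowed = ctypes.c_int32(max_int32 + 1)
print(overflowed.value)          # -2147483648, not 2147483648

# Floats have fixed precision: past 2**53 they can't hold every integer.
print(2.0**53 + 1 == 2.0**53)    # True -- the +1 is silently lost

# Python's built-in int is a bignum: always exact, but slower and larger.
print(2**31)                     # 2147483648, no wraparound
```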

I have an affinity for untyped and weakly typed programming languages because they let me spend more time in a different frame of mind. If you're not familiar with the terms: it doesn't mean they are spoken word or that you hit the keys lighter; it has to do with how variables are declared. In an untyped (more precisely, dynamically typed) language you mostly just put data in the variables you create, and the language runtime figures out most of the details about whether to treat the data as a string, an integer, a float, or whatever. It makes writing code more enjoyable, since only in a small set of circumstances do I ever really care how that stuff is handled anyway. Most of these languages have ways of telling the computer specifically what to do when it matters.
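Python is one language in this style; a small illustration of both halves of that point:

```python
x = 42             # the runtime treats x as an integer
x = "forty-two"    # the same name now holds a string -- no declarations needed

# When it does matter, you can still be explicit:
n = int("42")      # parse text as an integer
f = float("42")    # parse text as a float
print(type(n).__name__, type(f).__name__)   # int float
```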

This post is about that different frame of mind. I want to discuss a way of thinking about numbers that I think is important but rarely see addressed in literature or discussion. It deals with how we need to think about numbers when they appear in data. In my mind, numbers from data fall into three categories:
  • ordinal
  • numeric (including cardinal)
  • nominal (aka categorical)
Statisticians care deeply about these categories. Some of my friends guffaw at how commonsensical dealing with numbers in this manner is, and they are right. You'll find the categories prominently listed in statistics literature, but not so much in computer science or programming books. I think they matter to computer science, but their significance is largely ignored. As we approach more data problems with computers, this categorization becomes more important.

A brief description of each:

Ordinal numbers, although they might not appear in sorted order in the data, can be put into one. What's missing from ordinal numbers is the relative size or degree of the difference. They are most commonly represented by integers, but don't let that fool you: they can be represented by other things as well, in some cases even letters. An example of an ordinal number is the ranking of winners in a race. By looking at first and second place we know who is faster, but from those numbers alone we can't tell how much faster.
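A quick sketch in Python, with made-up race and grade data, showing both the integer and the letter cases:

```python
# Race places are ordinal: they sort, but the gaps between them mean nothing.
finishers = [("Ann", 2), ("Ben", 1), ("Cal", 3)]
finishers.sort(key=lambda pair: pair[1])
print([name for name, place in finishers])   # ['Ben', 'Ann', 'Cal']

# Ordinal values don't have to be integers: letter grades sort too,
# given an explicit ordering.
grade_order = {"A": 0, "B": 1, "C": 2, "D": 3, "F": 4}
print(sorted(["C", "A", "F", "B"], key=grade_order.get))   # ['A', 'B', 'C', 'F']
```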

Numeric numbers are similar to ordinal numbers, but give clues about relative differences. If we arranged a set of objects by weight their position would be ordinal and the weight would be numeric.
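Continuing that weight example with made-up values:

```python
# Weight is numeric: both the ordering and the differences carry meaning.
weights_kg = sorted([2.5, 1.2, 4.0])
print(weights_kg)                      # [1.2, 2.5, 4.0] -- position is ordinal
print(weights_kg[1] - weights_kg[0])   # 1.3 -- the difference itself is meaningful
```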

Categorical numbers simply serve as a unique identifier or as an identifier of some category. A common example is a part number or serial number on a piece of equipment. You can get away with treating these as text in most circumstances; in real-world data, people have a habit of sneaking other characters in when you don't expect them anyway (like dashes). Real-world categorical numbers also tend to be multi-layered. A great example is phone numbers. The first three digits give information about a geographic region; if it's a cellular number from the United States, it was probably the geographic region the subscriber was in around 2005. I like that when phone numbers appear whole in a normalized relational database, they create a micro-instance of a denormalized data structure (don't tell your 1970s-to-late-1990s relational database professor; that was a sin back then). It doesn't really matter given the context they usually appear in, but I always enjoyed that little nuance. In a future post I might discuss denormalized vs. normalized structures and when each is useful. It's probably one of the biggest decisions we make when designing an approach to a data problem.
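A sketch with a made-up (fictional-range) phone number showing why text is the right representation and where the layers hide:

```python
# A phone "number" is categorical: treat it as text, never as arithmetic.
phone = "303-555-0147"            # dashes sneak into real-world data

digits = phone.replace("-", "")   # normalize to bare digits for storage
area_code = digits[:3]            # the multi-layered part: a region code
print(area_code)                  # '303'

# There is no sensible phone + 1, so forcing this into an integer column
# buys nothing and throws away the dashes (and any leading zeros).
```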

I think approaching data problems by looking first at numbers with these attributes in mind is more productive than thinking about whether things will fit in memory. Too often I've seen consternation and wasted time as people try to force categorical numbers into a mechanism designed for integers. There's no need. Thinking about numbers in terms of their ordinal, numeric, and categorical properties makes those warehousing and analysis questions easier.