Thursday, May 21, 2015

i before e except after c

Somebody misspelled the word "weird" in an email to me this morning. Our HBase system was acting weird and they were sending me a note. "Weirding" is a common occurrence with big data software, so this was nothing surprising.

What caught my attention was that "weird" is another one of those words that violates the only spelling rule most people know. We've all heard and memorized that "i comes before e except after c". Weird.

I believe that spell checkers have made us all dumber, since we're able to outsource our thinking without really thinking about it. I've often found myself just hammering away at keys and letting the computer figure out what I was trying to say. The computer is accurate and able to do this, so we've formed sort of a symbiosis in this manner. But as a consequence I've found myself embarrassingly uncertain of myself when handwriting letters or notes with pen and paper. So I've tried to slow down and eschew spell checking systems before I become any more incompetent. Now I'm trying to pay attention to the spelling of words.

So how many words violate this rule?

Here's the Wikipedia page:
http://en.wikipedia.org/wiki/I_before_E_except_after_C

If we scroll down to the Exceptions section we see four violations of the "cie" part of the rule listed. They are all words I'd never use, so that's not helpful and doesn't seem comprehensive. There are no real numbers anywhere in this article to look at. Maybe we can do better.

My next stop was here:
http://www-01.sil.org/linguistics/wordlists/english/
It was just the first page I found that had a list of English words. There are about 100,000 of them in a nice text file.

Grab the file

eric@glamdring:~/workspace/words$ wget http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt

And count some stuff

eric@glamdring:~/workspace/words$ grep ie wordsEn.txt | wc -l
6317
eric@glamdring:~/workspace/words$ grep ei wordsEn.txt | wc -l
1010
eric@glamdring:~/workspace/words$ grep cei wordsEn.txt | wc -l
88
eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | wc -l
322
eric@glamdring:~/workspace/words$
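The same four counts can also be collected in one go. This is only a sketch: the function name count_patterns is made up for illustration, and it counts lines containing each string, which matches what grep piped into wc -l does above.

```shell
# count_patterns FILE PATTERN...
# Prints "PATTERN COUNT" for each pattern, where COUNT is the number of
# lines (words) in FILE that contain PATTERN as a fixed string.
count_patterns() {
    file=$1
    shift
    for pat in "$@"; do
        # -c counts matching lines, -F disables regex interpretation,
        # -e keeps a pattern from being mistaken for an option
        printf '%s %s\n' "$pat" "$(grep -c -F -e "$pat" "$file")"
    done
}
```

Running count_patterns wordsEn.txt ie ei cei cie should print one line per pattern with the same numbers as the four separate commands.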


Hang on a minute... "i before e, except after c". It's strange that there are more occurrences of "cie" (322) than there are of "cei" (88).

A quick look tells us why:
eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | head
abbacies
abbotcies
aberrancies
abeyancies
abortifacient
absorbencies
accuracies
adamancies
adequacies
advocacies

It looks like there are a lot of occurrences of a popular suffix, "ies". A quick trip to the Wikipedia page about suffixes:
http://en.wikipedia.org/wiki/Suffix
So... despite being used 6 times on the Wikipedia page, "ies" isn't listed as a suffix. That's frustrating.

More searching and there's a page about it on Wiktionary:
http://en.wiktionary.org/wiki/-ies

Let's filter those out.

eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | grep -v cies | wc -l
103
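One caveat: grep -v cies drops any word with "cies" anywhere in it, not just at the end. A stricter variant anchors the match to the end of the word. This is a sketch under the assumption that the file may have Windows line endings, so it strips carriage returns first (otherwise "$" would not line up with the last letter). The function name non_plural_cie is hypothetical.

```shell
# non_plural_cie FILE
# Lists words containing "cie", excluding those that END in "cies".
non_plural_cie() {
    # tr -d '\r' removes carriage returns so "$" matches the true end of each word
    tr -d '\r' < "$1" | grep 'cie' | grep -v 'cies$'
}
```

Piping the result into wc -l gives a count to compare against the one above. The two only differ if the list has words with "cies" in the middle.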

Not bad. That's a small enough list to take a look at. But I have a hunch "science" will show up a bunch of times, since that's one of the exceptions I remember. And hey, we're not being very scientific anyway, so let's get rid of that too.

eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | grep -v cies | grep -v science | grep -v scientific | wc -l
85

Not too many. Here's what's left.

eric@glamdring:~/workspace/words$ dos2unix wordsEn.txt
dos2unix: converting file wordsEn.txt to Unix format ...
eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | grep -v cies | grep -v science | grep -v scientific | tr '\n' ' '
abortifacient ancien anciens ancient ancienter ancientest anciently ancientness ancients bioscientist boccie bouncier calefacient chancier coefficient coefficients concierge concierges conscientious conscientiously conscientiousness deficiency deficient deficiently delirifacient dicier efficiency efficient efficiently facie fancied fancier fanciers financier financiers fleecier flouncier geoscientist geoscientists glacier glaciered glaciers hacienda haciendas icier inefficiency inefficient inefficiently insufficiency insufficient insufficiently intersocietal jouncier juicier lacier lanciers liquefacient mincier nescient nescients objicient omniscient omnisciently overconscientious prescient pricier proficiency proficient proficiently racier saucier scientist scientistic scientists societal societies society specie spicier stupefacient sufficiency sufficient sufficiently unconscientious unconscientiously
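The chain of grep -v calls that produced this list can be collapsed into a single grep with several -e patterns. A minimal sketch, with the made-up function name cie_exceptions. It keeps the same three exclusions as above.

```shell
# cie_exceptions FILE
# Lists words containing "cie" after dropping anything that contains
# "cies", "science", or "scientific".
cie_exceptions() {
    # With -v, a line is dropped if it matches ANY of the -e patterns
    grep 'cie' "$1" | grep -v -e 'cies' -e 'science' -e 'scientific'
}
```

Words like "scientist" survive the filter because they contain neither "science" nor "scientific", which is why a few of them are still in the list.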

Notice I had to use dos2unix. Windows and a few other programs end each line with a carriage return plus a newline instead of just a newline, which makes a lot of transforms involving newlines not work. In this case, changing newlines into spaces would have left a stray carriage return after every word, so I had to convert the file first.
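If you'd rather not rewrite the file in place, the carriage returns can be dropped on the fly with tr. This is a sketch that assumes the only problem is carriage returns at the ends of lines. The function name words_on_one_line is made up for illustration.

```shell
# words_on_one_line FILE
# Prints the words of FILE on a single line, separated by spaces,
# whether the file uses Unix (LF) or Windows (CRLF) line endings.
words_on_one_line() {
    # First delete every carriage return, then turn newlines into spaces
    tr -d '\r' < "$1" | tr '\n' ' '
}
```

The original file is left untouched, so this works on a read-only copy too.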

But back to weird... what's the deal with that category of exceptions?

Actually, nope. Time's up. I'm done with my coffee and about to walk out to go to work, so that's where this post ends.