Thursday, May 21, 2015

i before e except after c

Somebody misspelled the word "weird" in an email to me this morning. Our HBase system was acting weird and they were sending me a note. "Weirding" is a common occurrence with big data software, so this was nothing surprising.

What caught my attention was that "weird" is another one of those words that violates the only spelling rule most people know. We've all heard and memorized that "i comes before e except after c". Weird.

I believe that spell checkers have made us all dumber since we're able to outsource our thinking without really thinking about it. I've often found myself just hammering away at keys and letting the computer figure out what I was trying to say. The computer is accurate and able to do this, so we've formed a sort of symbiosis in this manner. But as a consequence I've found myself embarrassingly uncertain of myself when handwriting letters or notes with pen and paper. So I've tried to slow down and eschew spell checking systems before I become any more incompetent. Now I'm trying to pay attention to the spelling of words.

So how many words violate this rule?

Here's the wikipedia page:
http://en.wikipedia.org/wiki/I_before_E_except_after_C

If we scroll down to the Exceptions section we see four violations of the "cie" part of the rule listed. They are all words I'd never use, so that's not helpful and doesn't seem comprehensive. There are no real numbers anywhere in this article to look at. Maybe we can do better.

My next stop was here:
http://www-01.sil.org/linguistics/wordlists/english/
It was just the first page I found that had a list of English words. There are about 100,000 of them in a nice text file.

Grab the file

eric@glamdring:~/workspace/words$ wget http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt

And count some stuff

eric@glamdring:~/workspace/words$ grep ie wordsEn.txt | wc -l
6317
eric@glamdring:~/workspace/words$ grep ei wordsEn.txt | wc -l
1010
eric@glamdring:~/workspace/words$ grep cei wordsEn.txt | wc -l
88
eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | wc -l
322
eric@glamdring:~/workspace/words$


Hang on a minute... "i before e, except after c". It's strange that there are more occurrences of "cie" (322) than there are of "cei" (88).
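The same kind of counting can be sketched in Python. This is only an illustration: SAMPLE_WORDS is a made-up handful of words standing in for wordsEn.txt, and count_containing is a hypothetical helper that mimics what "grep pattern | wc -l" does on a one-word-per-line file.

```python
# Sketch: count how many words contain each letter pattern.
# SAMPLE_WORDS is a made-up stand-in for wordsEn.txt, not the real list.
SAMPLE_WORDS = ["weird", "receive", "science", "accuracies",
                "believe", "ceiling", "friend"]


def count_containing(words, pattern):
    """Number of words containing pattern, like `grep pattern | wc -l`."""
    return sum(1 for word in words if pattern in word)


counts = {p: count_containing(SAMPLE_WORDS, p) for p in ("ie", "ei", "cei", "cie")}
print(counts)
```

To run it against the real file, the sample list could be swapped for the lines read from wordsEn.txt.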

A quick look tells us why:
eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | head
abbacies
abbotcies
aberrancies
abeyancies
abortifacient
absorbencies
accuracies
adamancies
adequacies
advocacies

It looks like there are a lot of occurrences of a popular suffix "ies". A quick trip to the wikipedia page about suffixes.
http://en.wikipedia.org/wiki/Suffix
So... despite being used 6 times on the Wikipedia page, "ies" isn't listed as a suffix. That's frustrating.

More searching and there's a page about it on wiktionary:
http://en.wiktionary.org/wiki/-ies

Let's filter those out.

eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | grep -v cies | wc -l
103

Not bad. That's a small enough list to take a look at. But I have a hunch "science" will show up a bunch of times, since that's one of the exceptions I remember. And hey, we're not being very scientific anyway, so let's get rid of that too.

eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | grep -v cies | grep -v science | grep -v scientific | wc -l
85

Not too many. Here's what's left.

eric@glamdring:~/workspace/words$ dos2unix wordsEn.txt
dos2unix: converting file wordsEn.txt to Unix format ...
eric@glamdring:~/workspace/words$ grep cie wordsEn.txt | grep -v cies | grep -v science | grep -v scientific | tr '\n' ' '
abortifacient ancien anciens ancient ancienter ancientest anciently ancientness ancients bioscientist boccie bouncier calefacient chancier coefficient coefficients concierge concierges conscientious conscientiously conscientiousness deficiency deficient deficiently delirifacient dicier efficiency efficient efficiently facie fancied fancier fanciers financier financiers fleecier flouncier geoscientist geoscientists glacier glaciered glaciers hacienda haciendas icier inefficiency inefficient inefficiently insufficiency insufficient insufficiently intersocietal jouncier juicier lacier lanciers liquefacient mincier nescient nescients objicient omniscient omnisciently overconscientious prescient pricier proficiency proficient proficiently racier saucier scientist scientistic scientists societal societies society specie spicier stupefacient sufficiency sufficient sufficiently unconscientious unconscientiously

Notice I had to use dos2unix. Windows and a few other programs end lines with a carriage return plus a newline (CRLF) instead of a bare newline, which breaks a lot of transforms involving newlines. In this case I had to convert the file so I could change newlines into spaces.
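Here is a small Python sketch of what goes wrong, assuming the file really does use CRLF line endings as the dos2unix step suggests. The three-word string is made up for illustration; it is not read from wordsEn.txt.

```python
# Sketch: why CRLF line endings break a newline-to-space transform.
raw = "ancient\r\nglacier\r\nsociety\r\n"  # made-up sample with Windows (CRLF) endings

# Replacing only "\n" (what tr '\n' ' ' does) leaves a stray "\r" after each word.
naive = raw.replace("\n", " ")

# str.splitlines() treats both "\n" and "\r\n" as line breaks, so the words come out clean.
clean = " ".join(raw.splitlines())

print(repr(naive))
print(repr(clean))
```

The stray carriage returns in the naive version are what make the terminal output look scrambled when the words are printed on one line.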

But back to weird... what's the deal with that category of exceptions?

Actually, nope. Time's up. I'm done with my coffee and about to walk out to go to work, so that's where this post ends.




Saturday, February 28, 2015

Ubuntu Dual Boot on Macbook Pro

This morning I dual booted a new Macbook Pro with OSX Yosemite and Ubuntu 14.04. These are my notes. I just stepped through this a few minutes ago on a brand new computer; it worked. YMMV

Prerequisites:
Ubuntu 14.04 AMD64 bootable USB drive.
I created mine using the Startup Disk Creator tool included with Ubuntu after downloading the latest .iso from ubuntu.com.

Step 1: Install rEFInd. 

Link here: http://www.rodsbooks.com/refind/
Make sure it persists across reboots before moving on to next step.

Step 2: Free up some space on the hard drive. 

With Yosemite I had to revert from a logical volume to a normal partition in order to be able to resize the OSX partition. On the command line run this.

diskutil cs list

You'll see your corestorage devices listed. One entry, the last one to print, will have this tag or something like it: "Revertible:  Yes"

Take the UUID from this entry and revert it by running a command like this:

diskutil cs revert 7BF42B7B-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Shut down completely and then power on the computer. Not sure why this is necessary, but it is.

You can now use the OSX disk tool to resize the partition by dragging the corner. Leave the rest of the space empty.

Step 3: Install Ubuntu

Plug in the USB stick. When you shut down and restart, rEFInd will list it as an option to boot into.

Step through the install screens as normal and choose to "do something else" during the disk selection step.

Add two partitions, one ext4 for root (/) and one as swap.

This is important: make sure to install the bootloader on the same partition you chose for root (/). This is available as a dropdown menu option on the partitioning screen.

Finish the install. The next time you reboot you'll see the OSX logo and a Penguin in your rEFInd boot menu. That's it.

Other notes


To get the computer to sleep properly, install the proprietary gpu drivers. This is a one click install under the "additional drivers" tool.

To right click, use two fingers while clicking.

I've found it useful to disable tap-to-click on the touchpad. This checkbox is in the mouse settings tool.



Wednesday, February 25, 2015

Praise, punishment, and regression to the "mean"




Using punishment as an incentive is wrong-headed.

First, let me put into context a few life events that have shaped my thinking on leadership and training. I enlisted in the United States Marine Corps when I was 17 years old. Maybe I'll go into details on why in a later post, but a few months after my 18th birthday I was on a bus to Parris Island, South Carolina. They organize transportation so that the bus arrives on the island at 3 a.m., then all hell breaks loose for the next 12 weeks; this is known as Marine Corps Boot Camp. A few years later, I was selected for a scholarship and ended up going to a private military school, Norwich University. That in itself is another story for a different time (my friends reading this who have heard the story are laughing right now). The first-year attendees of Norwich go through an extended boot camp known as Rookdom. A few years later I found myself in Quantico, Virginia attending USMC Officer Candidate School, which followed an 8-month training program known as "Bulldog"; yeah, that was fun. A year after that, I spent six months at USMC The Basic School, which effectively is another boot camp, but with more emphasis on peer leadership.

Few would dispute that I've had more than my fair share of training that has traditionally made extensive use of punishment as an incentive. Based on my assertion in the first sentence of this post you might think that I had a tough time in these schools. But the truth is that I didn't; actually I excelled and enjoyed all the hard training. My first year in the Marine Corps I did so well that I earned two meritorious promotions. Those successes led to my selection for the officer candidate programs that eventually gave me the opportunity to be commissioned as a Marine Corps Officer and spend five amazing years leading Marines. You'd think having had such great experiences with punishment as incentive that I'd be all for it and giving everyone I work with a healthy dose of Drill Instructor type "constructive criticism".
Those that work with me will tell you that nothing could be further from the truth.    

Steven Pressfield's excellent book "The War of Art" has a few sentences about the Marine Corps that I enjoyed.
    "There's a myth that Marine training turns baby-faced recruits into bloodthirsty killers. Trust me, the Marine Corps is not that efficient. What it does teach, however, is a lot more useful. The Marine Corps teaches you how to be miserable. This is invaluable for an artist. Marines love to be miserable. Marines derive a perverse satisfaction in having colder chow, crapper equipment, and higher casualty rates than any outfit of dog faces, swab jockeys, or fly boys, all of whom they despise. Why? Because these candy-asses don't know how to be miserable."

This puts all of the pain and abuse of "corrective" training in the Marine Corps into context in my mind. Sure, part of it is an incentive to drive yourself into the ground in the pursuit of excellence. But even those that did well still caught the abuse. The tradition of negative training in military contexts exists to breed mental and physical toughness through adversity. Having experienced it all first-hand I can assure you it does not carry over to other environments.

I was walking through a restaurant several weeks ago and in a side corridor noticed a manager berating one of the waitresses over some minor offense. I don't even know what it was, but they stopped after they saw me walk up. Of course I didn't say anything; it was only a minor discussion and none of my business. But I wish I could have had a conversation with the manager and explained to them that punishment only breeds contempt, not better performance. The evidence for this is so overwhelming that you'd think anyone in a leadership position would see it as obvious. But there's an illusion at play in these scenarios that makes unobservant leaders think that punishment works.

Let's say I'm a new leader trying to figure out how to do performance management. The two tools I'm learning to use are negative and positive reinforcement, and I employ them both. The first thing I try (because I'm a good person and I want to praise people) is praising someone for doing well. They met a deadline, mopped a floor, sold something, or wrote exceptional code. Great! Happiness and celebration all around. But I notice that the person I praised usually doesn't do as well the next time. The next deadline might be missed, they missed a spot mopping, missed a sale, or the code has a bug. What gives? OK, next I'll try negative reinforcement. So somebody underperforms and I let them have it. I give them bad performance appraisals, mention their failure, give them demerits, counsellings, or whatever. Over time, I start to notice that more often than not, when I've punished their poor performance they seem to get better! I'm not making this up! In my experience this is exactly how things work. So why wouldn't I advocate for using punishment to increase performance? Because all of this is just an illusion.


Think of performance as random and happening across a range of potential outcomes. Like a stock chart, an individual's performance at a task goes up and down. The overall trend might be up, or it might be down, and this is the real thing to watch. But look at the chart and see where we would notionally apply positive or negative reinforcement. If we apply positive reinforcement at the peaks, statistically speaking the next thing that happens is going to be less stellar than what we praised. If we apply negative reinforcement at the troughs, statistically the next task is going to be better. This happens in both scenarios regardless of whether our intervention had any impact at all, leading to the illusion that negative reinforcement increases performance. It's a bias that needs to be overcome. In statistics and sociology this concept is called regression to the mean, and it's a well-studied phenomenon which can be summarized as such: "if a variable is extreme on its first measurement, it will tend to be closer to the average on its second measurement".
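A quick simulation makes the illusion visible. This is a minimal sketch under a deliberately extreme assumption: every task's score is an independent random draw, so praise and punishment cannot have any effect at all. The cutoffs of plus and minus 1.0 for what counts as a peak or a trough are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Assumption: performance on each task is an independent draw from the same
# distribution, so praise and punishment have no effect whatsoever.
scores = [random.gauss(0, 1) for _ in range(100_000)]

after_peak = []    # change in score following an unusually good task
after_trough = []  # change in score following an unusually bad task
for today, tomorrow in zip(scores, scores[1:]):
    if today > 1.0:       # a "peak" we would have praised
        after_peak.append(tomorrow - today)
    elif today < -1.0:    # a "trough" we would have punished
        after_trough.append(tomorrow - today)

mean_after_peak = sum(after_peak) / len(after_peak)
mean_after_trough = sum(after_trough) / len(after_trough)
print(f"average change after a peak:   {mean_after_peak:+.2f}")
print(f"average change after a trough: {mean_after_trough:+.2f}")
```

Even though nothing in the simulation responds to feedback, the average change after a peak comes out negative and the average change after a trough comes out positive, which is exactly the pattern that makes punishment look effective.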

So the next time you find yourself intuitively thinking that "punishment works", take a step back and remind yourself of this illusion. It really does pay to be good to everyone, even though at a superficial level it doesn't seem to work as well.



Wednesday, February 18, 2015

Small fans suck


It's 3am on a Monday morning. My son Colton was up because he hasn't been feeling well. My amazing wife managed to get him back to sleep after some tender loving care; seriously, she is awesome. But I'm up for the day, so I wandered downstairs, made some coffee, and got on my computer to get some work done. As soon as I sat down, the serene silence of my morning started being cut to shreds by the buzzing of a fan coming from my computer. It had started several days ago but I hadn't had the time to fix it. Next thing I know, I'm shaving a yak.

First some background before I get into detail about how I screwed up and what I did to fix it. A while ago I spent several weeks being obsessive about the noises coming from my computer. It took a surprising amount of effort to get everything down to a tolerable level. I eventually got it so quiet that I moved my cpu tower onto my desk next to my monitors (and out of the reach of toddler fingers).

For those of you wondering, here's my recipe for a quiet PC. First, get rid of any small fans in your case. Replace them with bigger fans, then run those larger fans at a slower RPM. Fan speeds can be tuned in your computer's BIOS, so hit that F2 (or whatever) button on boot and get in there to tune them down. Run them as slow as you can. Your computer will shut itself off if it gets too hot, so don't worry about it. Computers are not people; they have a different temperature range they can run safely at. If you absolutely think your computer is too hot, add more big fans running at the slower RPM. After a lot of research and chatting with friends, I settled on the Cougar quiet fan listed below. It has a good blade design and rubber grommets where the screws connect it to the case.

Next, get rid of the default CPU heat sink and fan. I went with liquid cooling at first, but since you have to run a fan over the radiator this doesn't have any benefit over just getting a larger heat sink. After I had one liquid cooling unit stop working I eventually settled on just a big heat sink to replace it, which was also way cheaper. I replaced all fans with the large Cougar fan I mentioned above; the default fans on everything were noticeably louder. I like passive heat sinks that allow me to move air parallel to the board. This allows me to arrange the fans to move air through the case in a coherent way rather than creating eddies. Here are links to the hardware I'm talking about:





I had it good for a while with this setup, then I messed it up; here's how. This is a picture of the front of my computer for context. Sorry for the low quality, it's dim in my office right now.



I've had this case for over ten years. It's an Antec 1080 and it's been great. I splurged at the time, but it ended up being worth it. After a few years of strictly console gaming (7ish) I've started using my PC to play games again. This is mostly because my PC is in my office and I'm not comfortable playing most video games in front of my children yet. The content of most modern video games is perfect fodder for children's nightmares and I don't want anything to do with that. Now, I run Linux for most everything, but occasionally there's a game not available and I have to use Windows. For a long time I just partitioned my drives and dual booted to accomplish this. But this solution is rife with problems that I don't want to deal with. So, I bought an extra SSD and a 4 bay hot swap device. It's the thing in the picture with the four red clips. I only have one of the drive bays actually connected to the computer (top left); the other three are just holding slots. With this setup I can have four different operating systems, and when I want to switch I just shut down the computer, move the drive I want to use to the active bay, and power up. It's been super convenient, so much so that I bought a larger hot swap bay to use larger drives. The larger hot swap bay is second from the top, just under the optical drive. With the larger bay I went with a trayless one. This way I can grab any old SATA hard drive and just shove it in there without having to worry about screws or buying extra bays.



The only downside of these devices is, you guessed it, they come with integrated tiny fans. When first installed, the fans ran quiet enough and things were alright. But only a few months later the little fan turned into a buzzing, whining nuisance. Never trust a small fan. Thirty minutes with a screwdriver and I liberated the nuisance from my computer case. Here's a picture of the scoundrel.


My computer is back to its normal quiet self. Thanks to my big fans pushing air around my case I don't expect to have any drive heating issues. If I do I'll post a follow-up here. Yak = shaved. Now, I'm back to work.











Sunday, February 8, 2015

Internet knowledge


The exchange of knowledge happening via the internet is making everything faster and better. Even old skills like knife making.

"Much of this energy is relatively new. “When I first got into this business, in 1968, I had a hard time finding fifteen knifemakers from Alaska to Florida,” A.G. Russell, the ascot-wearing don of the modern knife market, told me. “I’ve got three thousand in my computer file now.” Nearly everyone credits much of this explosion to the Internet, which not only has made heretofore obscure items suddenly accessible, but also has spread knowledge about the craft behind these items to a younger generation. “The guys just starting out today, their knives are as good as the best makers’ fifteen to twenty years ago,” Steve Shackleford, Blade’s longtime editor, told me. "
http://craftsmanship.net/the-kitchen-bladesmith/

Tuesday, February 3, 2015

Big data benchmarks and High Performance Computing protocols

I was asked today about benchmarking between high performance computing (HPC) and big data systems. In context, the discussion was about different software architectures deployed across commodity hardware, not the specialized systems you see on the top500 list. To clarify: supercomputers are something you buy, high performance computing (HPC) is something you do. Don't get them mixed up. Not very many people play in both areas, and the few that do don't seem to think very empirically about both, so there's not much concrete information to go by. This makes the question a difficult one, but this was my best shot at answering it at the moment.

My immediate response was that the lighter weight, less fault tolerant, HPC architectures will always be faster given the same number of resources, but they are of limited use for big data. Benchmarks comparing big data and HPC architectures to each other don't exist in a meaningful way because the benchmark algorithms used are different. Big data's most prominent benchmarks are sorting problems which operate on commonly available data. HPC's most prominent benchmarks are algorithmic, for example Linpack TPP, which solves a dense system of linear equations (i.e. math stuff with no data).

In my own experience, heterogeneous compute environments (aka GPUs) have typically used High Performance Computing protocols, like Message Passing Interface (MPI) based applications. This is because the jobs they are used for don't take a lot of time (right now at least). HPC protocols like MPI are notoriously bad at fault tolerance, so software or hardware failures cause the entire job to be restarted. We use HPC protocols because there's lots of legacy code written for them and it keeps the jobs fast; starting over isn't much of a big deal because the jobs typically run so fast anyway.

Restarts on the big data jobs are more of a problem because the jobs often run for much longer, meaning lots of wasted time if you have to go back to the beginning. A compute job running over 10,000 compute hours is almost certain to have a hardware failure. Software failures are much more common (locking conditions, out of memory, segmentation faults, bad data, a million other ways to fail).  Because of this, big data software is more fault tolerant. This makes it slower, but able to do those long running jobs on big data. Big data software has been getting drastically faster, but at the cost of using more computers simultaneously to attain that speed.
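To put a rough number on "almost certain", here is a back-of-the-envelope sketch. It assumes failures arrive at a constant rate (an exponential model), and the mean time between failures of 2,000 compute hours is an illustrative guess, not a measured figure for any real cluster.

```python
import math


def p_at_least_one_failure(compute_hours, mtbf_hours):
    """Chance of one or more failures, assuming failures arrive at a constant
    rate (exponential model) with the given mean time between failures."""
    return 1.0 - math.exp(-compute_hours / mtbf_hours)


# ASSUMED_MTBF_HOURS is an illustrative guess, not a measured figure.
ASSUMED_MTBF_HOURS = 2_000
for hours in (10, 100, 1_000, 10_000):
    p = p_at_least_one_failure(hours, ASSUMED_MTBF_HOURS)
    print(f"{hours:>6} compute hours: {p:.1%} chance of at least one failure")
```

Under these assumptions a short job almost never sees a failure while a 10,000 hour job almost always does, which is the trade-off that pushes long-running big data software toward fault tolerance.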

Lots of people have attempted to benchmark performance across the different systems (which are myriad in both of these categories I've mentioned here). But none of the benchmark standards have seen very wide adoption. This is because they are like the systems they're trying to benchmark: new, with plenty of room to get better. The fundamentals keep changing. In my mind this speaks to how new all of this stuff really still is. Everyone is still trying to find the best way to do things and making massive progress. It's an exciting time to be working with these technologies.

Some good benchmarks and graphics comparing the various platforms:
https://amplab.cs.berkeley.edu/benchmark/
http://www.bigdatatop100.org/
http://icl.cs.utk.edu/hpcc/

Wednesday, January 28, 2015

Strange hammer


When I first saw these parts I thought they were junk.


Then I figured out they slipped together to make a hammer. This was kind of interesting.


The outside of the head seems to be hand forged and heavy steel, but the face of the hammer was just soft wood and has been beaten in over the years. I wasn't sure why anyone would make such a thing or what they were hitting with it.


Then it dawned on me: this is the perfect hammer to hit the back of a chisel with. The heavy metal provides mass and the soft wood in a concave shape is gentle on the back of the chisel and won't deflect. Whoever made this was a genius.



I fixed the cracked handle by pushing wood glue into the cracks, putting some packing tape on a clamp and holding it all together until it dried. It ended up working perfectly. The handle is back to its original strength and is now usable with its perfect fit and original patina.


The rust on the head had to go. A die grinder with some Scotch-Brite made short work of the rust and shined up the metal.


I coated everything with a bit of boiled linseed oil so it won't rust again. I've found the trick to applying boiled linseed oil is to apply generously but then do your best to wipe it off with paper towels. If you don't get aggressive wiping it off you'll end up with a sticky mess on the metal when all you really need is a very fine layer to protect it.




Friday, January 23, 2015

Velociraptor

New project.

Velociraptor means "swift grab" and this software is a parallel tool that uses your AWS EC2 account to swiftly fetch lots of web content.

https://github.com/ericwhyne/Velociraptor

Velociraptor takes a list of URLs and launches parallel AWS EC2 instances which then run wget in parallel to fetch the HTML and images into Web Archive .warc files and store them in an S3 Bucket. Parallelism is managed by GNU Parallel and AWS is managed by AWS CLI. This project is not associated with AWS and may use other cloud service providers in the future.
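The general idea of spreading a URL list across several instances can be sketched as below. This is not Velociraptor's actual code: shard_urls is a hypothetical helper, the example.com URLs are made up, and round-robin is just one simple way to divide the work.

```python
def shard_urls(urls, instance_count):
    """Deal URLs round-robin into one list per worker instance."""
    shards = [[] for _ in range(instance_count)]
    for index, url in enumerate(urls):
        shards[index % instance_count].append(url)
    return shards


# Made-up URLs for illustration only.
urls = [f"http://example.com/page{n}" for n in range(7)]
shards = shard_urls(urls, 3)
for worker, shard in enumerate(shards):
    print(worker, shard)
```

Each shard could then be handed to one instance, which would fetch its own URLs in parallel.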



Thursday, January 22, 2015

Data Science tactical and strategic

Selling data science at the tactical level is a matter of patience: you have to convince people to pause and use data before making tactical decisions.

Selling data science at the strategic level is a matter of trust. Often lots of thought and time go into strategic decisions; you have to convince people the analysis of the data is correct.

Tuesday, January 20, 2015

Shoe rack

I make a habit of snatching the pallets when we order servers rather than letting them be thrown in the trash. Colleagues at work have taken note of this and started setting them aside for me, so I now have more than I could ever use. This is fine, my wife and I sometimes like rustic looking things which also seem to be in style right now. What doesn't get turned into projects makes great wood for the firepit in the back yard.

This weekend I made a shoe rack for the bottom of our closet.

Here's a picture of the original problem, unsorted shoes.



After pulling nails and other metal out of the pallets, I cut them to shape on the table saw. Then I glued the pieces and used my pneumatic brad nailer to stick them together like this.



I didn't use any finish on them; I just left them as raw wood. The bottom of the closet is a stable and dry environment, so the plain wood will literally last for centuries like this.

I screwed the runners to the inside of the closet wall and then just set the shelves on them. This way, if needed, we can just lift the shelves out without having to undo any screws.

Here's a picture of the finished project.