Wednesday, November 20, 2013

Software is getting better!

A friend made a joke about May's Law today. David May states his law as follows: “Software efficiency halves every 18 months, compensating for Moore’s Law.” Larry Page has made similar statements, of course referred to as Page's Law. Sure, it's profound, funny, and good marketing to assert that software always seems to run slower. But the truth of software efficiency writ large is the opposite. Here's a report that addresses the assertion directly.


Kurzweil's book "How to Create a Mind" tipped me off to this report, and specifically to this section on page 71.
Progress in Algorithms Beats Moore’s Law
Everyone knows Moore’s Law – a prediction made in 1965 by Intel co-founder Gordon Moore that the density of transistors in integrated circuits would continue to double every 1 to 2 years. Fewer people appreciate the extraordinary innovation that is needed to translate increased transistor density into improved system performance. This effort requires new approaches to integrated circuit design, and new supporting design tools, that allow the design of integrated circuits with hundreds of millions or even billions of transistors, compared to the tens of thousands that were the norm 30 years ago. It requires new processor architectures that take advantage of these transistors, and new system architectures that take advantage of these processors. It requires new approaches for the system software, programming languages, and applications that run on top of this hardware. All of this is the work of computer scientists and computer engineers.

Even more remarkable – and even less widely understood – is that in many areas, performance gains due to improvements in algorithms have vastly exceeded even the dramatic performance gains due to increased processor speed.

The algorithms that we use today for speech recognition, for natural language translation, for chess playing, for logistics planning, have evolved remarkably in the past decade. It’s difficult to quantify the improvement, though, because it is as much in the realm of quality as of execution time.

In the field of numerical algorithms, however, the improvement can be quantified. Here is just one example, provided by Professor Martin Grötschel of Konrad-Zuse-Zentrum für Informationstechnik Berlin. Grötschel, an expert in optimization, observes that a benchmark production planning model solved using linear programming would have taken 82 years to solve in 1988, using the computers and the linear programming algorithms of the day. Fifteen years later – in 2003 – this same model could be solved in roughly 1 minute, an improvement by a factor of roughly 43 million. Of this, a factor of roughly 1,000 was due to increased processor speed, whereas a factor of roughly 43,000 was due to improvements in algorithms! Grötschel also cites an algorithmic improvement of roughly 30,000 for mixed integer programming between 1991 and 2008.

The design and analysis of algorithms, and the study of the inherent computational complexity of problems, are fundamental subfields of computer science.
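The arithmetic in the quote is worth checking, since it's the heart of the claim: hardware and algorithmic gains multiply rather than add. A quick sanity check in Python:

```python
# Figures taken directly from the quote above.
hardware_speedup = 1000     # processor speed, 1988 -> 2003
algorithm_speedup = 43000   # linear programming algorithms, same period

total = hardware_speedup * algorithm_speedup
print(total)  # the "factor of roughly 43 million"

# Cross-check against the 82-year figure: 82 years divided by the
# combined speedup should land near the quoted "roughly 1 minute".
minutes_1988 = 82 * 365.25 * 24 * 60
print(round(minutes_1988 / total, 2))
```

The two independent figures agree nicely: 82 years shrunk by a factor of 43 million is almost exactly one minute.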
The report goes into more detail discussing research priorities. Those in an argumentative mood might point out that algorithms are just a subset of software and that overall software efficiency has been decreasing. I've made this point before when writing and talking about big data technologies. We've certainly seen a change in the way data technologies are developed: we've gone from cleverly conserving computing resources to squandering them creatively. But it's not as if we're doing the same thing, just less efficiently; whole new capabilities have been opened up. The data redundancy of the Hadoop Distributed File System (HDFS) means we can process larger sets of data and overcome the hardware failures that are inevitable on large "compute hour" jobs. When you're employing thousands of disks or cores in a job, the chances of an individual failure increase. The inefficiency is a risk mitigation strategy: storing the same data three times (by default) certainly isn't efficient, but it makes very large jobs possible.
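For the curious, that three-copies default is a single HDFS knob; a sketch of the relevant hdfs-site.xml fragment (3 is the stock default, so you'd only set this to change it):

```xml
<!-- hdfs-site.xml: each block is stored on this many DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```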

The data processing technology improvements are just one example; there are many like this across the board. Remember XML? JSON is definitely more efficient. Machine learning implementations are getting faster. As a counter to the improvements, more people are attempting to misuse your personal computing resources, which may make things seem sluggish if you're incautious. If you're wondering why your Windows operating system seems slower, that's a whole different story. I think that in closed code bases there might be more of an incentive to hang on to old things, leading to inefficiency; but that's just conjecture based on personal experience; read more about my experiences on Data Munge. Maybe somebody will do a study on closed vs open execution speed over time. Until then, I'll hold suspect any piece of software where a large community of developers can't look at the code base and improve it. There's a law for that too.

Saturday, November 16, 2013

Anonymous Web Scraping

Web scraping is the act of retrieving data from a web page with automated software. With a few lines of Perl or Python you can gather massive amounts of data, or write scripts that keep an eye on web pages and alert you to changes.
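As a sketch of the alert-on-changes idea (the HTTP fetch is left out so only the comparison logic shows; the HTML snippets are made-up stand-ins for two successive fetches):

```python
import hashlib

def fingerprint(body):
    """Hash a page body so successive fetches can be compared cheaply."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def has_changed(baseline, body):
    """True if the page body differs from the fingerprinted baseline."""
    return fingerprint(body) != baseline

baseline = fingerprint("<html>price: $10</html>")
print(has_changed(baseline, "<html>price: $10</html>"))  # -> False
print(has_changed(baseline, "<html>price: $12</html>"))  # -> True
```

In a real monitor you'd fetch the page on a timer, compare against the stored fingerprint, and fire off an email when it flips.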

There are a few reasons for not wanting this traffic to look like it's coming from your computer. Sometimes web admins or intrusion prevention systems will label automated traffic as malicious and respond by blocking your IP address. Other times, such as in system monitoring, you just want to see what happens when you access certain pages from elsewhere on the Internet. Scraping, especially anonymously, is certainly something that can be abused. But I think there are enough good reasons to do it that I'm comfortable writing about it.

The oldest trick I know for remotely scraping web pages is to purchase or own a shell account. These things still exist, mostly for IRC bots. Here's a list of 92 shell providers with locations in 13 countries. Somewhere along the way, IaaS providers such as Amazon Web Services and Rackspace made it easy to provision remote machines in a selected availability zone. They are as convenient as buying a shell, but more powerful since you have root access to your own operating system. Some companies have made it their business to help people crawl the web and gather data. Various anonymizer proxies are available, but they aren't worth much because they get blacklisted quickly and often have terms of use prohibiting scraping or bots. Lastly, if you're researching this topic, you need to be aware of illegal botnets. They're criminal, and of course I don't advocate them, but some people make a hobby of taking over large numbers of home computers and putting them to work doing things, or just being creepy.

Various ways to anonymize scraping:
1. Shell account
2. IaaS-provisioned server
3. Scraping companies
4. Anonymizer proxies
5. Being a creep

These are still viable ways of gathering data without using your own IP address. But I think I've found an easy, powerful, and convenient way that sidesteps the downsides of each of the approaches listed above. One of my projects required me to scrape some data fairly regularly, so I decided to start from scratch and see if I could eschew these traditional approaches and use the Tor network to avoid getting tagged and blocked by an intrusion prevention system.

From their website: "Tor is free software and an open network that helps you defend against traffic analysis, a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security."

I've been toying with Tor for the last few years. It's been useful for getting around bad corporate VPN or routing problems. More than once I've been stuck in a corporate conference room or venue with poorly configured routing or bizarre filtering. Not a problem; Tor has always been able to get my traffic out to the open Internet, even when the corporate networks have blacklisted commercial proxy providers. When most people use Tor, all they see is the Tor Browser Bundle and the Vidalia interface. Under the hood, these components interact with a SOCKS proxy, which is your interface to the greater Tor network. By default, if you're running Tor on your machine this proxy is exposed on port 9050 and can be accessed by anything that speaks SOCKS. This is good news because SOCKS is pretty standard and libraries exist for most programming languages.

For our purposes we're not concerned with extreme privacy. If you are, please go to Tor's web site and download the latest version. We just want to use the onion routing network, so Ubuntu's packages are fine.

sudo apt-get install tor
sudo /etc/init.d/tor start

Your computer should be connected to the Tor network now and have a SOCKS proxy listening on port 9050. If you're not behind a software or hardware firewall, you might want to enable some sort of authentication. Then you should go buy a cheap hardware firewall and install it. After that, set aside time to re-evaluate the rest of your security practices and contemplate your approach to life in general. If you thought it was OK to have an unprotected computer connected to the Internet, you're probably going wrong elsewhere.
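If you do want to restrict who can use the proxy, a couple of torrc lines will do it; a sketch (binding the listener to loopback and refusing everything else):

```
# /etc/tor/torrc -- keep the SOCKS listener local-only
SocksPort 127.0.0.1:9050
SocksPolicy accept 127.0.0.1
SocksPolicy reject *
```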

ss -a | grep 9050
tcp    LISTEN     0      128       *:*

We'll use Python to take advantage of that proxy. There's only one module that needs to be installed on a default Ubuntu 13.10 system. Here's the command to get it.

sudo apt-get install python-socksipy

Any scripts you've written with python and urllib2 are pretty easy to re-use anonymously. Here's the simplest working example.

import socks
import socket
import urllib2

# Route all new sockets through Tor's local SOCKS proxy on port 9050.
# The final argument asks the proxy to do DNS resolution remotely.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", 9050, True)
socket.socket = socks.socksocket

url = 'http://example.com'  # substitute the page you want to scrape
data = urllib2.urlopen(url).read()
print data

By building out the above code you can scrape anonymously. Build an array of URLs, then create a loop to iterate over it and you're off to data heaven. But your exit node will always be the same IP address. The way around this is to launch multiple connections to the Tor network. Tor can be started on the command line as shown below; you'll need to specify a unique control port, SOCKS port, PID file, and DataDirectory for each instance. It's trivial to write a shell script to do this 5, 10, or even 100 times with unique values. Proceed with caution: I've noticed that Tor nodes can chew up a lot of CPU time. By using this trick and varying which instance your scraper uses, you can distribute your traffic across <n> Internet IP addresses pretty easily.

tor --RunAsDaemon 1 --ControlPort <control port> --PidFile tor<n>.pid --SocksPort <socks port> --DataDirectory data/tor<n>
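Here's a sketch of the per-instance bookkeeping in Python (the port bases and directory layout are my own arbitrary choices, not anything Tor requires):

```python
def tor_command(n, socks_base=9060, control_base=9160):
    """Build the argv for Tor instance n with unique ports, PID file,
    and DataDirectory, mirroring the command line above."""
    return ["tor", "--RunAsDaemon", "1",
            "--ControlPort", str(control_base + n),
            "--PidFile", "tor%d.pid" % n,
            "--SocksPort", str(socks_base + n),
            "--DataDirectory", "data/tor%d" % n]

# Launch each instance with subprocess.call(tor_command(n)), then point
# each scraper at its own SOCKS port (socks_base + n) to spread traffic.
for n in range(3):
    print(" ".join(tor_command(n)))
```

The DataDirectory paths need to exist before launch, and each instance builds its own circuits, so each gets its own exit IP.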

Please share the link to this page if you found it useful.

Updated Dec 2014:
Although it is possible to scrape websites using Tor, please don't engage in behavior that would get the exit nodes blacklisted. There are only a finite number of them and they are a valuable resource for the community. As another option, I've written a very lightweight HTTP scraping proxy which can be downloaded from my GitHub account here: if your scraping behavior may end up in blacklists, it might be best to use something like this on ephemeral hosts so you don't damage community resources.


Friday, November 15, 2013

Memory vs Disk for Data Platforms

Some interesting charts. I wrote this in response to a colleague asking my thoughts on a paper Intel released.

Graph of memory and Disk prices:

Corporate Data Growth:

Of the two charts above, only one can be represented in a browser without resorting to the use of a logarithmic scale.

I had to copy 10 GB of files to a spinning disk last night on my desktop (something now to be avoided whenever possible). It went at a blistering 10 Mb/s. The NIC in the machine operates at 1 Gb/s, and my Internet connection is 75 Mb/s. Something is wrong here.
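Taking the rates above as megabits per second, the arithmetic of why this hurts (a quick back-of-the-envelope, assuming decimal units):

```python
def hours(gigabytes, rate_mbps):
    """Hours to move a payload at a given line rate
    (decimal units: 1 GB ~ 8,000 megabits)."""
    megabits = gigabytes * 8 * 1000.0
    return megabits / rate_mbps / 3600

print(round(hours(10, 10), 1))    # the spinning disk: ~2.2 hours
print(round(hours(10, 1000), 3))  # a 1 Gb/s NIC: ~0.022 hours (80 s)
print(round(hours(10, 75), 2))    # a 75 Mb/s Internet link: ~0.3 hours
```

A local disk copy taking over two hours while the network link could have done it in minutes is exactly the wrongness I mean.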

Cool new in memory projects that are gaining momentum:

The important chart:

After looking at the evidence I'll comfortably make the assertion that disk is dying as a medium for anything other than archival storage. This is a different strategy from cache optimization of RDBMSs and related technologies. However, optimizing code and algorithms to avoid cache misses is still cool and useful.

Because corporate data growth is progressing more slowly than memory is getting cheaper and more plentiful, it makes sense to seriously evaluate architectures with enough memory to hold entire data sets.

I suspect that in the foreseeable future memory will creep closer to the cores (much bigger caches) or the cores will creep closer to the memory (new architectures?). This doesn't seem to be getting much attention yet, perhaps because of the current lack of software written to run well with these new capabilities.

Sunday, November 10, 2013

Baby monitor fix

Our Samsung baby monitor has seen some hard use and stopped working due to a broken power connector on the screen unit. Here's the model:

When it broke we didn't really need it as much. Our son was old enough to be fine on his own in the crib; no worries about him not being able to roll over or push a blanket off his face. Everything changed last week when our son managed to crawl out of his crib. Today I gathered up the parts from the monitor and sat down to do my first bit of soldering in our new house.

I managed to find everything but one of the power adapters. A glance at the output voltage of the remaining wall wart let me know that the board expects 5V at 1A. That was great news, since it's right in line with the power levels of most USB chargers. I ran off to my bin of spare cables, came back with a mini-B USB cable, and set to work determining the pinout and cutting it apart. I was in full "mad-scientist mode" instead of contemplative "engineer mode"; this is when I make most of my mistakes. I'll mention how I screwed up later, but it started with chopping this cable up prematurely.

My collection of chargers ranges from 300mA to 1A. I ended up having an 850mA charger handy to test with; close enough (remember, mad-scientist mode here...). I verified everything on the monitor PCB by poking around with a multimeter on its continuity setting. Most flexible cables, such as USB cables, are made of lots of smaller pieces of wire twisted together. I hate trying to solder cables like that directly to a printed circuit board because there's always the potential of a stray wire strand shorting a connection. Digging through my parts bin I found some integrated circuit boards and cut one apart. I could then take my time soldering the stranded USB cable to this instead of directly to the monitor's PCB.

With my USB cable soldered to the IC PCB piece, I hot glued the IC PCB piece to the monitor circuit board with the IC side pin holes lining up perfectly over the broken power connector holes. Note in the picture below that I unplugged the lithium battery before I did this. Accidentally shorting rechargeable batteries can cause explosions or fires. Once the hot glue cooled, I placed a solid wire down through the hole and held the soldering iron against it. As the wire heated, it melted the solder it was touching below. After removing the heat, the solid wire was permanently connected to the monitor's board. I clipped the wires short, soldered the top to my IC PCB and verified everything was still good to go with the multimeter's continuity setting again.

A final inspection, a little more hot glue, and I plugged in the power for the big test. It works! Great!

I clipped a hole to allow for the cable to come out of the case, then hot glued everything like crazy so that strain on the cable won't pull on any of the circuit boards.

If you don't have a hot glue gun, they are cheaper than you'd expect.

I left 2 or 3 ft of cable since I didn't expect the battery to still hold a charge, meaning we'd only be able to use it when plugged in anyway. I was wrong, but it's only a matter of time until the battery craps out, so I don't feel too bad about it.

That mistake I mentioned earlier... here are the details.

In my haste to fix the monitor I didn't really take a close look at the rest of the board. I just assumed that the section of the board with the broken barrel power connector and battery connections would be the only way to power the board up. After the fix, when I was putting everything back together, I noticed a mini-B USB port on the opposite corner of the board! Unfortunately, the mini-B USB cable that I had just chopped up to complete this fix was the only one I had. This fix could have been as easy as cutting away the case around the mini-USB port and plugging in the cable, saving me the time and parts and making for a much cleaner fix. I may never know. I cut the case away anyway; maybe I'll find another mini-B USB cable somewhere and give it a try.

Saturday, November 2, 2013

Axe Handle

I found two forged steel double bit axe heads pitted with rust and forgotten on a shelf in a barn. It's hard to find forged steel axes (now expensive and rare) so I took the time to clean the smaller of the two and put a nice handle on it.

Most modern axes are cast. During casting, molten metal is poured into a clay or sand mould and left to cool before final shaping is done by removing excess metal with filing or machining. When cast, the metal in the axe ends up being homogeneous and of a type that can reasonably hold an edge, somewhat tough and somewhat brittle. Trade-offs are made. In contrast, forged axes are created by heating and pounding metal into shape. When done in mass production, as this axe head was, it's usually not a person doing the pounding but rather a giant mechanical hammer. Forging requires that the metal be somewhat malleable (aka tough) so it won't tear during the forging process. Since malleable steel doesn't hold an edge, a more brittle piece of steel with a higher carbon content is forge welded to the edges of blades. Metal also has "grain" much like wood does. Forging stretches and aligns this grain with the shape of the object. The forging process results in a nearly indestructible axe head that will keep a razor edge for a long time. I know, axe heads aren't necessarily known for being fragile, but realistically there is no chance of hairline fractures in the metal near the handle since it is a tougher metal in the center. The axe can also be made thinner and lighter since the center part won't crack if misused as a hammer or when accidentally striking a rock. For a more in-depth explanation of these concepts, check out this great video by Ben Krasnow on heat treating metal. I took some care to avoid unnecessary heating of the metal so as not to change the temper.

If you spend any time shopping for a serious axe for wilderness treks or timber work, you'll note they are all forged and cost several hundred dollars. Of course, if you don't care about any of this you can go to Home Depot with $30 in your pocket and walk away with a cast axe to thump away at stumps in your yard. It will be over-built to compensate for the brittle metal and won't keep an edge very long, but they're great for banging off rocks while cutting roots. I have one of those sitting in the corner of my garage covered in mud. This axe will lead a very different life. There is a maker's mark on one side of the axe and a Flint Edge logo on the other. Some online searching later, I discovered that the latest this axe head might have been forged was 1949, but it could be a few decades older. All this considered, I was excited at the chance to bring this beauty back to life and own a quality, unique axe that's also a piece of history.

The lines near the edge of the blade (visible in the image to the right) are where the high carbon steel was forge welded to the rest of the axe head. Both pieces of steel were heated to near melting, covered in a flux (common household borax works), and pounded together. I like that the lines are so clear on this axe; they speak to what it is. The red oak should be better than the osage at resisting over-strikes, and I think the osage is a little lighter and more flexible than red oak. If I ever do any serious timbering it will be with a chainsaw, so I made the overall length more like a smaller forest axe as opposed to a long felling axe. It's 29 inches long. I think this is the perfect size to put in the truck for longer camping trips, fishing, or the canoe. Short enough to pack or to split kindling semi-safely, but long enough to get a good swing at something big.

The video starts off with me showing how to pound off an axe head without destroying it. I rarely see an older axe head that isn't deformed from hammer on blade contact. A safer and less destructive way is to drill it out and use scrap wood to protect the blade from the hammer. If you look closely you'll notice the head I remove is on backwards. I didn't record the original separation and it was easier to put it back on that way. Both of these axe heads have an aggressive reverse taper inside them. I ended up compensating for this by pounding wood wedges soaked in wood glue into the final axe. The glue soaked wood slivers nicely filled up the top of the axe and filled out the rest of my shim slot. It was easily trimmed back with a saw blade and made uniform with a quick touch from an angle grinder with a wire wheel. This head will never come loose, move, or let moisture get under it. I made the metal shim by putting a taper and some grooves on a piece of scrap metal (forgot to record that). The process of banging on the head and the shims ended up destroying the bottom of the axe handle. That was another reason to put the extension on with the box joint jig. I think this concession wouldn't do for a real axe connoisseur or even for a serious axe; but I did the work and I know the joints are solid. I don't think it will break even under severe use. It's far away from any critical stress points in the handle. Plus, I think the contrast makes it look good and will leave most woodworkers puzzling over how I was able to make such a clean joint with those dimensions. If I ever do this again I'm going to make the handle longer and wider than needed and cut it to length and shape it after the head is on and the shims are in. Or spend more time custom fitting the head; I may have made this all go together a little too tightly.

Functionally and aesthetically I think this turned out to be a great tool. Axes are most useful when chopping the sap free wood of fall and winter while camping. The leaves are turning, maybe I'll get a chance to test it out soon.

Update: I put a handle on the other axe head today. I ended up not sanding it, just wire-wheeled the rust off. For ten dollars at Tractor Supply I scored a hickory axe handle with a hickory shim. It took me less than 10 minutes to fit it with a file, then glue and pound in the shim. Practical, but less fun than making one from scratch. After the glue dried, I took the time to sand off the polyurethane coating and cover the handle in danish oil. The reason for an oil finish is repairability as opposed to durability. If I nick the handle, I can just dab some oil on it and it's protected; surface coatings are not as resilient or as easily repaired.

While sharpening, I noticed that the steel was much harder than any of my other axe heads; the file didn't want to bite into the metal. A skittering file is a common way to diagnose hard steel. I ended up using my orbital sander to shape the initial edge. I'll wander into the garage and finish off the edge later this week. Here's a picture: