Saturday, December 7, 2013

Two years of my book reviews



My wife says I read more than anyone she has ever met. I thought that was a pretty nice compliment; she reads voraciously as well, and we both want to pass that on to our son (and pending daughter). Even though he's only two years old, I don't let him go to bed without us cracking a book and pointing at some pictures. The work pays off: he can already recognize most colors, shapes, and some numbers.

I started this post as a summary of what I read in 2013 (since it's the season for end-of-year blog posts), but I decided to include 2012 as well. Looking through my notes, I found a ton of good books that I have never mentioned here. I could go further back, but I had to cut it off somewhere; this list is already out of control. I think I'm writing this in part to avoid real work, so there's that too. This ended up being an epic blog post, but I honestly enjoyed pulling it together and reviewing what I've consumed over the last two years. It's good to look back.

How I was able to do this is kind of interesting and unintentional. When reading a book I'd always fold over pages and put little check marks near lines I thought were exceptional or worth revisiting. This later evolved into using those little arrow stickers, and eventually into dictating the lines to my Android phone and Evernote. Since my Evernote entries have time stamps, I can tell when I finished reading something, which is nice and lets me look back at what I've read over a certain time frame.

So here it is, a list of what I've read over the last two years and what I thought of it. If I thought the book worth recommending, I took the time to include a picture of its cover. If I thought it was very good, the picture is bigger.



Non-Fiction:


In the Plex – Steven Levy
Still reading as I'm writing this blog post. I like the company more after reading most of this book.

Radical Abundance – Eric Drexler
Eric Drexler's 1986 book Engines of Creation ignited the fascination with everything nanotechnology over the last 30 years and is one of the most inspiring books I've ever read. This book details how the politics of government funding and media hype crushed the prospects of seeing his Engines of Creation predictions come true over the last few decades. After lambasting some fools, he lays out some new predictions. My favorite line of the book: if you understand the implications of this and feel like telling everyone you meet, go lie down on the floor and wait until the feeling passes. I laughed at that; he's right. Most people aren't ready to hear how much things are going to change. It's a lot of work to think this stuff through to its natural conclusions, and few people want to do that work.

How to Create a Mind – Ray Kurzweil
If you're reading to catch up on Law of Accelerating Returns predictions and haven't read any of Kurzweil's other books, start with The Singularity is Near; you won't find many predictions in this book. What you will find is well-researched support for a theory about how the human mind works, and a survey of the current state of research around building or emulating it. My favorite part was the chapter "Thought Experiments on the Mind". In usual Kurzweil style, he broadly surveys the literature around each of the topics presented and then responds to each with his own thoughts or cross-references. Each book Kurzweil writes seems to be better than the last.

Age of Context – Robert Scoble and Shel Israel
Worth reading; the authors are tuned into what's going on, but so am I, so nothing in here was a big surprise. It's pretty much a list of companies and why they are disruptive. I'll be investing when they have IPOs; my notes are a list of who I think is worth watching closely. I think the top company to watch in the next few years is https://www.uber.com/

Turing's Cathedral – George Dyson
Great read! A historical account of the first few years of computers and a great reminder of where all this stuff came from. My favorite part: the mention of A Million Random Digits with 100,000 Normal Deviates. To find out why I think that book is cool, read this Dr. Dobb's article about how a challenge to compress the file has lasted ten years! http://www.drdobbs.com/architecture-and-design/the-enduring-challenge-of-compressing-ra/240049914


Social Intelligence – Daniel Goleman
Not on the top of any of my lists. A collection of pretty obvious anecdotes about how social intelligence is as important as, or more important than, raw intelligence. To be 100% honest, the books I read from a Christian perspective by authors like Andy Stanley or John Maxwell are more useful and contain better reminders and solid advice about how to treat people well and behave in life. A few years of neuroscience research doesn't hold a candle to thousands of years of research and practical application.

Wheat Belly – William Davis
I went gluten-free after reading this ...for almost two months. I decided that the inconvenience wasn't worth it and went back to eating bread and cereal. Most interesting concept: there's little incentive for medical research that doesn't result in a drug or device that can be sold at a profit. In the case of celiac disease, there's even an entire industry potentially suppressing research. Bummer.

Presenting Data in Tables and Charts – David Levine
zzzzzzzzzzzz boring. If you want to learn how to present information read Edward Tufte or Colin Ware.

Sexy Little Numbers – Dimitry Maex
Very much worth a read. My favorite part is the detailed descriptions of marketing strategies. I liked the metrics-driven approach with its acorns, golden nuggets, and jackpots vernacular. Some cautionary tales of failure as well: "Some people use data analytics like a drunkard uses a lamp post, for support not illumination."

Leadership Gold – John Maxwell
Loved this book. I had read it when I was a Marine Lieutenant and re-read it this year. Some of my better random notes: "Great people develop those around them. Small people will attempt to put the same limits on others that they put on themselves." "Activity does not equal accomplishment." Twenty-five years of experience is the same as one year of experience if you just repeat the same year over and over again and don't learn. Reflection turns experience into insight. Experience alone means nothing; evaluated experience means everything. If a cat sits on a hot stove, it will never sit on a hot stove again; but it won't sit on a cold stove either. Cats don't have the mental capacity to evaluate risk, and sometimes people are the same way.

How to Stay Motivated – Zig Ziglar
I heard somebody make a joke about Zig, and I didn't know who he was, so I picked up this book. It actually was pretty decent. Written in nineteen-sixty-something, and it all still holds up. My favorite note from this book: develop people like you mine for gold; expect to move a lot of dirt to get to the valuable stuff. People will become what you tell them they are or will be, and words matter.


The Frontiersman – Allan Eckert
I can't recommend this book enough. Amazing historical fiction following the life and times of Simon Kenton. The challenges people faced during the 1780s in America make even the worst problems of the world today seem silly. After reading it I bought a tomahawk and a canoe (just because I'm 32 doesn't mean I have to act like an adult). One of my unfortunate discoveries is that nearly all historical flintlock rifles have been destroyed, having been converted to percussion cap firearms. After reading this I bought the next book on my list.

Firearms, Traps, & Tools of the Mountain Men – Carl Russel
I think I got this book for a few dollars on Amazon; it's probably out of print. Lots of great illustrations and historical descriptions of stuff that doesn't exist anymore. The greatest minds of our time are writing software or working on space travel; the greatest minds of that time were making mechanical devices like traps and rifles and refining metallurgy. The genius in their designs is apparent.

North American Bows, Arrows, and Quivers: An Illustrated History – Otis Tufton Mason
Mostly I just looked at the pictures. Then I made a Cherokee bow out of hickory and sinew string (not a joke, really). Maybe I'll write a blog post about that. Amazingly, there is a Kindle version of this book.

Camping and Woodcraft – Horace Kephart
Fantastic advice on, well, Camping and Woodcraft. Written in 1906. Not to be confused with the next book on my list.

Woodcraft and Camping – George Sears Nessmuk
I'd rate Kephart's book as better, but Sears published his more than 20 years earlier, in 1884. The best part of the book was his obsessive quest to find the perfect double bit hatchet. These are still incredibly hard to find in quality form.

Kant in 90 Minutes – Paul Strathern
I picked this up because I had heard about Kantian ethics as opposed to utilitarianism. http://en.wikipedia.org/wiki/Kantian_ethics It was an insight into his life and times and what drove him. My favorite part is the description of the difficulty of scheduling events in his home country because it had four official time zones, none of which were based on sensible geospatial regions. Couple this with a lack of accurate clocks and hilarity ensues.

Do the Work – Steven Pressfield

Turning Pro – Steven Pressfield
Both great books on their own merits, but not as good as the next book on my list. I believe they are both derivative of it, and neither comes near its excellence.

The War of Art – Steven Pressfield
If you haven't read this book, drop what you're doing and buy it right now. This is one of the big ones. Hugely important. A guest speaker at my church made a passing reference to it, and I was surprised that Pressfield had written non-fiction; I had only read his Gates of Fire years before. It's Pressfield's secret guide to channeling creativity, overcoming procrastination, and getting things done. I've read it three times over the last two years, and every time it's a real kick in the butt to get going.

Purple Cow – Seth Godin
Why is the Mona Lisa special? Because it was stolen. Things are desirable because they stand out, rarely due to objective quality. This book lays out a bunch of great examples of this. Seth has a great blog too. Check it out. http://sethgodin.typepad.com/

Linchpin – Seth Godin
I picked this book up in an airport at random. It looked good, and it really was. It details why the employee-employer relationship has changed and how to ensure job security by choosing to be exceptional at what you do.

All Marketers are Liars – Seth Godin
No really. They are. Lies can be good: the wine glasses that make wine taste better, but only if somebody tells you they are supposed to make it taste better (the placebo effect). Lies can be bad: Nestlé marketing powdered baby formula as healthier, causing young mothers in Africa to forgo breast milk and killing their babies with polluted water mixed with the advertised formula.

Born to Run – Christopher McDougall
I had noticed a co-worker wearing toe-shoes and struck up a conversation that ended with him recommending I read this book. So I did. It details the adventures of ultramarathoners and a native tribe in Mexico called the Tarahumara who are legendary for their running ability. There's some analysis of why modern sneakers are bad for us and flat shoes/bare feet/sandals are good for us. I buy shoes differently now, opting for flat soles and the least "support" possible.

Everything is Obvious – Duncan Watts
Duncan Watts gave a presentation at a local TED conference that I missed. Later, I was having lunch with a colleague who recommended his book. It's a good discussion of how unpredictable things are and how we convince ourselves we saw the indicators after the fact. A cautionary tale about how predictable many things really aren't.

Data Mining – Ian Witten, Eibe Frank, Mark Hall
A somewhat rough introduction to machine learning. Programming Collective Intelligence is a much better book if you're just starting off. Weka is a very broad ML tool and this book covers a lot of it. For ML tasks and education I've mostly fallen back to the Python ML libraries and Vowpal Wabbit, making this book kind of useless for me. However, if you're invested in Weka, this is a must-read.

Coders at Work -  Peter Seibel
Interviews with some of the most famous and talented programmers on the planet. Peter Norvig, Ken Thompson, Don Knuth, and many more. My favorite part: they all use print statements to debug instead of debuggers. I was always kind of ashamed I never spent much time with breakpoints and debuggers, but I feel better about it now.

The Better Angels of Our Nature: Why Violence Has Declined – Steven Pinker
I saw his TED talk, then I got halfway through his book. I get it. I agree with his observations, and it's hard for me to understand why more people don't get this. For further evidence read The Frontiersman, which I noted earlier. Violence is declining.

The Elements of Style – William Strunk
I bought it on a whim after it was mentioned in one of the interviews in Coders at Work. I'd hardly say I read this; more like skimmed through it and tried to figure out why people think it's so good. I'm sure it is, but there are better books on this (like the next one on my list).

Essential Communication Strategies for Scientists, Engineers, and Technology Professionals – Herbert Hirsch
This was a re-read. I read it several years prior. Pragmatic and funny, this is my go-to book for good technical writing.

The New Digital Age – Eric Schmidt, Jared Cohen
Pass; I finished it, but it got to be a struggle. Two interesting people, not a very interesting book; maybe they canceled each other out. This is more of a "what just happened" book than a "what is about to happen" book.

Seeing Further: The story of Science and the Royal Society – Bill Bryson
I read this shortly after reading Neal Stephenson's Baroque Cycle. Freaking fantastic. I read Bill Bryson's A Short History of Nearly Everything a few years ago, and this was an excellent book as well. I also read At Home, which was good too. Bryson is a talented non-fiction author who can find the interesting aspects of almost anything. In this book he covers the foundations of modern science, which makes it doubly interesting.

Fiction:


Daemon – Daniel Suarez
My top recommendation for fiction. An eccentric millionaire video game company owner dies, and a computer daemon and assets he prepositioned wreak havoc on social structures. A little gory (the first few sentences have a beheading), but as the story goes on it gets smarter, and although there is a lot of violence the plot isn't cheap. The book is fantastic. Interesting that Daniel had to work so hard to get it published.

Freedom (TM) – Daniel Suarez
The sequel to Daemon. It's even better than the first book, and a lot of what the eccentric millionaire was trying to accomplish makes more sense. I found myself annoyed that there wasn't a third book.

Kill Decision – Daniel Suarez
I read this because Daniel's other books were awesome. This one, not so much. It talks about autonomous UAVs, which is cool, but it didn't really resonate with me the way the other two books did. Not enough geek culture, too much government conspiracy stuff. If I could go back in time I'd tell myself to pass on this one.

Earth Afire – Orson Scott Card
I read this in preparation for the Ender's Game movie coming out. It's a prequel to Ender's Game. I felt like Orson Scott Card was really off his game, and it turns out my suspicions were half right; he had a co-author on this one. Bummer. Unless you're a die-hard Ender's Game series fan (like I am), don't bother. If you are, pick it up; it's not terrible.

A Fire Upon the Deep – Vernor Vinge
A few years ago I read Rainbows End on a recommendation from somebody I met at a conference. A Fire Upon the Deep did not disappoint. Hardcore science fiction: singularity meets sentient artificial intelligence meets interstellar travel. All awesome.

A Deepness in the Sky – Vernor Vinge
Not as good as A Fire Upon the Deep, but still OK. AFUTD really resonated with me; this one didn't as much. I don't know if it was my mood when I read it or the book.

Some Remarks – Neal Stephenson
I always enjoy Neal Stephenson and buy anything he publishes. The best part about this book is that the best part can be read online for free. Out of all the essays I enjoyed "Mother Earth Mother Board" the most. It was originally published in Wired and is still there: http://www.wired.com/wired/archive/4.12/ffglass.html


Last of the Amazons – Steven Pressfield
Some women cut off one of their breasts to be better archers. I enjoy reading about warfare, women's rights, and archery. Having all three together may have spoiled it for me. Not my favorite Pressfield book.

The Afghan Campaign – Steven Pressfield
Pressfield was a Marine, and he writes better to the experiences of current and historical warriors than anyone else I know of. My favorite Pressfield book is still Gates of Fire, which is based on the Spartans at Thermopylae and served as a basis for most of the movie 300. This book is a close second. It follows a handful of Alexander the Great's soldiers as they fight their way into Afghanistan in 330 B.C.

Ruins – Orson Scott Card
A while ago I read Pathfinder by Orson Scott Card; I enjoyed it so much that when Ruins came out I finished it within a few weeks of its release. Orson Scott Card's first novel, Treason, contained some of his best writing, and I think in these books he takes the best aspects of Treason and his space sci-fi and mixes them together. A great combination. The book ends on a cliffhanger, so I expect another book soon. Unfortunately I think Orson was sidetracked by the Ender's Game movie and pressured into writing Earth Afire instead of continuing this series.

Interface – Neal Stephenson
It's neat to have Neal Stephenson writing a thriller set in Northern Virginia where I live. Weird to have him writing about politics, but the book has some neat concepts and scenes in it.

Anathem – Neal Stephenson
I can't write much about this one without spoiling the plot. Each plot change was a major twist, and it was extremely interesting. I can say that if you like Neal Stephenson you should pick this up. Neal stopped at Google to discuss this book (http://www.youtube.com/watch?v=lnq-2BJwatE), and it was inspired by the Clock of the Long Now: http://blog.longnow.org/02008/09/02/neal-stephenson-and-the-10000-year-clock/

The Castle – Franz Kafka
For a detailed review of this book, see my blog post on Federal Certification of Information Systems. I'm kidding... kind of.

Distrust that Particular Flavor – William Gibson
An exploration of William Gibson's unique version of Otaku. His obsessive interests really appealed to me when I was a teenager and my copy of Burning Chrome fell apart long ago from being read so many times. “The future is here, it's just not evenly distributed.”

Moby Dick – Herman Melville
While stuck in an airport I found a book titled "Why Read Moby-Dick?"; I read the back cover, then popped open the Kindle app on my Android tablet and got Moby Dick. Melville was a true master of words, and the book is so fluid and lucid that it's almost like poetry.

Reamde – Neal Stephenson
Not a typo, but a great book about massively multiplayer video games, entrepreneurship, and terrorists. If the fact that Neal Stephenson wrote it and that it contains these topics doesn't have you rushing to buy it, nothing else I can say will sway you.

World War Z – Max Brooks
For years I've heard this book was good, but I never picked it up because most of the zombie craze doesn't really excite me. Don't get me wrong, I enjoy a good zombie movie or show as much as anybody; maybe an aversion to media glorifying post-apocalyptic scenarios is a side effect of having lived as a US Marine in Iraq for over two years. But this book turned out to be as fantastic as everybody said it was. My favorite chapter: the downed female pilot. Read the book; you'll know what I'm talking about when you get to it.

Faith:


Things a Computer Scientist Rarely Talks About – Donald Knuth
Don Knuth, a.k.a. "the father of computer science," was asked to come to MIT and give a series of lectures on whatever he felt like. So he talks about infinity and what probability theory can tell us about free will. Then he talks about whether mathematics can enhance our personal understanding of the Bible. This book is an all-out defense of the faith from one of the smartest people on the planet, in front of one of the smartest audiences on the planet. Don was somebody I respected deeply long before I decided Christ's words made sense. Finding out this book existed was one hell of an affirmation of my beliefs. I plan to read Don's other book on Christianity, 3:16, in which he randomly samples Bible verses and dives deep into researching the passages that come back.

Revolutionary Parenting – George Barna
One of the best parenting reads I've had. I abhor hearing parents say things like "kids are just going to do what they are going to do and we can't stop it". I think a huge part of what our children become is based on our expectations of them and SHOWING THAT WE CARE. Let them grow, but show them that you care about how they turn out by being involved in their lives, asking how they feel about things, and letting them know how you feel.

Enemies of the Heart – Andy Stanley
This book covers how to deal with four emotions that every human has to contend with: guilt, anger, greed, and jealousy. Being wise doesn't mean you won't have problems, but it does mean you won't be the source of your problems. This book is full of great practical advice for getting control of these emotions and making sure they don't control your decisions.

Deep & Wide – Andy Stanley
In northern Virginia we have access to some awesome churches that are intelligent, supportive, committed, vibrant, and packed full of people that generally really have their lives together and are eager to help others out (of course there are exceptions but they are welcome too!). Andy pastors a similar church in Atlanta. This book covers how he did it. When I travel to other areas of the country where churches are in decline I sometimes get disappointed and wonder how they even still exist. I think some of them are sustained on mere habit or tradition and I think that's unacceptable. This book is about how to build one of the good churches.

The Case for Faith – Lee Strobel
Lee Strobel wrote The Case for Christ, and this is a follow-up to that book. Lee is one of billions of people who have come to follow Christ after earnestly studying his words. This is a nice account of the second half of his journey as he grapples with some of the more advanced intellectual questions along the way.

What Christians Believe – C.S. Lewis
This book is a watered down summary of C.S. Lewis's Mere Christianity. It serves a purpose, because Mere Christianity is a very intellectually heavy book; it is also one of my all-time favorites. Mere Christianity was given to me by a friend at a time when I wasn't paying much attention to the bigger questions in life. In that book Lewis tackled the most difficult questions he faced on his journey from atheism to following Christ. He wrote about intellectual cowardice and the need to constantly challenge your most basic assumptions and go where the truth leads you. Led along the path by J.R.R. Tolkien and others, his reasoned journey made him an ardent defender of the faith. His ability to articulate what he believed and why led to an opportunity to speak to British soldiers and pilots who were routinely facing the prospect of death. Those talks were broadcast as a series on the BBC during WW2 and later compiled into Mere Christianity. If you're intellectual and want to know what Christians believe, pick up Mere Christianity. If you're not too much of a thinker, pick up this book.

The Unshakable Truth – Josh McDowell
Josh wrote one of the all time most popular defenses of the faith called More Than a Carpenter in 1977. He revisits those arguments in this book, but after 30 years of refining his approach. It was a decent read which I picked up after Josh McDowell came to speak at my church.

How Good is Good Enough – Andy Stanley
A freaking great question and a great book about how to answer it. Short, too (200 or so pages). This book is a discussion of what separates Christianity from every other religion on earth. Famed atheist Dan Dennett gave a TED talk in which he made a policy proposition to make it mandatory to teach about every religion in the world. I agree with him; hiding the existence of other religions isn't fair or responsible. But I think if this book were in that curriculum, Dan's plan would not have its intended consequence of jading our youth toward all religion. If you believe in the afterlife by default, or if you're just aware of Pascal's Wager and think you should take some time to evaluate this stuff, this has to be a question on your mind. Exactly how good is good enough? With all that's potentially riding on it, it's worth the time to flip through this book.

Wednesday, December 4, 2013

Argument list too long

eric@glamdring-desktop:~/workspace/gorilla/sneed_biometrics$ cp trainingfaces/* ~/workspace/trainingfaces/
bash: /bin/cp: Argument list too long

What the!?
Grrrr...
cd trainingfaces
ls | while read a; do cp "$a" ~/workspace/trainingfaces/; done

Victory.
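
For the curious: the error comes from the kernel's fixed limit (ARG_MAX) on the combined size of arguments handed to a single exec'd command. The shell expanded trainingfaces/* into one enormous argument list before cp ever ran, while the loop invokes cp once per file and never trips the limit. If you'd rather sidestep the shell entirely, here's a minimal Python sketch of the same copy; it assumes the directory layout from the session above and that the directory holds plain files:

#!/usr/bin/python
# Globbing in-process avoids ARG_MAX entirely, since no giant
# argument list is ever handed to an exec'd command.
import glob
import os
import shutil

dst = os.path.expanduser('~/workspace/trainingfaces/')
for path in glob.glob('trainingfaces/*'):
    shutil.copy(path, dst)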

Wednesday, November 20, 2013

Software is getting better!

A friend made a joke about May's Law today. David May states his law as follows: "Software efficiency halves every 18 months, compensating for Moore's Law." Larry Page has made similar statements, of course referred to as Page's Law. Sure, it's profound, funny, and good marketing to assert that software always seems to run slower. But the truth about software efficiency writ large is the opposite. Here's a report that addresses the assertion directly.

REPORT TO THE PRESIDENT AND CONGRESS DESIGNING A DIGITAL FUTURE: FEDERALLY FUNDED RESEARCH AND DEVELOPMENT IN NETWORKING AND INFORMATION TECHNOLOGY
http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf

Kurzweil's book "How to Create a Mind" tipped me off to this report, and specifically to this section on page 71.
Progress in Algorithms Beats Moore’s Law
Everyone knows Moore’s Law – a prediction made in 1965 by Intel co-founder Gordon Moore that the density of transistors in integrated circuits would continue to double every 1 to 2 years. Fewer people appreciate the extraordinary innovation that is needed to translate increased transistor density into improved system performance. This effort requires new approaches to integrated circuit design, and new supporting design tools, that allow the design of integrated circuits with hundreds of millions or even billions of transistors, compared to the tens of thousands that were the norm 30 years ago. It requires new processor architectures that take advantage of these transistors, and new system architectures that take advantage of these processors. It requires new approaches for the system software, programming languages, and applications that run on top of this hardware. All of this is the work of computer scientists and computer engineers.

Even more remarkable – and even less widely understood – is that in many areas, performance gains due to improvements in algorithms have vastly exceeded even the dramatic performance gains due to increased processor speed. The algorithms that we use today for speech recognition, for natural language translation, for chess playing, for logistics planning, have evolved remarkably in the past decade. It’s difficult to quantify the improvement, though, because it is as much in the realm of quality as of execution time.

In the field of numerical algorithms, however, the improvement can be quantified. Here is just one example, provided by Professor Martin Grötschel of Konrad-Zuse-Zentrum für Informationstechnik Berlin. Grötschel, an expert in optimization, observes that a benchmark production planning model solved using linear programming would have taken 82 years to solve in 1988, using the computers and the linear programming algorithms of the day. Fifteen years later – in 2003 – this same model could be solved in roughly 1 minute, an improvement by a factor of roughly 43 million. Of this, a factor of roughly 1,000 was due to increased processor speed, whereas a factor of roughly 43,000 was due to improvements in algorithms! Grötschel also cites an algorithmic improvement of roughly 30,000 for mixed integer programming between 1991 and 2008.

The design and analysis of algorithms, and the study of the inherent computational complexity of problems, are fundamental subfields of computer science.
The report goes into more detail discussing research priorities. Those in an argumentative mood might point out that algorithms are just a subset of software and that overall software efficiency has been decreasing. I've made this point before when writing and talking about big data technologies. We've certainly seen a change in the way data technologies are developed: we've gone from cleverly conserving computing resources to squandering them creatively. But it's not as if we're doing the same thing, just less efficiently; whole new capabilities have been opened up. The data redundancy of the Hadoop Distributed File System (HDFS) means we can process larger sets of data and overcome the hardware failures that are inevitable on large "compute hour" jobs. When you're employing thousands of disks or cores in a job, the chances of an individual failure are increased. The inefficiency is a risk mitigation strategy: storing the same data three times (by default) certainly isn't efficient, but it makes very large jobs possible.

The data processing technology improvements are just one example; there are many like this across the board. Remember XML? JSON is definitely more efficient. Machine learning implementations are getting faster. As a counter to the improvements, more people are attempting to misuse your personal resources, which may lead to things seeming sluggish if you're incautious. If you're wondering why your Windows operating system seems slower, that's a whole different story. I think that in closed code bases there might be more of an incentive to hang on to old things, leading to inefficiency; but that's just conjecture based on personal experience, read more about my experiences on data munge. Maybe somebody will do a study on closed vs. open execution speed over time. Until then, I'll hold suspect any piece of software where a large community of developers can't look at the code base and improve it. There's a law for that too.
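
The XML vs. JSON point is easy to sanity-check on a toy record (the field names here are invented for illustration):

#!/usr/bin/python
# Compare the byte counts of the same toy record in JSON and XML.
import json

record = {'id': 7, 'name': 'Ada', 'lang': 'python'}
as_json = json.dumps(record)
as_xml = '<record><id>7</id><name>Ada</name><lang>python</lang></record>'
print len(as_json), 'bytes of JSON vs', len(as_xml), 'bytes of XML'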



Saturday, November 16, 2013

Anonymous Web Scraping

Web scraping is the act of retrieving data from a web page with automated software. With a few lines of Perl or Python you can gather massive amounts of data, or write scripts that keep an eye on web pages and alert you to changes.
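
To make that concrete, here's a minimal Python 2 sketch of a change-watcher, in the style of the Tor example later in this post; the URL and state file name are placeholders:

#!/usr/bin/python
# Fetch a page and note whether it changed since the last run.
import hashlib
import urllib2

url = 'http://example.com/'  # hypothetical page to watch
digest = hashlib.md5(urllib2.urlopen(url).read()).hexdigest()

try:
    last = open('last.md5').read()
except IOError:
    last = ''  # first run, no saved state yet
if digest != last:
    print 'page changed!'
    open('last.md5', 'w').write(digest)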

There are a few reasons for not wanting this traffic to look like it's coming from your computer. Sometimes web admins or intrusion prevention systems will label automated traffic as malicious and respond by blocking your IP address. Other times, such as when monitoring systems, you just want to see what happens when you access certain pages from elsewhere on the Internet. Scraping, especially anonymously, is certainly something that can be abused. But I think there are enough good reasons to do it that I'm comfortable writing about this.

The oldest trick I know to remotely scrape web pages is to purchase or own a shell account. These things still exist, mostly for IRC bots these days. Here's a list of 92 shell providers with locations in 13 countries: http://www.egghelp.org/shells.htm Somewhere along the way, IaaS providers such as Amazon Web Services and Rackspace made it easy to provision remote machines in a selected availability zone. They are as convenient as buying a shell, but more powerful, since you have root access to your own operating system. Companies such as http://80legs.com/ have made it their business to help people crawl the web and gather data. Various anonymizer proxies are available, but not worth much because they get blacklisted so quickly and often have terms of use prohibiting scraping or bots. Lastly, if you're researching this topic, you need to be aware of illegal botnets. It's criminal, and of course I don't advocate it, but some people make a hobby of taking over large numbers of home computers and putting them to work doing things, or just being creepy.

Various ways to anonymize scraping:
1. Shell account
2. IAAS provisioned server
3. Scraping companies
4. Anonymizer proxies
5. Being a creep

These are still viable ways of gathering data without using your own IP address. But I think I've found an easy, powerful, and convenient way that sidesteps the downsides of each of the approaches listed above. One of my projects required me to scrape some data fairly regularly, so I decided to start from scratch and see if I could eschew these traditional approaches and use the Tor network to avoid getting tagged and blocked by an intrusion prevention system.

From their website: "Tor is free software and an open network that helps you defend against traffic analysis, a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security."

I've been toying with Tor for the last few years. It's been useful for getting around bad corporate VPN or routing problems. More than once I've been stuck in a corporate conference room or venue with poorly configured routing or bizarre filtering. Not a problem: Tor has always been able to get my traffic out to the open Internet, even when the corporate networks have blacklisted commercial proxy providers. When most people use Tor, all they see is the Tor browser bundle and the Vidalia interface. Under the hood, these two components interact with a SOCKS proxy, which is your interface to the greater Tor network. By default, if you're running Tor on your machine, this proxy is exposed on port 9050 and can be accessed by anything that can speak SOCKS. This is good news, because SOCKS is pretty standard and libraries exist for most programming languages.

For our purposes we're not concerned with extreme privacy. If you are, please go to Tor's web site and download the latest version. We just want to use the onion routing network, so Ubuntu's packages are fine.

sudo apt-get install tor
sudo /etc/init.d/tor start

Your computer should be connected to the Tor network now and have a SOCKS proxy listening on port 9050. If you're not behind a software or hardware firewall, you might want to enable some sort of authentication. Then you should go buy a cheap hardware firewall and install it. After that, set aside time to re-evaluate the rest of your security practices and contemplate your approach to life in general. If you thought it was OK to have an unprotected computer connected to the Internet, you're probably going wrong elsewhere.

ss -a | grep 9050
tcp    LISTEN     0      128          127.0.0.1:9050                  *:*


We'll use Python to take advantage of that proxy. There's only one module that needs to be installed on a default Ubuntu 13.10 system. Here's the command to get it.

sudo apt-get install python-socksipy

Any scripts you've written with python and urllib2 are pretty easy to re-use anonymously. Here's the simplest working example.

#!/usr/bin/python
import socks
import socket
import urllib2

# Point everything at Tor's local SOCKS proxy. The final True asks the
# proxy to resolve DNS remotely (SOCKS4a), so lookups go through Tor too.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", 9050, True)
# Monkey-patch the socket module so urllib2 uses the proxy transparently.
socket.socket = socks.socksocket

# This page reports whether your request actually arrived via Tor.
url = 'https://check.torproject.org/'
data = urllib2.urlopen(url).read()
print data

By building out the above code you can anonymously scrape. Build a list of URLs, then create a loop to iterate over it, and you're off to data heaven. But your exit node will always be the same IP address. The way around this is to launch multiple connections to the Tor network. Tor can be initiated on the command line as shown below. You'll need to specify unique control ports, socks ports, pid files, and data directories for each instance. It's trivial to write a shell script to do this 5, 10, or even 100 times with unique values (a Python sketch of the same idea follows the command below). Proceed with caution; I've noticed that Tor nodes can chew up a lot of CPU time. By using this trick and varying which entry node your scraper uses, you can distribute your traffic across <n> Internet IP addresses pretty easily.

tor --RunAsDaemon 1 --ControlPort <control port> --PidFile tor<n>.pid --SocksPort <socks port> --DataDirectory data/tor<n>
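
Here's a minimal Python sketch of that launcher script; the instance count and port numbers are arbitrary, and it assumes tor is on your PATH:

#!/usr/bin/python
# Launch several Tor instances, each with unique ports, pid file,
# and data directory, per the command line shown above.
import os
import subprocess

for n in range(5):
    datadir = 'data/tor%d' % n
    if not os.path.isdir(datadir):
        os.makedirs(datadir)
    subprocess.call(['tor', '--RunAsDaemon', '1',
                     '--ControlPort', str(9151 + n),
                     '--PidFile', 'tor%d.pid' % n,
                     '--SocksPort', str(9060 + n),
                     '--DataDirectory', datadir])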

Please share the link to this page if you found it useful.

Updated Dec 2014:
Although it is possible to scrape websites using Tor, please don't engage in behavior that would get the exit nodes blacklisted. There are only a finite number of them, and they are a valuable resource for the community. As another option, I've written a very lightweight HTTP scraping proxy which can be downloaded from my github account here: https://github.com/ericwhyne/http-ricochet If your scraping behavior may end up in blacklists, it might be best to use something like this on ephemeral hosts, so that you won't damage community resources.


Friday, November 15, 2013

Memory vs Disk for Data Platforms

Some interesting charts. I wrote this in response to a colleague asking my thoughts on a paper Intel released:
http://www.intel.com/content/www/us/en/it-management/intel-it-best-practices/configuring-in-memory-bi-platform-for-extreme-performance.html

Graph of memory and disk prices:
http://www.jcmit.com/mem2013.htm


Corporate Data Growth:
http://2.bp.blogspot.com/-UWjjCrgY1MU/UmxzEmfk-CI/AAAAAAAAAdE/VvgvjGnCYyg/s1600/Screen+Shot+2013-10-26+at+7.56.52+PM.png

Of the two charts above, only one can be represented in a browser without resorting to the use of a logarithmic scale.

I had to copy 10 GB of files to a spinning disk last night on my desktop (something now to be avoided whenever possible). It went at a blistering 10 MB/s. The NIC in the machine operates at 1 Gb/s, and my Internet connection is 75 Mb/s. Something is wrong here.

Cool new in-memory projects that are gaining momentum:
https://github.com/amplab/tachyon
http://spark.incubator.apache.org/

The important chart:
http://spark.incubator.apache.org/images/spark-lr.png

After looking at the evidence I'll comfortably make the assertion that disk is dying as a medium for anything other than archival storage. This is a different strategy than cache optimization of an RDBMS and related technologies. However, optimizing code and algorithms to avoid cache misses is still cool and useful.
http://stackoverflow.com/questions/18559342/what-is-a-cache-hit-and-a-cache-miss-why-context-switching-would-cause-cache-mi

Because corporate data growth is progressing more slowly than memory is getting cheaper and more plentiful, it makes sense to seriously evaluate architectures with big enough memory to hold entire data sets.

I suspect that in the foreseeable future memory will creep closer to the cores (much bigger caches) or the cores will creep closer to the memory (new architectures?). This hasn't seemed like something folks have been looking into, probably because of the current lack of software written to take advantage of these new capabilities.

Sunday, November 10, 2013

Baby monitor fix



Our Samsung baby monitor has seen some hard use and stopped working due to a broken power connector on the screen unit. Here's the model: http://www.amazon.com/Samsung-SEW-3022WN-Ezview-Baby-Monitor/dp/B004U5BSN4

When it broke we didn't really need it as much. Our son was old enough to be fine on his own in the crib; no worries about him not being able to roll over or push a blanket off his face. Everything changed last week when our son managed to crawl out of his crib. Today I gathered up the parts from the monitor and sat down to do my first bit of soldering in our new house.

I managed to find everything but one of the power adapters. A glance at the output voltage of the remaining wall wart let me know that the board expects 5V at 1A. That was great news, since that's right in line with the power levels of most USB chargers. I ran off to my bin of spare cables and came back with a mini-B USB cable and set to work determining the pinout and cutting it apart. I was in full "mad scientist mode" instead of contemplative "engineer mode"; this is when I make most of my mistakes. I'll mention how I screwed up later, but it started with chopping this cable up prematurely.

My collection of chargers ranges from 300mA to 1A. I ended up having an 850mA charger handy to test with; close enough (remember, mad scientist mode...). I verified everything on the monitor PCB by poking around with a multimeter on its continuity setting. Most flexible cables, such as USB cables, are made of lots of smaller strands of wire twisted together. I hate trying to solder cables like that directly to a printed circuit board because there's always the potential of a stray strand shorting a connection. Digging through my parts bin I found some integrated circuit boards and cut one apart. I could then take my time soldering the stranded USB cable to this instead of directly to the monitor's PCB.

With my USB cable soldered to the IC PCB piece, I hot-glued it to the monitor circuit board with the IC-side pin holes lining up perfectly over the broken power connector holes. Note in the picture below that I unplugged the lithium battery before I did this; accidentally shorting rechargeable batteries can cause explosions or fires. Once the hot glue cooled, I placed a solid wire down through the hole and held the soldering iron against it. As the wire heated, it melted the solder it was touching below. After removing the heat, the solid wire was permanently connected to the monitor's board. I clipped the wires short, soldered the top to my IC PCB, and verified everything was still good to go with the multimeter's continuity setting again.


A final inspection, a little more hot glue, and I plugged in the power for the big test. It works! Great!



I clipped a hole to allow for the cable to come out of the case, then hot glued everything like crazy so that strain on the cable won't pull on any of the circuit boards.


If you don't have a hot glue gun, they are cheaper than you'd expect.


I left 2 or 3 feet of cable since I didn't expect the battery to still hold a charge, meaning we'd only be able to use the monitor when plugged in anyway. I was wrong, but it's only a matter of time until the battery craps out, so I don't feel too bad about it.

That mistake I mentioned earlier... here are the details.

In my haste to fix the monitor I didn't take a close look at the rest of the board. I just assumed that the section of the board with the broken barrel power connector and battery connections would be the only way to power the board up. After the fix, when I was putting everything back together, I noticed a mini-B USB port on the opposite corner of the board! Unfortunately, the mini-B USB cable that I had just chopped up to complete this fix was the only one I had. This fix could have been as easy as cutting away the case around the mini-USB port and plugging in the cable, saving me the time and parts and making for a much cleaner fix. I may never know. I cut the case away anyway; maybe I'll find another mini-B USB cable somewhere and give it a try.


Saturday, November 2, 2013

Axe Handle

I found two forged steel double bit axe heads, pitted with rust and forgotten on a shelf in a barn. It's hard to find forged steel axes (now expensive and rare), so I took the time to clean up the smaller of the two and put a nice handle on it.


Most modern axes are cast. During casting, molten metal is poured into a clay or sand mould and left to cool before final shaping is done by removing excess metal through filing or machining. When casting, the metal in the axe ends up being homogeneous and of a type that can reasonably hold an edge: somewhat tough and somewhat brittle. Trade-offs are made. In contrast, forged axes are created by heating and pounding metal into shape. When done in mass production, as this axe head was, it's usually not a person doing the pounding but rather a giant mechanical hammer. Forging requires that the metal be somewhat malleable (i.e., tough) so it won't tear during the forging process. Since malleable steel doesn't hold an edge, a more brittle piece of steel with a higher carbon content is forge welded to the edges of blades. Metal also has "grain", much like wood does. Forging metal stretches and aligns this grain with the shape of the object. The forging process results in a nearly indestructible axe head that will keep a razor edge for a long time. I know, axe heads aren't necessarily known for being fragile, but there is realistically no chance of hairline fractures in the metal near the handle, since the center is a tougher metal. The axe can also be made thinner and lighter, since the center part won't crack if misused as a hammer or when accidentally striking a rock. For a more in-depth explanation of these concepts, check out this great video by Ben Krasnow on heat treating metal. I took some care to avoid unnecessary heating of the metal so as not to change the temper.

If you spend any time shopping for a serious axe for wilderness treks or timber work, you'll note they are all forged and cost several hundred dollars. Of course, if you don't care about any of this you can go to Home Depot with $30 in your pocket and walk away with a cast axe to thump away at stumps in your yard. It will be over-built to compensate for the brittle metal and won't keep an edge very long, but they're great for banging off rocks while cutting roots. I have one of those sitting in the corner of my garage covered in mud. This axe will lead a very different life. There is a maker's mark on one side of the axe and a Flint Edge logo on the other. After some online searching, I discovered that the latest this axe head might have been forged was 1949, but it could be a few decades older. All this considered, I was excited at the chance to bring this beauty back to life and own a quality, unique axe that's also a piece of history.


The lines near the edge of the blade (visible in the image to the right) are where the high carbon steel was forge welded to the rest of the axe head. Both pieces of steel were heated to near melting, covered in a flux (common household borax works), and pounded together. I like that the lines are so clear on this axe; they speak to what it is. The red oak should be better than the osage at resisting over-strikes, and I think the osage is a little lighter and more flexible than red oak. If I ever do any serious timbering it will be with a chainsaw, so I made the overall length more like a smaller forest axe as opposed to a long felling axe. It's 29 inches long. I think this is the perfect size to put in the truck for longer camping trips, fishing, or the canoe. Short enough to pack or to split kindling semi-safely, but long enough to get a good swing at something big.

The video starts off with me showing how to pound off an axe head without destroying it. I rarely see an older axe head that isn't deformed from hammer-on-blade contact. A safer and less destructive way is to drill it out and use scrap wood to protect the blade from the hammer. If you look closely you'll notice the head I remove is on backwards; I didn't record the original separation and it was easier to put it back on that way. Both of these axe heads have an aggressive reverse taper inside them. I ended up compensating for this by pounding wood wedges soaked in wood glue into the final axe. The glue-soaked wood slivers nicely filled up the top of the axe and filled out the rest of my shim slot. It was easily trimmed back with a saw blade and made uniform with a quick touch from an angle grinder with a wire wheel. This head will never come loose, move, or let moisture get under it. I made the metal shim by putting a taper and some grooves on a piece of scrap metal (forgot to record that). The process of banging on the head and the shims ended up destroying the bottom of the axe handle. That was another reason to put the extension on with the box joint jig. I think this concession wouldn't do for a real axe connoisseur or even for a serious axe, but I did the work and I know the joints are solid. I don't think it will break even under severe use, and it's far away from any critical stress points in the handle. Plus, I think the contrast makes it look good and will leave most woodworkers puzzling over how I was able to make such a clean joint with those dimensions. If I ever do this again I'm going to make the handle longer and wider than needed, then cut it to length and shape it after the head is on and the shims are in. Or spend more time custom fitting the head; I may have made this all go together a little too tightly.

Functionally and aesthetically, I think this turned out to be a great tool. Axes are most useful when chopping the sap-free wood of fall and winter while camping. The leaves are turning; maybe I'll get a chance to test it out soon.




Update: I put a handle on the other axe head today. I ended up not sanding it, just wire-wheeled the rust off. For ten dollars at Tractor Supply I scored a hickory axe handle with a hickory shim. It took me less than 10 minutes to fit it with a file, then glue and pound in the shim. Practical, but less fun than making one from scratch. After the glue dried, I took the time to sand off the polyurethane coating and cover the handle in danish oil. The reason for an oil finish is repairability as opposed to durability: if I nick the handle, I can just dab some oil on it and it's protected. Surface coatings are not as resilient or as easily repaired.

While sharpening I noticed that the steel was much harder than any of my other axe heads; the file didn't want to bite into the metal. A skittering file is a common way to diagnose hard steel. I ended up using my orbital sander to shape the initial edge. I'll wander into the garage and finish off the edge later this week. Here's a picture:


Saturday, October 26, 2013

Semi-Useless Machine

Last fall I bought an Arduino with intentions to build all kinds of great projects. Family, work, and other projects got in the way, but I did manage to knock out my version of a useless machine before getting distracted. I brought it down from the shelf today in an attempt to distract my son, making it useful for about five minutes. In the second half of the video he tries twisting the switch instead of just flipping it... he's getting more systematic when he encounters new things. Pretty soon he'll be opening doors.

It was a good project for running through the steps of getting code onto the system and interfacing with inputs and servos. There's a bug in my code that drains the battery and causes the servo to click every few minutes. I never fixed it, then lost the code, but I know how to avoid it next time. Also, I have a github account now, so code loss should be a thing of the past too. https://github.com/ericwhyne

Wednesday, October 16, 2013

Understanding Virtualization


@ericwhyne
Originally posted at  http://datatactics.blogspot.com/

The word virtual means to be very close to being something without actually being it. Virtualization is one of those terms thrown about in our industry that usually only gets virtually close to being understood. This article is a short primer on what virtualization is, and it describes some of the surrounding vocabulary you need in order to get a stronger grasp on its more nuanced aspects.

In order to better understand virtualization, let's first review a few basics about computers. Most modern computers have three fundamental things: persistent memory, temporary memory, and processors. The bits that make up a program and the bits that make up the data it operates on are stored and processed alongside each other in memory. This is a basic description of what's called the Von Neumann architecture, and there's a ton of other stuff augmenting this approach on modern computers. Programs are loaded from persistent memory into temporary memory for execution. The processor is pointed at the memory location where the program starts, then reads chunks of it and acts on the instructions contained therein. The actions usually involve the modification of other memory locations, like the memory that holds the image on your screen, or the memory location that sends information over your network card. Processors have something called an instruction set. This is the list of valid "go do this" instructions that a given processor will understand. When you compile software from human readable instructions, called source code, it's converted into machine readable instructions called machine code. It's all very simple, but it gets more complicated because processors don't all have the same instruction sets. If you remember one thing out of this paragraph, remember this: processors aren't all the same. Different families of processors are called processor architectures. Machine code compiled for one architecture won't run on any of the others, and this created some big hurdles to making software that can be widely distributed. Despite processors being different, we expect the software we write and compile to run across them without error.

Some clever software engineers came up with a solution to this compatibility problem. The general idea was to create a virtual machine. A virtual machine can be logically similar to a real machine, but in most cases it isn't; it's really just more software that acts as a translator from a standard set of instructions it knows to the various instruction sets of different physical computers. Software could then be written for this uniform virtual machine, and the virtual machine itself could be written for each of the processor architectures. Recall from the previous paragraph that when you compile software for a specific architecture, the result is called machine code. When you compile software to run on a virtual machine, the result is called byte code or portable code (p-code for short). Byte code is a bit more general purpose, since it's made to be interpreted by the uniform constructs of the virtual machine, and hence it runs slower.
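
Byte code is easy to see for yourself. As a quick illustration, here's a sketch using Python's standard dis module rather than Java, since it fits in a few lines:

#!/usr/bin/python
# Disassemble a small function into the byte code instructions
# that the Python virtual machine interprets.
import dis

def add(a, b):
    return a + b

dis.dis(add)  # prints instructions like LOAD_FAST, BINARY_ADD, RETURN_VALUE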

The piece of software that manages the virtual processor is called a virtual machine monitor or, in the hardware world, a hypervisor. The most famous virtual machine is probably Java's. The Java runtime spawns Java Virtual Machines (JVMs), which are the magic that makes the same compiled Java files run on multiple processor architectures or operating systems. In this way, Java achieved something called portability: you could write once and run anywhere, albeit originally it ran kind of slow. This problem was overcome by implementing a technique called Just In Time compiling, or JIT. JVMs that do JIT, rather than interpreting the byte code at run time, compile it to machine code, which runs much faster. But since compiling is often a time consuming task itself, there's a prioritization that has to take place. JIT is usually just done for the pieces of the code that run a lot during execution, inner loops and the like.

If you're like most people in the information technology industry, we've probably already drifted away from your preconceived notions of virtualization. Most people don't think of byte code vs. machine code when you mention virtualization. I'm about to clear things up a bit for you before I muddy things up again. Hypervisors come in two types. Type 1 hypervisors act as an arbitrator directly to the hardware. Type 2 hypervisors are what I think most people picture when they think of virtualization. In this case we go beyond acting as an abstract interface to an architecture and act as an arbitrator to a host operating system's resources. Type 2 hypervisors are how you can run a Windows virtual machine on a Linux host operating system, or vice versa. With the advent of multi-core machines, the ability to host multiple operating systems on one piece of hardware has been a boon for productivity. Virtualizing whole operating system instantiations has enabled configuration management at the OS level, allowing operating systems to be configured in a specific way and rapidly deployed to or removed from hardware. It's a lot easier to do configuration management at the machine level if you have a working machine to do it with. With the advent of multiple-processor and multiple-core machines, and processor architectures designed to support virtualization (I know, crazy, right?), it has become efficient to run multiple instances of operating systems simultaneously on the same piece of hardware.

There are no good pictures of virtualization.
Here's a picture of my kid playing in sand.
Another great side effect of both types of virtualization is called sandboxing. If you want to give an untrusted user or program resources, but want to be able to easily destroy any changes made and completely block them off from the rest of the system, virtualization can truly achieve that. Just think of a child playing in a sandbox. It doesn't matter what they do to the sand; you can always just set it back to whatever state it was in before they got in there; rake the sand, so to speak. The analogy stops when the child buries your car keys (I have a toddler at home), so forget about all the chaos normal children can undertake. Sandboxes are for untrusted users or programs, and they give you the ability to easily wipe away anything those users could possibly do, even if they have full permissions to the sandboxed environment.

Now that you think you understand it all, it's time to muddy things up a bit again. Having virtualized operating systems is great, but duplicating an entire operating system is still a lot of overhead. Creating a virtual machine that pretends to be actual hardware, with all its caveats and exceptions, is hard work for a computer. Wouldn't it be easier if we could just reuse the overhead work already done by the host operating system and have the tenant machines implement only the new stuff we want, while staying properly sandboxed? Enter the lightweight virtual systems, most notably Linux containers. With some clever modifications to the OS kernel (the part of the OS that interfaces with the hardware), lightweight virtual systems achieve something called resource isolation. This is still a topic with a lot of change happening, but I think in most production environments we are going to see containers overtaking traditional Type 2 hypervisors in most circumstances within a short timeframe.

If this article was helpful, don't forget to comment or share it. We will keep writing this stuff if people keep finding it useful.

Saturday, October 12, 2013

Drill press of danger


This is a picture of a drill press owned by a family friend in Pennsylvania. I always enjoyed looking at it, if only for the sheer engineering comedy of it. It puts the features of modern tools into perspective and shows just how far tool technology has come. This is what they had to hack with in the old days. It's one of those tools that you have to treat like a dangerous wild animal: avoid walking by it with loose clothing, take a step back before you hit the power switch, and make sure you have a clear escape path. Tools like this don't have "users", they have "potential victims".

Both the drill mechanism and the table are solidly mounted to the heavy steel I-beam. You'll note that the I-beam is not part of the building; it's all part of the drill press. You could run over it with a dump truck and it would maintain alignment. Unfortunately, getting your work piece within reach of the drill bit requires stacking whatever scrap wood you have laying around. This means the working surface is dependent on each of the boards being perfectly dimensioned, and there are no perfectly dimensioned pieces of wood within miles of this garage. So despite the drill press itself having the torsional stability of a battleship hull, you will always get a rickety and misaligned work surface. Not that this machine ever gets used that much.

It's driven by an electric motor mounted to the ceiling that spins the giant flywheel on its left side. The bit is lowered into the work piece by reaching up and turning the crank all the way at the top of the machine. This is good, because if it had a lever it would probably pull out the bent nails that secure the I-beam to the garage wall. The motor is separately anchored to the garage roof, but if the nails let loose I don't think the motor belt would support the weight of the machine even if it weren't dry-rotted. In either case the flywheel would ensure enough gear-spinning momentum to do some real damage to the victim (no longer a potential victim in this circumstance). The original motor was 1/4 horsepower. But unlike modern electric motors, this one actually looks like a quarter of a real horse and probably weighs as much. In a battle of torque I'd put it up against any modern 2hp motor. Thanks to the contribution of the dry-rotted belt and misaligned wheels, it even kind of made a galloping sound while running. Just turning this thing on is like traveling back in time to a more dangerous era. At some point the garage roof could no longer support it, and it was replaced with a smaller and more benign motor. But I still get a kick out of somebody having had the courage to actually stand under such a monster.

Using the machine requires stripping down to a snug-fitting t-shirt and making sure you are out of the line of fire of any lashing belts or shattering gears when you turn it on. On the left side, getting a shirt sleeve caught in the flywheel belt would whip your whole arm into the gear mechanism. On the right side, the gears seem designed to turn fingers into hamburger. I've inspected the machine several times for design clues indicating there was once a cover on these gears, but there are none. It was built this way, with the logo proudly stamped on the naked frame right under the hand-destroying gears... another indication this tool was designed before liability lawsuits were invented. Designing such a maiming device is one thing; having the courage (or lack of foresight?) to paste your contact information right on it is another level of audacity. I can just imagine some newly left-handed machinist swearing to do harm to the designer and setting off in their Model T to go find them. Maybe that happened, who knows. We can imagine.

Despite its shortcomings, I'm going to make sure this tool doesn't get thrown on a scrap pile during my lifetime. Maybe some misguided person will create a museum where people can gawk at dangerous power tools someday, and we'll send it off to a good home there. In any case, there will always be space in some garage in the family to keep this shrine of engineering farce for us to look at and contemplate.