Saturday, November 16, 2013

Anonymous Web Scraping

Web scraping is the act of retrieving data from a web page with automated software. With a few lines of Perl or Python you can gather massive amounts of data, or write scripts that keep an eye on web pages and alert you to changes.

There are a few reasons for not wanting this traffic to look like it's coming from your computer. Sometimes web admins or intrusion prevention systems will label automated traffic as malicious and respond by blocking your IP address. Other times, such as when monitoring your own systems, you just want to see what happens when you access certain pages from elsewhere on the Internet. Scraping, especially anonymously, is certainly something that can be abused, but I think there are enough good reasons to do it that I'm comfortable writing about it.

The oldest trick I know for remotely scraping web pages is to purchase or own a shell account. These still exist, mostly for hosting IRC bots; here's a list of 92 shell providers with locations in 13 countries: http://www.egghelp.org/shells.htm. Somewhere along the way, IaaS providers such as Amazon Web Services and Rackspace made it easy to provision remote machines in an availability zone of your choosing. They're as convenient as buying a shell, but more powerful, since you have root access to your own operating system. Companies such as http://80legs.com/ have made it their business to help people crawl the web and gather data. Various anonymizer proxies are available, but they aren't worth much because they get blacklisted quickly and often have terms of use prohibiting scraping or bots. Lastly, if you're researching this topic, you need to be aware of illegal botnets. They're criminal, and of course I don't advocate using them, but some people make a hobby of taking over large numbers of home computers and putting them to work, or just being creepy.

Various ways to anonymize scraping:
1. Shell account
2. IaaS-provisioned server
3. Scraping companies
4. Anonymizer proxies
5. Being a creep

These are all still viable ways of gathering data without exposing your own IP address. But I think I've found an easy, powerful, and convenient approach that sidesteps the downsides of each of the options listed above. One of my projects required me to scrape some data fairly regularly, so I decided to start from scratch and see if I could eschew these traditional approaches and use the Tor network to avoid getting tagged and blocked by an intrusion prevention system.

From their website: "Tor is free software and an open network that helps you defend against traffic analysis, a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security."

I've been toying with Tor for the last few years. It's been useful for getting around bad corporate VPN or routing problems. More than once I've been stuck in a corporate conference room or venue with poorly configured routing or bizarre filtering. Not a problem: Tor has always been able to get my traffic out to the open Internet, even when the corporate networks have blacklisted commercial proxy providers. When most people use Tor, all they see is the Tor Browser Bundle and the Vidalia interface. Under the hood, these components talk to a SOCKS proxy, which is your interface to the greater Tor network. By default, if you're running Tor on your machine this proxy is exposed on port 9050 and can be accessed by anything that speaks SOCKS. This is good news, because SOCKS is pretty standard and libraries exist for most programming languages.

For our purposes we're not concerned with extreme privacy. If you are, please go to Tor's web site and download the latest version. We just want to use the onion routing network, so Ubuntu's packages are fine.

sudo apt-get install tor
sudo /etc/init.d/tor start

Your computer should now be connected to the Tor network and have a SOCKS proxy listening on port 9050. If you're not behind a software or hardware firewall, you might want to lock down access to that port (one way to do this is sketched below). Then you should go buy a cheap hardware firewall and install it. After that, set aside time to re-evaluate the rest of your security practices and contemplate your approach to life in general. If you thought it was ok to have an unprotected computer connected to the Internet, you're probably going wrong elsewhere.

ss -a | grep 9050
tcp    LISTEN     0      128          127.0.0.1:9050                  *:*
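
Here's a minimal sketch of locking the proxy down, assuming the stock Ubuntu package, which reads its configuration from /etc/tor/torrc. Rather than authentication in the strict sense, this keeps the proxy bound to loopback and rejects SOCKS connections from everywhere else:

# /etc/tor/torrc
# Listen on loopback only (this is the default)
SocksPort 127.0.0.1:9050
# Only accept SOCKS connections from the local machine
SocksPolicy accept 127.0.0.1
SocksPolicy reject *

Restart Tor after editing the file for the change to take effect.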


We'll use Python to take advantage of that proxy. On a default Ubuntu 13.10 system there's only one extra module that needs to be installed. Here's the command to get it.

sudo apt-get install python-socksipy

Any scripts you've written with Python and urllib2 are pretty easy to reuse anonymously. Here's the simplest working example.

#!/usr/bin/python
# Route all socket connections through Tor's local SOCKS proxy.
import socks
import socket
import urllib2

# SOCKS4a via Tor on 127.0.0.1:9050; the final True asks the proxy to
# resolve DNS remotely, so lookups don't leak from this machine.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", 9050, True)
# Monkey-patch the socket module so urllib2 uses the proxied socket.
socket.socket = socks.socksocket

# check.torproject.org reports whether the request arrived via Tor.
url = 'https://check.torproject.org/'
data = urllib2.urlopen(url).read()
print data

By building out the above code you can anonymously scrape. Build an array of URLs, then loop over it, and you're off to data heaven. But a single Tor client will send all of your requests out through whichever exit node its current circuit uses. The way around this is to launch multiple connections to the Tor network. Tor can be started from the command line as shown below; you'll need to specify a unique control port, SOCKS port, PID file, and DataDirectory for each instance. It's trivial to write a script that does this 5, 10, or even 100 times with unique values (a sketch follows the command below). Proceed with caution: I've noticed that Tor nodes can chew up a lot of CPU time. By using this trick, and varying which local Tor instance each request goes through, you can distribute your traffic across <n> Internet IP addresses pretty easily.

tor --RunAsDaemon 1 --ControlPort <control port> --PidFile tor<n>.pid --SocksPort <socks port> --DataDirectory data/tor<n>
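
Here's a minimal sketch of that idea in Python, assuming the tor binary is on your PATH and that ports 9060 and up (SOCKS) and 9160 and up (control) are free; the instance count, port bases, and directory names are just placeholders to adjust for your own setup.

#!/usr/bin/python
# Sketch: launch several Tor clients, then spread requests across them.
import os
import subprocess
import time
import socks
import socket
import urllib2

NUM_INSTANCES = 5
SOCKS_BASE = 9060    # instance n listens for SOCKS on SOCKS_BASE + n
CONTROL_BASE = 9160  # instance n gets control port CONTROL_BASE + n

# Start one Tor client per instance, mirroring the command line above.
for n in range(NUM_INSTANCES):
    datadir = 'data/tor%d' % n
    if not os.path.isdir(datadir):
        os.makedirs(datadir)
    subprocess.call(['tor', '--RunAsDaemon', '1',
                     '--ControlPort', str(CONTROL_BASE + n),
                     '--PidFile', 'tor%d.pid' % n,
                     '--SocksPort', str(SOCKS_BASE + n),
                     '--DataDirectory', datadir])

# Crude: give the clients time to bootstrap before using them.
time.sleep(60)

socket.socket = socks.socksocket  # as in the single-proxy example

urls = ['https://check.torproject.org/']  # replace with your own list

# Round-robin requests across the instances so traffic leaves the Tor
# network from several different exit nodes instead of just one.
for i, url in enumerate(urls):
    port = SOCKS_BASE + (i % NUM_INSTANCES)
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", port, True)
    print urllib2.urlopen(url).read()

A more robust version would confirm that each instance has finished bootstrapping (by tailing its log or querying its control port) before sending any requests, rather than just sleeping.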

Please share the link to this page if you found it useful.

Updated Dec 2014:
Although it is possible to scrape websites using Tor, please don't engage in behavior that would get the exit nodes blacklisted. There are only a finite number of them and they are a valuable resource for the community. As another option, I've written a very lightweight HTTP scraping proxy which can be downloaded from my GitHub account here: https://github.com/ericwhyne/http-ricochet. If your scraping behavior may end up in blacklists, it might be best to use something like this on ephemeral hosts, so you don't damage community resources.