There are a few reasons for not wanting this traffic to look like it's coming from your computer. Sometimes web admins or intrusion prevention systems will label automated traffic as malicious and respond by blocking your IP address. Other times, such as in system monitoring, you just want to see what happens when you access certain pages from elsewhere on the Internet. Scraping, especially anonymously, is certainly something that can be abused. But I think there are enough good reasons to do it that I'm comfortable writing about it.
Various ways to anonymize scraping:
1. Shell account
2. IAAS provisioned server
3. Scraping companies
4. Anonymizer proxies
5. Being a creep
These are still viable ways of gathering data without using your own IP address. But I think I've found an easy, powerful, and convenient way that sidesteps the downsides of each of the approaches listed above. One of my projects required me to scrape some data fairly regularly, so I decided to start from scratch and see if I could eschew these traditional approaches and use the Tor network to avoid getting tagged and blocked by an intrusion prevention system.
From their website: "Tor is free software and an open network that helps you defend against traffic analysis, a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security."
Most people interact with Tor through the browser bundle and the Vidalia interface. Under the hood, these components talk to a SOCKS proxy, which is your interface to the greater Tor network. By default, if you're running Tor on your machine this proxy is exposed on port 9050 and can be used by anything that speaks SOCKS. This is good news, because SOCKS is a well-established protocol and client libraries exist for most programming languages.
For our purposes we're not concerned with extreme privacy. If you are, please go to Tor's website and download the latest version. We just want to use the onion routing network, so Ubuntu's packages are fine.
sudo apt-get install tor
sudo /etc/init.d/tor start
Your computer should now be connected to the Tor network, with a SOCKS proxy listening on port 9050. If you're not behind a software or hardware firewall, you might want to enable some sort of authentication on it. Then you should go buy a cheap hardware firewall and install it. After that, set aside time to re-evaluate the rest of your security practices and contemplate your approach to life in general. If you thought it was OK to have an unprotected computer connected to the Internet, you're probably going wrong elsewhere.
ss -a | grep 9050
tcp LISTEN 0 128 127.0.0.1:9050 *:*
We'll use Python to take advantage of that proxy. Only one module needs to be installed beyond a default Ubuntu 13.10 system. Here's the command to get it.
sudo apt-get install python-socksipy
Any scripts you've written with Python and urllib2 are pretty easy to reuse anonymously. Here's the simplest working example.
import socks
import socket
import urllib2

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", 9050, True)
socket.socket = socks.socksocket
url = 'https://check.torproject.org/'
data = urllib2.urlopen(url).read()
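Building on that example, here's a sketch of how you might pair a list of URLs with a rotation of SOCKS ports, one per Tor instance (how to launch several instances is covered below). The ports and URLs here are placeholders of my own, not values from a real setup; swap in whatever ports your instances actually listen on.

```python
import itertools

socks_ports = [9061, 9062, 9063]  # placeholder: one SOCKS port per Tor instance
urls = ["http://example.com/a", "http://example.com/b",
        "http://example.com/c", "http://example.com/d"]

# Pair each URL with the next port in a round-robin rotation.
port_cycle = itertools.cycle(socks_ports)
plan = [(url, next(port_cycle)) for url in urls]

# Each (url, port) pair is then fetched exactly as in the example above:
#   socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS4, "127.0.0.1", port, True)
#   socket.socket = socks.socksocket
#   data = urllib2.urlopen(url).read()
```

The round-robin pairing keeps the traffic spread evenly across your instances without any bookkeeping beyond the iterator.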
By building out the above code you can scrape anonymously. Build an array of URLs, loop over it, and you're off to data heaven. But your exit node will always present the same IP address. The way around this is to launch multiple connections to the Tor network. Tor can be started from the command line as shown below; you'll need to specify a unique control port, SOCKS port, PID file, and DataDirectory for each instance. It's trivial to write a shell script that does this 5, 10, or even 100 times with unique values. Proceed with caution: I've noticed that Tor nodes can chew up a lot of CPU time. With this trick, varying which local instance your scraper connects through lets you distribute your traffic across <n> Internet IP addresses pretty easily.
tor --RunAsDaemon 1 --ControlPort <control port> --PidFile tor<n>.pid --SocksPort <socks port> --DataDirectory data/tor<n>
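One way such a shell script might look is sketched below. The port offsets and the data/PID paths are my own choices, not from the original setup. As written it only prints the commands it would run; remove the leading `echo` to actually start the daemons.

```shell
#!/bin/sh
# Launch (or here, just print) N Tor instances on sequential ports.
N=3
n=1
while [ "$n" -le "$N" ]; do
    socks_port=$((9060 + n))    # 9061, 9062, ... avoids the default 9050
    control_port=$((9160 + n))  # keep control ports out of the SOCKS range
    mkdir -p "data/tor$n"
    # Remove `echo` to actually launch this instance.
    echo tor --RunAsDaemon 1 --ControlPort "$control_port" \
        --PidFile "tor$n.pid" --SocksPort "$socks_port" \
        --DataDirectory "data/tor$n"
    n=$((n + 1))
done
```

Each instance then exposes its own SOCKS proxy, so your scraper can pick a port per request.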
Please share the link to this page if you found it useful.
Updated Dec 2014:
Although it is possible to scrape websites using Tor, please don't engage in behavior that would get the exit nodes blacklisted. There are only a finite number of them and they are a valuable resource for the community. As another option, I've written a very lightweight HTTP scraping proxy, which can be downloaded from my GitHub account here: https://github.com/ericwhyne/http-ricochet. If your scraping behavior may end up in blacklists, it might be best to use something like this on ephemeral hosts so that you don't damage community resources.