Friday, January 23, 2015

Velociraptor

New project.

Velociraptor means "swift grab", and the name fits: this is a tool that uses your AWS EC2 account to fetch large amounts of web content quickly and in parallel.

https://github.com/ericwhyne/Velociraptor

Velociraptor takes a list of URLs and launches multiple AWS EC2 instances. Each instance runs wget in parallel to fetch the HTML and images into Web ARChive (.warc) files, which are then stored in an S3 bucket. GNU Parallel manages the parallelism, and the AWS CLI manages the AWS resources. This project is not associated with AWS and may support other cloud service providers in the future.
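To make the flow above more concrete, here is a rough Python sketch of the idea. It is not the project's actual code (see the repository for that). The function names, the round-robin split, and the file naming are all assumptions made for illustration. It only builds the command lines and does not launch any instances or run anything.

```python
def shard_urls(urls, num_instances):
    """Split the URL list round-robin, one sub-list per instance (hypothetical scheme)."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    return [urls[i::num_instances] for i in range(num_instances)]


def wget_warc_command(url, warc_name):
    """Build a wget command that saves a page and its images into a WARC file.

    --page-requisites fetches images and other assets the page needs.
    --warc-file sets the archive name; wget adds .warc.gz by default.
    """
    return ["wget", "--page-requisites", "--warc-file=" + warc_name, url]


def s3_upload_command(warc_name, bucket):
    """Build an AWS CLI command that copies the finished archive to an S3 bucket."""
    filename = warc_name + ".warc.gz"
    return ["aws", "s3", "cp", filename, "s3://" + bucket + "/" + filename]


if __name__ == "__main__":
    # Placeholder URLs and bucket name, for illustration only.
    urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
    for instance_id, shard in enumerate(shard_urls(urls, 2)):
        for n, url in enumerate(shard):
            name = "instance%d-page%04d" % (instance_id, n)
            print(" ".join(wget_warc_command(url, name)))
            print(" ".join(s3_upload_command(name, "my-example-bucket")))
```

On a real instance the wget commands would be handed to GNU Parallel rather than run one by one, so that many fetches are in flight at once.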