Tip revision: 812c84f8971014964daa521f8e95daa8cc2b459f authored by Richard Smith on 28 May 2014, 22:38:53 UTC
Release v0.1.2.



A very simple declarative, headless scraping CLI.

`quickscrape` is a simple command-line tool for scraping websites using declarative scraper definitions and a headless browser.

This approach has some benefits compared to existing scraping systems:
- Declarative scraper definitions allow large collections of scrapers to be created with no programming.
- Scraping through a headless browser allows handing off page rendering complexity to the browser, where it belongs. The scraping software sees nice rendered HTML.

Our headless browsing is done by driving PhantomJS with CasperJS, via a Node bridge provided by SpookyJS.

At the moment, `quickscrape` is limited to scraping a single URL at a time.

**NOTE**: This is pre-alpha software. It works for some very specific test-cases and is under active development. Please wait until we're in beta to report issues.

## Installation

### Quick-start

Install NodeJS, PhantomJS and CasperJS, then install the module with: `npm install --global quickscrape`.

### Not-so-quick-start

#### OSX

With Homebrew, install dependencies:

    brew update && brew install node phantomjs
    brew install casperjs --devel

Install quickscrape:

`npm install --global quickscrape`

#### Debian

With apt-get, install dependencies:

`apt-get update && apt-get install nodejs phantomjs`

Install the final dependency and quickscrape:

`npm install --global casperjs quickscrape`
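
After installing, it can help to confirm that each tool is actually on your `PATH`. The following is a minimal sketch, not part of quickscrape: `check_deps` is a hypothetical helper name, and the command names passed to it are assumptions (on some Debian systems the Node binary may be called `nodejs` rather than `node`).

```shell
# Report which of the given commands are missing from PATH.
# check_deps is a hypothetical helper, not something quickscrape ships.
check_deps() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  # Exit status 0 if everything was found, 1 otherwise.
  return "$missing"
}
```

For example, `check_deps node phantomjs casperjs quickscrape` prints one `missing: ...` line per command that could not be found and returns a non-zero status if any are absent.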

## Documentation

Run `quickscrape --help` from the command line to get help:


    Usage: quickscrape [options]

      -h, --help            output usage information
      -V, --version         output the version number
      -u, --url <url>       URL to scrape
      -s, --scraper <path>  Path to scraper definition (in JSON format)
      -o, --output <path>   Where to output results (directory will created if it doesn't exist) [output]


You must provide scraper definitions in the format used in the ContentMine journal-scrapers.
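
Since scraper definitions are JSON files, one cheap pre-flight check is to confirm that a definition parses as JSON before passing it to `--scraper`. The sketch below assumes `node` is on your `PATH` (NodeJS is already a dependency); `is_valid_json` is a hypothetical helper name. It checks only JSON syntax, not whether the file follows the scraper definition format.

```shell
# Succeed if the file at $1 parses as JSON, fail otherwise.
# is_valid_json is a hypothetical helper; it checks syntax only,
# not conformance to the scraper definition format.
is_valid_json() {
  node -e 'JSON.parse(require("fs").readFileSync(process.argv[1], "utf8"))' "$1" 2>/dev/null
}
```

A possible use: `is_valid_json journal-scrapers/peerj.json || echo "not valid JSON" >&2`.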

## Examples

    # grab some pre-cooked scraper definitions
    git clone

    # scrape some journal PDFs
    quickscrape \
      --url \
      --scraper journal-scrapers/peerj.json \
      --output peerj-384

    quickscrape \
      --url \
      --scraper journal-scrapers/plosone.json \
      --output plos-waspfaces
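
Because `quickscrape` handles a single URL per run, scraping several pages means invoking it once per URL. The sketch below shows one way to do that from a shell, using only the `--url`, `--scraper` and `--output` options. The helper name `scrape_all`, the `QUICKSCRAPE` override variable, the one-URL-per-line input file and the numbered output directories are all assumptions made for illustration.

```shell
# Run quickscrape once per URL listed in a text file.
# Assumptions (not from the quickscrape docs): the helper name scrape_all,
# the QUICKSCRAPE override, one URL per line in the input file,
# and output directories named <prefix>-1, <prefix>-2, ...

QUICKSCRAPE="${QUICKSCRAPE:-quickscrape}"  # set to "echo" for a dry run

scrape_all() {
  url_file="$1"
  scraper="$2"
  out_prefix="$3"
  n=0
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines and lines starting with '#'.
    case "$url" in ''|'#'*) continue ;; esac
    n=$((n + 1))
    # Redirect stdin so the command cannot swallow the rest of the URL list.
    "$QUICKSCRAPE" --url "$url" --scraper "$scraper" --output "${out_prefix}-${n}" </dev/null \
      || echo "failed: $url" >&2
  done < "$url_file"
}
```

For example, `scrape_all urls.txt journal-scrapers/peerj.json peerj` would write results to `peerj-1`, `peerj-2`, and so on, where `urls.txt` is a hypothetical file you create yourself.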

## Contributing

We are not yet accepting contributions. If you'd like to help, please drop me an email and I'll let you know when we're ready for that.

## Release History

- ***0.1.0*** - initial version with simple one-element scraping
- ***0.1.1*** - multiple-member elements; clean exiting; massive speedup
- ***0.1.2*** - ability to grab text or HTML content of a selected node via special attributes `text` and `html`

## License
Copyright (c) 2014 Richard Smith-Unna  
Licensed under the MIT license.