How To: Scrape Web Pages

Dotan J. Nahum edited this page Oct 12, 2013 · 4 revisions

What's special about scraping Web pages, compared to log processing?

I/O Bound Jobs

It is I/O bound. I/O bound means that the job at hand depends not only on your machine, and specifically your CPU; most commonly it means that you depend on other people's machines. And, well, other people's machines suck.

  • Latency
  • Transfer speed
  • Failures
  • Corruption

All these influence how your job will be executed.
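Here's a minimal, stdlib-only sketch of defending against those failure modes with explicit timeouts and a bounded retry loop. It uses plain `Net::HTTP` rather than open-uri, and the timeout values and retry count are illustrative assumptions, not recommendations:

```ruby
require 'net/http'
require 'uri'

# Fetch a URL with explicit timeouts and a bounded retry loop.
# Returns the body on success, nil after exhausting retries.
def fetch_with_retries(url, retries: 2)
  uri = URI(url)
  attempts = 0
  begin
    attempts += 1
    Net::HTTP.start(uri.host, uri.port,
                    use_ssl: uri.scheme == 'https',
                    open_timeout: 2,    # cap connection latency
                    read_timeout: 5) do |http| # cap transfer time
      res = http.get(uri.request_uri)
      return res.body if res.is_a?(Net::HTTPSuccess)
      nil # non-2xx: give up rather than retry
    end
  rescue Timeout::Error, SystemCallError, IOError
    retry if attempts <= retries
    nil # connection refused, reset, timed out... after all retries
  end
end
```

`Net::OpenTimeout` and `Net::ReadTimeout` cover latency and transfer speed, `SystemCallError`/`IOError` cover outright failures; corruption you still have to detect yourself (e.g. by validating what you parsed).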

Scaling I/O

If you used a single thread or a single process, it would naively block while waiting for I/O, or fail, or both block and fail and present you with corrupt data :(.

If each request took 1 second, then you have a 1 req/s pipeline on your hands.

In Ruby, there's no silver bullet other than making more pipelines (let's ignore evented frameworks for now) - more threads or more processes - and Sneakers is designed to scale both.
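To see why more threads help an I/O-bound job, here's a plain-Ruby sketch (no Sneakers involved) of fanning work out over a thread pool fed by a Queue. `simulated_fetch` is a stand-in for the real HTTP call:

```ruby
URLS = (1..20).map { |i| "http://example.com/page-#{i}" }

jobs    = Queue.new
results = Queue.new
URLS.each { |u| jobs << u }

# Stand-in for a blocking HTTP request.
simulated_fetch = ->(url) { sleep 0.01; "title of #{url}" }

# 5 threads draining the queue concurrently; while one thread
# blocks on I/O, the others keep pulling jobs.
workers = 5.times.map do
  Thread.new do
    # pop(true) is non-blocking; it raises ThreadError when empty
    while (url = (jobs.pop(true) rescue nil))
      results << simulated_fetch.call(url)
    end
  end
end
workers.each(&:join)

puts results.size  # all 20 pages processed
```

With 5 threads the 20 sleeps overlap instead of queueing up behind one another - which is exactly the effect Sneakers gives you, plus the process-level scaling on top.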

Building a Scraper

So, same as in the How To: Do Log Processing example, let's outline a worker:

require 'sneakers'
require 'open-uri'
require 'nokogiri'
require 'sneakers/metrics/logging_metrics'

class WebScraper
  include Sneakers::Worker
  from_queue :web_pages

  def work(msg)
    # msg is the page URL. URI.open is the open-uri entry point
    # that still works on modern Rubies (Kernel#open no longer
    # accepts URLs as of Ruby 3.0).
    doc = Nokogiri::HTML(URI.open(msg))
    page_title = doc.css('title').text

    worker_trace "Found: #{page_title}"
    ack!
  end
end

However, since this worker does I/O, it will by default open up 25 threads for us. What if we want more?

require 'sneakers'
require 'open-uri'
require 'nokogiri'
require 'sneakers/metrics/logging_metrics'

class WebScraper
  include Sneakers::Worker
  from_queue :web_pages,
             :threads => 50,
             :prefetch => 50,
             :timeout_job_after => 1

  def work(msg)
    # msg is the page URL; URI.open as above.
    doc = Nokogiri::HTML(URI.open(msg))
    page_title = doc.css('title').text

    worker_trace "Found: #{page_title}"
    ack!
  end
end

This means we set up 50 threads that will all do I/O for us at the same time. A good practice is to set up a prefetch policy against RabbitMQ of at least the number of threads involved.
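As far as I can tell from the Sneakers configuration API, the same options can also be set globally via `Sneakers.configure`, with `from_queue` overriding them per queue; treat the exact key names below as an assumption to verify against your Sneakers version:

```ruby
require 'sneakers'

# Hypothetical global defaults; per-queue options passed to
# from_queue override these. Verify the key names against the
# Sneakers version you're running.
Sneakers.configure(
  :threads           => 50, # worker threads per process
  :prefetch          => 50, # RabbitMQ prefetch, >= threads
  :timeout_job_after => 1   # seconds before a job is failed
)
```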

We also want to time out super-fast; a timeout of 1 second means a thread can be held up for at most 1 second, so this whole thing will generate at worst 50 req/s (worst case being all jobs failing and timing out on us).
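The back-of-envelope arithmetic: with N threads and a per-job timeout of T seconds, even the worst case (every thread stuck for the full timeout, then recycled) turns over N/T jobs per second:

```ruby
threads     = 50
timeout_sec = 1.0

# Worst case: every thread hangs for the full timeout before
# being recycled, so throughput floors at threads / timeout.
worst_case_throughput = threads / timeout_sec
puts worst_case_throughput  # => 50.0 req/s
```

Tightening the timeout raises the floor; adding threads raises it too, up to whatever RabbitMQ, your bandwidth, and the remote servers will bear.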

Resource Starvation

If you are thinking of adding a persistence layer here (for example, for saving the page titles), note that because Sneakers opens up so many threads and so many processes, your database driver may not keep up. You'll need to make it accept more concurrency - check Performance for more on this.
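One way to keep 50 scraper threads from overwhelming a database is to gate access through a small pool. Here's a stdlib-only sketch using SizedQueue as the pool (the pool size of 5 and `FakeDbConnection` are illustrative assumptions - in practice you'd reach for something like the connection_pool gem):

```ruby
# A tiny connection pool: SizedQueue#pop blocks when the queue is
# empty, so at most POOL_SIZE threads hold a connection at once.
POOL_SIZE = 5
FakeDbConnection = Struct.new(:id) do
  def save(title)
    "saved #{title} via conn #{id}"
  end
end

pool = SizedQueue.new(POOL_SIZE)
POOL_SIZE.times { |i| pool << FakeDbConnection.new(i) }

def with_connection(pool)
  conn = pool.pop        # blocks until a connection is free
  yield conn
ensure
  pool << conn if conn   # always return the connection
end

saved = Queue.new
threads = 50.times.map do |n|
  Thread.new do
    with_connection(pool) { |conn| saved << conn.save("page-#{n}") }
  end
end
threads.each(&:join)

puts saved.size  # all 50 titles saved, never more than 5 at a time
```

The scraper threads still run at full tilt; they just queue up briefly at the pool instead of piling 50 concurrent writes onto the driver.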