
Scraping

08 February 2012

At Edit Huddle, we need a way to confirm that the users who attempt to register a blog are actually authorized to do so.

To this end, we issue each user a unique blog key upon registration. Whenever a reader visits the blog, this key is used to query our server for the Edit Huddle plugin. We scrape for this key to determine whether or not to activate the Edit Huddle plugin on a blog.
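
In practice, this means a registered blog's HTML contains a script tag along the lines of `<script src="http://domain.com/script.js?key=abc123"></script>` (the key value here is made up for illustration; the domain and script name match the checks in containsKey below).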

Naturally, we want our scraper to be fast. While most of our site uses Django, I chose to write this one-off scraping utility in Haskell.

Implementation

scrape-for-key is written in Haskell using the Haskell XML Toolbox (HXT) and curl.

The code is 106 lines (including the GetOpt boilerplate), hacked together from some HXT examples. Primarily, we want to:

  1. Fetch the page (parsing it as HTML)
  2. Filter the src attributes of any script elements in the document (see selectScriptTagSrc)
  3. Exit with success if one of these calls our script with the appropriate key; otherwise, fail (see containsKey)

-- Select every element (at any depth) whose tag name matches `tag`, case-insensitively
atTagCase tag = deep (isElem >>> hasNameWith ((== tag') . upper . localPart))
  where tag' = upper tag
        upper = map toUpper

selectScriptTagSrc = atTagCase "script" >>> getAttrValue "src" -- Step 2

containsKey key src = case parseURI src of -- Step 3
  Nothing -> False
  Just uri -> case uriAuthority uri of
    Nothing -> False
    Just uriAuth -> "domain.com" `isSuffixOf` uriRegName uriAuth
                 && "script.js" `isSuffixOf` uriPath uri
                 && ("key="++key) `isInfixOf` tail (uriQuery uri) -- `tail` trims the leading '?' from the uriQuery

Our main handles step 1 above, where we fetch the page before threading it through steps 2 and 3:

(_, body) <- curlGetString blogURL [CurlFollowLocation True, CurlMaxRedirs 5]
tags <- runX (parseHTML body >>> selectScriptTagSrc)
forM_ tags (\src -> if containsKey blogKey src then exitWith ExitSuccess else return ())
exitFailure
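
For reference, here is a self-contained sketch that stitches the pieces together. parseHTML is not shown in the excerpts above, so I assume it wraps HXT's readString with HTML parsing enabled; the hard-coded URL and key stand in for the GetOpt handling:

import Data.Char (toUpper)
import Data.List (isInfixOf, isSuffixOf)
import Network.Curl (CurlOption (CurlFollowLocation, CurlMaxRedirs), curlGetString)
import Network.URI (parseURI, uriAuthority, uriPath, uriQuery, uriRegName)
import System.Exit (exitFailure, exitSuccess)
import Text.XML.HXT.Core

-- Assumed definition: parse the fetched body leniently as HTML
parseHTML = readString [withParseHTML yes, withWarnings no]

-- Case-insensitive element selector (as above)
atTagCase tag = deep (isElem >>> hasNameWith ((== tag') . upper . localPart))
  where tag' = upper tag
        upper = map toUpper

selectScriptTagSrc = atTagCase "script" >>> getAttrValue "src"

containsKey key src = case parseURI src of
  Nothing -> False
  Just uri -> case uriAuthority uri of
    Nothing -> False
    Just uriAuth -> "domain.com" `isSuffixOf` uriRegName uriAuth
                 && "script.js" `isSuffixOf` uriPath uri
                 && ("key=" ++ key) `isInfixOf` drop 1 (uriQuery uri)

main :: IO ()
main = do
  let blogURL = "http://edithuddle.com/blog/?p=27" -- stand-ins for the real
      blogKey = "tlwnsjgrv9cytnz1"                 -- command-line arguments
  (_, body) <- curlGetString blogURL [CurlFollowLocation True, CurlMaxRedirs 5]
  srcs <- runX (parseHTML body >>> selectScriptTagSrc)
  if any (containsKey blogKey) srcs then exitSuccess else exitFailure

Using `any` here behaves like the forM_ loop above: thanks to laziness, the program exits successfully as soon as a matching src is found, and fails otherwise.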

Performance

I used the Perl utility dumbbench to time both scrape-for-key and a plain curl fetch of the same page. The difference between the two estimates how long the HTML parsing and searching take relative to fetching the page.

First scrape-for-key:

$ dumbbench -v -p 0.001 -- bin/scrape-for-key -b http://edithuddle.com/blog/?p=27 -k tlwnsjgrv9cytnz1
cmd: Running initial timing for warming up the cache...
cmd: Running 20 initial timings...
cmd: Iterating until target precision reached...
cmd: Ran 542 iterations (52 outliers).
cmd: Rounded run time per iteration: 2.0316e-01 +/- 2.1e-04 (0.1%)
cmd: Raw:                            0.203163616832649 +/- 0.00020541722587663

Then curl:

$ dumbbench -v -p 0.001 -- curl http://edithuddle.com/blog/?p=27 -s -o /dev/null
cmd: Running initial timing for warming up the cache...
cmd: Running 20 initial timings...
cmd: Iterating until target precision reached...
cmd: Ran 759 iterations (8 outliers).
cmd: Rounded run time per iteration: 1.8028e-01 +/- 1.8e-04 (0.1%)
cmd: Raw:                            0.180279872073403 +/- 0.000181318786684558

Analysis

The average execution times differ by approximately 23 milliseconds. Given that fetching the page alone takes approximately 180 milliseconds, a 23-millisecond overhead for parsing and searching the HTML is acceptable for now.
