At Edit Huddle, we need a way to confirm that the users who attempt to register a blog are actually authorized to do so.
To this end, we issue each user a unique blog key upon registration. Whenever a reader visits a registered blog, this key is used to query our server for the Edit Huddle plugin, and it is this key that we scrape for to decide whether or not to activate the plugin on a blog.
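Concretely, judging by the checks in `containsKey` below, the embedded tag looks something like this (the domain and filename are placeholders, matching the ones in the code):

```
<script src="http://domain.com/script.js?key=tlwnsjgrv9cytnz1"></script>
```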
Naturally, we want our scraper to be fast. While most of our site uses Django, I chose to write this one-off scraping utility in Haskell.
`scrape-for-key` is written in Haskell using the Haskell XML Toolbox (HXT) and `curl`. The code is 106 lines (including `getOpts` stuff) hacked together from some HXT examples. Primarily, we want to:

1. fetch the page,
2. select the `src` attributes of any `script` elements in the document (see `selectScriptTagSrc`), and
3. check whether any of those URLs contains our blog key (see `containsKey`).
```haskell
import Data.Char (toUpper)
import Data.List (isInfixOf, isSuffixOf)
import Network.URI (parseURI, uriAuthority, uriPath, uriQuery, uriRegName)
import Text.XML.HXT.Core

-- Match elements by tag name, case-insensitively.
atTagCase tag = deep (isElem >>> hasNameWith ((== tag') . upper . localPart))
  where tag'  = upper tag
        upper = map toUpper

selectScriptTagSrc = atTagCase "script" >>> getAttrValue "src" -- Step 2

containsKey key src = case parseURI src of -- Step 3
  Nothing  -> False
  Just uri -> case uriAuthority uri of
    Nothing      -> False
    Just uriAuth -> "domain.com" `isSuffixOf` uriRegName uriAuth
                 && "script.js" `isSuffixOf` uriPath uri
                 && ("key=" ++ key) `isInfixOf` drop 1 (uriQuery uri)
                    -- `drop 1` trims the leading '?' and, unlike `tail`, won't crash when there is no query string
```
Our `main` handles step 1 above, where we fetch the page before threading it through steps 2 and 3:
```haskell
-- Step 1: fetch the page, following up to five redirects.
(_, body) <- curlGetString blogURL [CurlFollowLocation True, CurlMaxRedirs 5]
tags <- runX (parseHTML body >>> selectScriptTagSrc)
-- Exit successfully as soon as any script src contains the key.
forM_ tags (\src -> if containsKey blogKey src then exitWith ExitSuccess else return ())
exitFailure
```
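This snippet leans on two things not shown: the module's imports and a `parseHTML` helper, which isn't an HXT export. A plausible reconstruction, assuming `parseHTML` is a thin wrapper over HXT's `readString`:

```haskell
import Control.Monad (forM_)
import Network.Curl (CurlOption (CurlFollowLocation, CurlMaxRedirs), curlGetString)
import System.Exit (ExitCode (ExitSuccess), exitFailure, exitWith)
import Text.XML.HXT.Core

-- Assumed helper: parse the fetched markup leniently as HTML.
parseHTML :: String -> IOStateArrow s b XmlTree
parseHTML = readString [withParseHTML yes, withWarnings no]
```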
I used the Perl utility `dumbbench` to benchmark both `scrape-for-key` and `curl`. This way I can estimate how much time the HTML parsing and searching take compared to fetching the page itself.
First, `scrape-for-key`:
```
$ dumbbench -v -p 0.001 -- bin/scrape-for-key -b http://edithuddle.com/blog/?p=27 -k tlwnsjgrv9cytnz1
cmd: Running initial timing for warming up the cache...
cmd: Running 20 initial timings...
cmd: Iterating until target precision reached...
cmd: Ran 542 iterations (52 outliers).
cmd: Rounded run time per iteration: 2.0316e-01 +/- 2.1e-04 (0.1%)
cmd: Raw: 0.203163616832649 +/- 0.00020541722587663
```
Then, `curl`:
```
$ dumbbench -v -p 0.001 -- curl http://edithuddle.com/blog/?p=27 -s -o /dev/null
cmd: Running initial timing for warming up the cache...
cmd: Running 20 initial timings...
cmd: Iterating until target precision reached...
cmd: Ran 759 iterations (8 outliers).
cmd: Rounded run time per iteration: 1.8028e-01 +/- 1.8e-04 (0.1%)
cmd: Raw: 0.180279872073403 +/- 0.000181318786684558
```
The average execution times differ by roughly 23 milliseconds (about 203 ms versus 180 ms per run). Since the roughly 180 milliseconds of page fetch time is paid either way, I think an overhead of 23 milliseconds for parsing and searching the HTML will suffice for now.