[racket] data mining business information on web sites w/Racket

From: Neil Van Dyke (neil at neilvandyke.org)
Date: Fri Mar 18 17:47:34 EDT 2011

Oh yeah, I forgot to mention backing off and such.

When I first did Web crawling of particular sites around 10 years ago 
with PLT Scheme, I had to schedule my visits with delays so as not to 
abuse the sites.  There was a random component to the scheduling, too.
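
As a rough illustration of delayed visits with a random component, here 
is a minimal sketch in Python (not the original PLT Scheme code).  The 
names `jittered_delay` and `polite_visits`, the 30-second base delay, 
and the 15-second jitter are all assumptions made up for this example, 
and `fetch` stands in for whatever actually retrieves a page.

```python
import random
import time


def jittered_delay(base_seconds, jitter_seconds, rng=random):
    """Return the base delay plus a random extra in [0, jitter_seconds]."""
    return base_seconds + rng.uniform(0.0, jitter_seconds)


def polite_visits(urls, fetch, base_seconds=30.0, jitter_seconds=15.0,
                  sleep=time.sleep, rng=random):
    """Fetch each URL in turn, pausing a randomized interval between visits.

    `fetch` is a hypothetical callable that takes a URL and returns a page.
    `sleep` and `rng` are injectable so the schedule can be tested
    without actually waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No delay before the first visit; a jittered delay before the rest.
            sleep(jittered_delay(base_seconds, jitter_seconds, rng))
        results.append(fetch(url))
    return results
```

Passing `sleep` and `rng` as parameters is only a convenience for 
testing; the appropriate delays depend entirely on the site.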

Note that an off-the-shelf tool will not necessarily work 
satisfactorily for scraping.  In one case, I also had to emulate the 2- 
or 3-click path that a human user would take through the site to get to 
the information, because their URLs (and the info behind them!) would 
change, potentially a few times a minute.  (Any anti-crawler mechanisms 
on this particular site were intended to thwart content-stealing 
competitors, not me.)  So, if you find that you need natural sequencing 
of time-sensitive HTTP requests within your scheduling, like I did, and 
you can think of a really easy way to do that, you might find it more 
expedient to hack up exactly what you need than to evaluate a bunch of 
off-the-shelf frameworks to see whether any of them will do it.
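
One way to picture that kind of sequencing is a function that re-derives 
each URL from the page just fetched, instead of storing URLs that may 
already be stale.  The following is a minimal sketch in Python under 
stated assumptions: `follow_click_path` and `link_after` are 
hypothetical helpers, `fetch` is a hypothetical callable that returns a 
page body for a URL, and the regular expression assumes very simple 
`<a href="...">label</a>` markup that a real site will likely not match.

```python
import re


def link_after(label):
    """Build a step that finds the href of the first link with this exact text.

    Assumes simple markup of the form <a href="URL">label</a>.
    """
    pattern = re.compile(r'<a href="([^"]+)">' + re.escape(label) + r'</a>')

    def step(page):
        match = pattern.search(page)
        return match.group(1) if match else None

    return step


def follow_click_path(fetch, start_url, steps):
    """Walk a fixed sequence of "clicks" and return the final page body.

    Each step takes the body of the page just fetched and returns the
    next URL, so short-lived URLs are used right after being discovered.
    """
    page = fetch(start_url)
    for find_next_url in steps:
        next_url = find_next_url(page)
        if next_url is None:
            raise LookupError("expected link not found; layout may have changed")
        page = fetch(next_url)
    return page
```

For example, with a made-up in-memory "site":

```python
fake_site = {
    "/": '<a href="/s/ab12">Listings</a>',
    "/s/ab12": '<a href="/s/ab12/item9">Details</a>',
    "/s/ab12/item9": "phone: 555-0100",
}
final_page = follow_click_path(fake_site.__getitem__, "/",
                               [link_after("Listings"), link_after("Details")])
```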

Noel Welsh wrote at 03/18/2011 05:06 PM:
> It should be fine. Hundreds of sites is not really that many. You just
> need to have backoffs etc. to avoid getting blacklisted. Using sync
> and friends would make implementing this easy.
>   

-- 
http://www.neilvandyke.org/


Posted on the users mailing list.