[racket] data mining business information on web sites w/Racket
Oh yeah, I forgot to mention backing off and such.
When I first did Web crawling of particular sites around 10 years ago
with PLT Scheme, I had to schedule my visits with delays so as not to
abuse the site. There was a random component to the scheduling, too.
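That kind of polite scheduling can be sketched in a few lines. This is
only an illustration, not my original code; the names and the delay
values here are made up:

```racket
#lang racket
;; Hypothetical sketch of polite crawl scheduling: a fixed base delay
;; plus a random component, so requests to the same site are neither
;; too frequent nor predictably spaced.

(define base-delay-seconds 5)   ; minimum pause between requests (illustrative)
(define max-jitter-seconds 10)  ; random extra pause, up to this much (illustrative)

(define (polite-sleep)
  (sleep (+ base-delay-seconds (random max-jitter-seconds))))

;; Call polite-sleep between fetches of pages on the same site:
(define (crawl-politely urls fetch)
  (for ([url (in-list urls)])
    (fetch url)
    (polite-sleep)))
```

The random jitter matters as much as the base delay: a crawler that hits
a site on an exact fixed interval is easy to spot in the server logs.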
Note that an off-the-shelf tool might not work satisfactorily for
scraping. In one case, I also had to emulate the 2-
or 3-click path that a human user would take through the site to get to
the information, because their URLs (and the info behind them!) would
change potentially a few times a minute. (Any anti-crawler mechanisms
on this particular site were intended to thwart content-stealing
competitors, not me.) So, if you find you need natural sequencing for
time-sensitive HTTP requests within your scheduling, like I did, and you
can think of a really easy way to do that, you might find it more
expedient to hack up exactly what you need, rather than evaluate a bunch
of off-the-shelf frameworks to see whether any of them will do what you
need.
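The click-path emulation I described can be sketched roughly like this,
assuming a site-specific `extract-next-link` parser that is not shown
here (the structure is the point, not the details):

```racket
#lang racket
(require net/url)

;; Hypothetical sketch of "natural sequencing": follow the same 2- or
;; 3-click path a human would take, re-discovering each short-lived URL
;; from the page before it, with a human-ish pause between clicks.

(define (fetch u)
  (port->string (get-pure-port (string->url u))))

;; extract-next-link : string -> string
;; Assumed site-specific procedure that parses a page and returns the
;; URL of the next "click" on the path.  Not shown.
(define (follow-click-path start-url extract-next-link depth)
  (for/fold ([page (fetch start-url)])
            ([_ (in-range depth)])
    (sleep (+ 2 (random 4)))            ; pause like a human reader would
    (fetch (extract-next-link page))))  ; next URL comes from the page itself
```

Because each URL is taken fresh from the preceding page, it doesn't
matter that the site rotates its URLs every few minutes, so long as one
pass through the path completes before they change.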
Noel Welsh wrote at 03/18/2011 05:06 PM:
> It should be fine. Hundreds of sites is not really that many. You just
> need to have backoffs etc. to avoid getting blacklisted. Using sync
> and friends would make implementing this easy.
>
--
http://www.neilvandyke.org/