[racket] data mining business information on web sites w/Racket

From: Noel Welsh (noelwelsh at gmail.com)
Date: Fri Mar 18 17:06:28 EDT 2011

It should be fine. Hundreds of sites is not really that many. You just
need backoffs and the like so you don't get blacklisted. Racket's sync
and friends would make this easy to implement.
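
To make the backoff idea concrete, here is a minimal sketch of retrying
with exponential backoff and random jitter. It is written in Python only
to keep it short and runnable; nothing here is Racket API. The names
backoff_delay and fetch_with_backoff, the default delays, and the
injected fetch/sleep/rng parameters are all assumptions made for
illustration. In Racket the waiting step could plausibly be expressed
with sync/timeout or alarm-evt instead of a plain sleep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound (seconds) on the wait after failed attempt `attempt` (0-based).

    Doubles on every attempt and is clamped at `cap`. The defaults are
    illustrative assumptions, not recommended values.
    """
    return min(cap, base * (2 ** attempt))


def fetch_with_backoff(
    fetch: Callable[[str], T],
    url: str,
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fetch(url), retrying with jittered exponential backoff on failure.

    `fetch` is a hypothetical caller-supplied function that raises on
    failure. `sleep` and `rng` are injectable so the logic can be
    exercised without real waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            # A real crawler would catch only network/HTTP errors here.
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # "Full jitter": wait a random fraction of the current bound,
            # so many workers do not retry in lockstep.
            sleep(rng() * backoff_delay(attempt, base, cap))
    raise ValueError("max_attempts must be at least 1")
```

A per-site politeness delay between successful requests would sit on
top of this; the sketch covers only the retry side.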

If you want to extract unstructured data, there is some good reading here:

  http://metaoptimize.com/qa/questions/3440/text-extraction-from-html-pages

Roping together various existing systems is probably the most
efficient way to get a scalable solution working. See Apache Tika and
the projects referenced above. There is also a surprising amount of
work on the "scalable web spider". (Google that phrase if you're
interested.)

HTH,
N.

On Fri, Mar 18, 2011 at 7:29 PM, Geoffrey S. Knauth <geoff at knauth.org> wrote:
> I'm evaluating whether to use Racket to data mine hundreds of websites pulling out business information within an industry.  I think Racket is up to it, but I'm wondering if anyone else has had experiences positive or negative.  I've used other tools to do rudimentary digging, but this project is likely to touch AI, which brings me back to the Lisp family.
>
> Geoff



Posted on the users mailing list.