Sunday, November 28, 2010

Extracting content from web pages for text mining

For the last three days I have been looking for ways and tools to extract content (text) and data (email addresses, for a start) from web pages.

The best tool I've found so far is Web Harvest:

It is definitely not easy to use (it took me nearly a day to understand exactly how it works), but it looks like the most powerful of the tools I've evaluated.

I started by modifying the simple site crawler example to make it more flexible and to have it collect all email addresses found on a specified web site.
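To make the idea concrete, here is a minimal Python sketch of the same approach: crawl the pages of one host breadth-first and collect anything that looks like an email address. This is not the Web Harvest configuration itself. The names `collect_emails`, `fetch`, `LinkParser` and `EMAIL_RE` are made up for illustration, and the regular expression is deliberately simple, so it will miss obfuscated addresses and may match strings that are not valid emails.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

# Assumption: a deliberately loose pattern, good enough for a first pass.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return it as text (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_emails(start_url, max_pages=50, fetch_page=fetch):
    """Breadth-first crawl of a single host; returns the set of emails found."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    emails = set()
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        emails.update(EMAIL_RE.findall(html))
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            # Stay on the starting host; this also skips mailto: links.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return emails
```

Calling `collect_emails("http://example.com/")` (a placeholder URL) would crawl up to 50 pages of that host. The `fetch_page` parameter exists only so the download step can be swapped out, for example for a cached or throttled version.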

The source of my version of the site crawler/email collector is available here:

Then I wrote a simple configuration file to collect email addresses from the first 100 links returned by a Google search for a specified key phrase.

The source code is available here:

Now I'm moving on to the topic I had in mind when I started working on this project: extracting text content from web pages and analyzing it.
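As a starting point for the extraction step, here is a minimal Python sketch that reduces an HTML page to its visible text using only the standard library. It is an assumption about what "text content" could mean in the simplest case: everything except the contents of `<script>` and `<style>` elements. It makes no attempt to separate the main article from navigation or other page chrome. The names `TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)
```

The resulting string can then be fed to whatever analysis comes next, such as tokenizing and counting words.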

First step, researching the topic: Michael J. Giarlo, "A Comparative Analysis of Keyword Extraction Techniques"

