Web Crawling?



NickC
06-05-2004, 09:07 AM
Hi,

I would like to have a program that can search an entire given web site for a specific keyword and report back to me the word following it. On each page found that has my initial keyword, I would like to search for three more keywords and have it report back the word following each of those as well. Then I would like to have all the results put into a spreadsheet-like format for me to examine on or offline for later use. Does anyone know how I would go about doing this? Thank you.

-Nick

Chris
06-05-2004, 10:55 AM
That's quite a project. It seems like the type of thing where if you have to ask how, it is beyond your abilities.

Basically you'll need to write your own web crawler. I've done this in PHP, but it is very complex and doesn't even work that well. You'll need to know a lot about regular expressions.

Basically here is how the logic of it would work.

You feed the script a URL, and it fetches the page using PHP's file function. You scan the content for anchor tags and parse out all the link URLs. You then weed out all the URLs that aren't for the local site, and put the list somewhere (an array or a database). Then you scan the content for your word. If it is found, you use string functions to get the words after it (basically I'd get the word plus the 50 characters after it, then explode the string on a space so that each word gets its own spot in an array. Then your word should be array[0], the next word array[1], etc). Then you could enter these words into your database or whatever.
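As a rough illustration of the two parsing steps described above, here is a minimal sketch in Python rather than PHP. The names LinkCollector, local_links and word_after are made up for this example, it assumes the keyword is a single word with no spaces, and it uses the standard library's HTML parser instead of regular expressions to find the anchor tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every anchor tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def local_links(html, page_url):
    """Return absolute URLs found on the page that stay on the same host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


def word_after(text, keyword, window=50):
    """Return the word following the first occurrence of keyword, or None."""
    start = text.find(keyword)
    if start == -1:
        return None
    # Keep the keyword plus the next `window` characters, then split on
    # whitespace: words[0] starts with the keyword, words[1] is the next word.
    chunk = text[start:start + len(keyword) + window]
    words = chunk.split()
    return words[1] if len(words) > 1 else None
```

Because the chunk is cut at a fixed number of characters, a very long following word could come back truncated, and the search is case-sensitive. Both are simplifications of this sketch, not requirements of the approach.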

Next you would open up your array or database where you stored the link URLs, load the URL at the top of the list, and start all over with that page. Keep track of which URLs you have already visited so the script doesn't loop forever.
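The loop over the stored URLs might look something like the sketch below, again in Python rather than PHP. Everything here is an assumption made for illustration: the function names, the max_pages limit, and the idea of passing in get_links and handle_page as callables so the loop itself stays independent of how pages are parsed. Results are collected as rows and written out as CSV, which any spreadsheet program can open.

```python
import csv
import io
from collections import deque
from urllib.request import urlopen


def fetch_page(url):
    """Download a page and return its text (assumes UTF-8, replaces bad bytes)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, get_links, handle_page, fetch=fetch_page, max_pages=100):
    """Visit pages one at a time, starting at start_url.

    get_links(html, url) returns the same-site URLs found on a page.
    handle_page(url, html) returns a row (a list) to record, or None.
    """
    queue = deque([start_url])   # URLs still waiting to be loaded
    seen = {start_url}           # every URL ever queued, to avoid repeats
    rows = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()    # load the URL at the top of the list
        visited += 1
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue             # skip pages that fail to load
        row = handle_page(url, html)
        if row is not None:
            rows.append(row)
        for link in get_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return rows


def rows_to_csv(rows, header):
    """Format the collected rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

For the original question, handle_page could look for the first keyword and, only when it is found, look for the three follow-up keywords on the same page, returning one row with the URL and the four words. A real crawler should also pause between requests and respect the site's robots.txt, which this sketch leaves out.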

chromate
06-05-2004, 12:03 PM
There is an open source spider written in PHP. I think it's called PHPSpider or something similar. It's available on SourceForge. Might be worth checking out.