Crawling Other Sites



Mike
03-10-2004, 09:13 AM
Hi all,

Would anyone be able to tell me how to "crawl" another website? I know there isn't one specific function, but could you give me a basic idea of what I need to know?

Thanks,
Mike

chromate
03-10-2004, 09:58 AM
I've never done it, but I guess you would have to make an HTTP request and then read the results into a variable. Then use some string functions to find what you're looking for.
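Something roughly like this, maybe (untested - file_get_contents needs allow_url_fopen turned on, and the URL and keyword are just placeholders):

<?php
// Rough idea only: fetch the page into a string, then search it.
$html = file_get_contents("http://www.example.com/");

if ($html !== false && strpos($html, "some keyword") !== false) {
    echo "Found what I was looking for!";
}
?>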

Pas mentioned a good book that discusses this in the 4fineart thread I think.

Chris
03-10-2004, 11:00 AM
Basically load the content of the page.

Parse out all html links.

Enter the URLs into an array (or db).

Cycle through the array (DB) pulling each page.

Parse out all html links....

So on and so forth.
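Very roughly, something like this (just a sketch - the start URL is a placeholder, relative links aren't resolved into full URLs, and allow_url_fopen has to be on):

<?php
// Rough crawl loop: pull a page, grab its links, queue them, repeat.
$queue = array("http://www.example.com/");
$seen  = array();

while (!empty($queue)) {
    // Stop after a handful of pages so the sketch stays bounded
    if (count($seen) >= 50) break;

    $url = array_shift($queue);
    if (isset($seen[$url])) continue;   // already pulled this page
    $seen[$url] = true;

    // Load the content of the page
    $html = @file_get_contents($url);
    if ($html === false) continue;

    // Parse out all html links
    preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);

    // Enter the URLs into the array, to be pulled on a later cycle
    foreach ($matches[1] as $link) {
        if (!isset($seen[$link])) $queue[] = $link;
    }
}
?>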

Mike
03-10-2004, 11:12 AM
Would you make an HTTP request, as chromate said, to load the content of the page, Chris?

Thanks a lot,
Mike

Chris
03-10-2004, 11:57 AM
Yes.

PHP has a file() function that can fetch the contents of a remote file.
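Something like this, for example (the URL is just a placeholder, and allow_url_fopen needs to be enabled for remote files):

<?php
// file() returns the remote page as an array of lines
$lines = file("http://www.example.com/");

if ($lines !== false) {
    echo "Fetched " . count($lines) . " lines.";
    // Join into one string if you'd rather work with the whole page at once
    $page = implode("", $lines);
}
?>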

incka
03-10-2004, 12:09 PM
If you want to fill up your server, do a complete data crawl of wikipedia.org.

Mike
03-10-2004, 12:17 PM
Originally posted by Chris
Yes.

PHP has a file() function that can fetch the contents of a remote file.

I may give it a go then :)

Is it alright to do it? Or could the site owners not like it because it's eating up their bandwidth?

Thanks very much,
Mike

r2d2
03-10-2004, 12:36 PM
They probably wouldn't like it, but what can they do?

flyingpylon
03-10-2004, 02:43 PM
Why do you need to crawl another site?

If it's just to grab everything on the site and download it, there are products that do that already. Why reinvent the wheel? If you need to be able to search that other site, there are products that do that too.

However, if you really need to parse out specific data and put it in a database, then I suppose you'd need to write your own script.

Mike
03-10-2004, 03:08 PM
Originally posted by flyingpylon

However, if you really need to parse out specific data and put it in a database, then I suppose you'd need to write your own script.

Exactly :)

Would it be legal to crawl another site then? Or could some consider it an attack?

Thanks,
Mike

incka
03-10-2004, 03:19 PM
Google do it, and I don't see them being sued for it...

r2d2
03-10-2004, 03:54 PM
You are just reading their website - they have put it there for all to read.

What exactly you are doing with it might be a problem - i.e. they can't stop you just fetching their website, but they would have a problem with you just republishing it! I'm sure that's not what you were going to do, it's just an example...

incka
03-10-2004, 03:59 PM
Yeah, only do it to sites that either allow it, like, say, an affiliate program, or a site that is open source, like Wikipedia.

Dan Morgan
03-10-2004, 04:18 PM
There are a couple of spiders available on SourceForge.net...

Mike
03-11-2004, 12:43 AM
Thanks for all the replies...

Going off topic a little here, but isn't what Google are doing against copyright laws? They are displaying part of someone's website content, aren't they?

Mike

r2d2
03-11-2004, 02:04 AM
Unlikely to be sued though, 'cos this is a benefit to the copyright holder...

It's a difficult one though... I guess it is a very small amount of any page, and they always show where it has actually come from.

Chris
03-11-2004, 07:44 AM
Not really. I believe someone tried suing them but failed, basically because there are tools for site owners to remove their site.

incka
03-11-2004, 09:18 AM
I bet the suer was google-watch.org

Mike
03-11-2004, 09:35 AM
Ok, thanks for all the responses guys :)

Mike
03-12-2004, 10:54 AM
I had a go at this last night, testing it with my site. I got to the parsing links bit though, and really didn't know where to go. So for two hours today I've been searching around php.net, but haven't found anything that works for me. I think it's something to do with preg_grep, but I can't think how it will work.

Could anyone help?

Thanks very much,
Mike

Chris
03-12-2004, 11:41 AM
You definitely need to use regular expressions.
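Something along these lines, maybe (an untested sketch - the URL is a placeholder, and the pattern is deliberately simple so it won't catch every way a link can be written):

<?php
// Pull out the href values from a fetched page
$page = implode("", file("http://www.example.com/"));

preg_match_all('/href\s*=\s*["\']?([^"\' >]+)/i', $page, $matches);

// $matches[1] now holds the link targets
foreach ($matches[1] as $link) {
    echo $link . "\n";
}
?>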

r2d2
03-12-2004, 12:11 PM
Finding links using Regular Expression Syntax (http://www.dotnetindex.com/read.asp?articleID=60)

Hope that helps.

This too: Regular Expressions in PHP (http://www.zend.com/zend/spotlight/code-gallery-wade5.php)

GCT13
03-12-2004, 12:36 PM
Thanks for those links.

Mike
03-12-2004, 12:57 PM
Yeah, thanks r2d2. I've just had a quick read, and the PHP one seems very useful :)

I made the following after reading it, but it's not working. Does anyone know what's wrong?



<?php
// Fetch the page as an array of lines
$file_lines = file("http://www.sitestem.com");

// Join the lines into one string before searching it
$page = implode("", $file_lines);

// Echo the escaped source so it shows up in the browser
foreach ($file_lines as $line) echo htmlspecialchars($line);

// Check whether the page contains "href" anywhere
$match = ereg("href", $page);

if ($match) {
    echo "yes!";
}
?>


Thanks a lot,
Mike

r2d2
04-15-2004, 03:00 PM
I'm just starting to use Snoopy, which I think came from PEAR (a PHP function library type thing). I'm using it to make a POST request to a search page. It's pretty cool stuff. You should check it out if you're still doing this.
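For example, something like this (going from memory of Snoopy's API, so double-check the class docs - the URL and form fields are just placeholders):

<?php
// Snoopy handles the HTTP details; submit() does a POST
include "Snoopy.class.php";

$snoopy = new Snoopy;
$vars = array("q" => "search term");

if ($snoopy->submit("http://www.example.com/search.php", $vars)) {
    echo $snoopy->results;   // the page the POST returned
} else {
    echo "Request failed: " . $snoopy->error;
}
?>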

mobilebadboy
04-15-2004, 03:28 PM
http://phpdig.net, unless you just absolutely want to build something yourself (which I assume you do). If it already exists, I'd rather use my time elsewhere. ;)