Thread: Crawling Other Sites

  1. #1
    Registered Mike's Avatar
    Join Date
    May 2003
    Location
    UK
    Posts
    2,755

    Crawling Other Sites

    Hi all,

    Would anyone be able to tell me the function to "crawl" another website? I know there isn't one specific function, but could you give me a basic idea of what I'd need to know?

    Thanks,
    Mike
    Don't you just love free internet games ?

  2. #2
    Senior Member chromate's Avatar
    Join Date
    Aug 2003
    Location
    UK
    Posts
    2,348
    I've never done it, but I guess you would have to make an HTTP request and then read the results into a variable. Then use some string functions to find what you're looking for.
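
    A minimal sketch of that idea in Python (the thread itself doesn't settle on a language, so treat the choice as an assumption). `fetch_page` and `text_between` are made-up helper names, not functions from any library: the first makes the HTTP request and reads the result into a variable, the second uses plain string functions to pull out the part you want.

```python
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Make an HTTP GET request and return the response body as a string."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def text_between(haystack, start, end):
    """Return the text between the first `start` marker and the next `end`.

    Returns None if either marker is missing.
    """
    begin = haystack.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = haystack.find(end, begin)
    if finish == -1:
        return None
    return haystack[begin:finish]
```

    Something like `text_between(fetch_page(url), "<title>", "</title>")` would then give you the page title, for any `url` you're allowed to fetch. Plain string searching is fragile on messy HTML, so it's probably only good enough for simple, predictable pages.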

    I think Pas mentioned a good book that discusses this in the 4fineart thread.

  3. #3
    Administrator Chris's Avatar
    Join Date
    Feb 2003
    Location
    East Lansing, MI USA
    Posts
    7,055
    Basically load the content of the page.

    Parse out all html links.

    Enter the URLs into an array (or db).

    Cycle through the array (DB) pulling each page.

    Parse out all html links....

    So on and so forth.
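
    The steps above could be sketched roughly like this in Python (an assumption, since no language is fixed in the thread). `LinkParser`, `extract_links`, `crawl` and the `max_pages` limit are names invented for this example. The `fetch` argument is whatever function you use to load a page, passed in so the loop can be tried without touching a real site. One thing the steps don't spell out: you need to remember which URLs you've already seen, or pages that link to each other will keep you going round in circles.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Parse out all links in `html`, resolved against `base_url`."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Make relative links absolute and drop any #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl; returns {url: html} for every page pulled.

    `fetch` is any callable that takes a URL and returns its HTML, or
    raises OSError on failure.
    """
    queue = deque([start_url])   # URLs waiting to be pulled
    seen = {start_url}           # URLs already queued, to avoid loops
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue             # skip pages that fail to load
        pages[url] = html
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

    As written it follows links to any site it finds, so in practice you'd probably want to filter by hostname and pause between requests. For anything bigger than a small test, the `pages` dictionary would be replaced by the database mentioned above.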
    Chris Beasley - My Guide to Building a Successful Website

  4. #4
    Registered Mike's Avatar
    Join Date
    May 2003
    Location
    UK
    Posts
    2,755
    Would you make an HTTP request, as chromate said, to load the content of the page, Chris?

    Thanks a lot,
    Mike

  5. #5
    Administrator Chris's Avatar
    Join Date
    Feb 2003
    Location
    East Lansing, MI USA
    Posts
    7,055
    Yes.

    PHP has a file() function that can fetch the contents of a remote file.

  6. #6
    Registered Member incka's Avatar
    Join Date
    Aug 2003
    Location
    Wakefield, UK, EU
    Posts
    3,801
    If you want to fill up your server, do a complete data crawl of wikipedia.org.

  7. #7
    Registered Mike's Avatar
    Join Date
    May 2003
    Location
    UK
    Posts
    2,755
    Originally posted by Chris
    Yes.

    PHP has a file() function that can fetch the contents of a remote file.
    I may give it a go then.

    Is it alright to do it? Or might the site owners object because it eats up their bandwidth?

    Thanks very much,
    Mike
    Last edited by Mike; 03-10-2004 at 12:22 PM.

  8. #8
    Future AstonMartin driver r2d2's Avatar
    Join Date
    Dec 2003
    Location
    UK
    Posts
    1,608
    They probably wouldn't like it, but what can they do?

  9. #9
    Registered flyingpylon's Avatar
    Join Date
    Sep 2003
    Location
    Fishers, IN USA
    Posts
    144
    Why do you need to crawl another site?

    If it's just to grab everything on the site and download it, there are products that do that already. Why reinvent the wheel? If you need to be able to search that other site, there are products that do that too.

    However, if you really need to parse out specific data and put it in a database, then I suppose you'd need to write your own script.

  10. #10
    Registered Mike's Avatar
    Join Date
    May 2003
    Location
    UK
    Posts
    2,755
    Originally posted by flyingpylon

    However, if you really need to parse out specific data and put it in a database, then I suppose you'd need to write your own script.
    Exactly

    Is it legal to crawl another site, then? Or could some consider it an attack?

    Thanks,
    Mike

  11. #11
    Registered Member incka's Avatar
    Join Date
    Aug 2003
    Location
    Wakefield, UK, EU
    Posts
    3,801
    Google do it, and I don't see them being sued for it...

  12. #12
    Future AstonMartin driver r2d2's Avatar
    Join Date
    Dec 2003
    Location
    UK
    Posts
    1,608
    You are just reading their website - they have put it there for all to read.

    What exactly you are doing with it might be a problem - i.e. they can't stop you just fetching their website, but they would have a problem with you just republishing it! I'm sure that's not what you were going to do, it's just an example...

  13. #13
    Registered Member incka's Avatar
    Join Date
    Aug 2003
    Location
    Wakefield, UK, EU
    Posts
    3,801
    Yeah, only do it on sites that either allow it, like, let's say, an affiliate program, or a site that is open source, like Wikipedia.

  15. #15
    Registered Mike's Avatar
    Join Date
    May 2003
    Location
    UK
    Posts
    2,755
    Thanks for all the replies...

    Going off topic a little here, but isn't what Google are doing against copyright law? They are displaying part of someone's website content, aren't they?

    Mike