
Thread: Running a search engine: What's important?

  1. #1
    Registered
    Join Date
    Mar 2004
    Location
    Philadelphia, PA
    Posts
    106

    Running a search engine: What's important?

    I've always wanted to set up my own search engine. I don't mean a site search, I mean a full-blown search engine with its own index of the web. With Nutch out there, it shouldn't be too hard to get started.

    I have a bunch of ideas for interesting ways of ranking websites and displaying the results that I'd love to play with. I love complex software algorithms -- there's a patent pending for one I developed for a past employer.

    I have access to Drexel U's network, which is largely unpoliced and unrestricted, and I've got about 25 Mbps of bandwidth with no limits.

    I can't spare the hard drive space or CPU time to do this on the three servers I have now, but I have some spare parts in my big box o' junk:
    - Athlon XP 2500+ processor and motherboard
    - 80GB hard drive
    - SB Live! sound card
    - GeForce FX5200 graphics card
    - 10/100 network adapter
    - 500W power supply (picked it up free-after-rebate from RadioShack)

    If I order a cheap case and some memory, that'll be a complete system.

    So here's a question... how much can I realistically accomplish with this? Will 512MB of memory be enough to handle a small search engine or will it die when a couple people are using it at once? How much of the internet can I expect to spider with only 80GB of storage available? Will this index even be large enough to see if any algorithms I can develop have interesting or useful results?
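    A rough sketch of the storage question. The average page size and index overhead here are guesses (not measurements), just to get an order of magnitude:

    ```python
    # Back-of-envelope estimate of how many pages fit on an 80 GB drive.
    # AVG_PAGE_KB and INDEX_OVERHEAD are assumptions, not measured values.

    AVG_PAGE_KB = 25        # assumed average HTML page size (mid-2000s web)
    INDEX_OVERHEAD = 0.5    # assume the inverted index adds ~50% on top of raw pages
    DISK_GB = 80

    usable_bytes = DISK_GB * 1024**3
    bytes_per_page = AVG_PAGE_KB * 1024 * (1 + INDEX_OVERHEAD)
    pages = usable_bytes // bytes_per_page

    print(f"~{pages:,.0f} pages")
    ```

    Under those assumptions you land somewhere around two million pages, which is well short of "the web" but plenty for testing ranking ideas.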

    What do you think?
    Last edited by Dan Grossman; 03-07-2006 at 10:12 PM.
    I'm Dan. This is my blog. I give you... free web stats.

  2. #2
    Registered Bleys's Avatar
    Join Date
    Feb 2006
    Location
    RI-USA
    Posts
    170
    I really can't add anything of value to this, but from Nutch's site:

    "Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines."

    (This compares with 'intranet crawling', which covers crawls of about one million pages or a handful of servers.)

  3. #3
    Registered
    Join Date
    Mar 2004
    Location
    Philadelphia, PA
    Posts
    106
    If crawling doesn't take much CPU and can be distributed across machines somehow, maybe I can look into it. I've got 5 or 6 machines, each less than two years old, that could be put to use.
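    One simple way to split a crawl across a handful of machines (a hypothetical sketch, not how Nutch actually partitions work) is to hash each URL's host and assign every page from one site to the same machine, which also makes per-host politeness limits easy to enforce locally:

    ```python
    from urllib.parse import urlparse
    import hashlib

    NUM_MACHINES = 6  # assumption: the 5-6 spare machines mentioned above

    def assign_machine(url: str, num_machines: int = NUM_MACHINES) -> int:
        """Hash the URL's host so all pages from one site land on one machine."""
        host = urlparse(url).netloc.lower()
        digest = hashlib.md5(host.encode()).hexdigest()
        return int(digest, 16) % num_machines

    urls = ["http://example.com/a", "http://example.com/b", "http://drexel.edu/"]
    print([assign_machine(u) for u in urls])
    ```

    Hashing by host rather than by full URL keeps a site's crawl queue on one box, at the cost of uneven load when one host is much bigger than the rest.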

    I don't really want to make anything commercial out of this, but it'd be nice to have a big enough index to play with.
    I'm Dan. This is my blog. I give you... free web stats.

  4. #4
    Registered
    Join Date
    Mar 2006
    Posts
    330
    Dan, it never hurts to try. I've seen a guy start a spidering search engine out of his home!

  5. #5
    Senior Member AndyH's Avatar
    Join Date
    May 2004
    Location
    Australia
    Posts
    553
    I think it's something that needs trial and error.

    80GB? HDs are cheap, why not get a few more?
    New website released. ya rly!

  6. #6
    Registered
    Join Date
    Mar 2004
    Location
    Philadelphia, PA
    Posts
    106
    Quote Originally Posted by AndyH
    80GB? HDs are cheap, why not get a few more?
    Cheap is a relative word. When we're talking about a toy, spending any money is hard to justify. That's something that's easy to expand in the future at least.
    I'm Dan. This is my blog. I give you... free web stats.

  7. #7
    Registered
    Join Date
    Mar 2006
    Posts
    330
    Here is the story of UKWIZZ; this guy started it out of his home!

    http://www.cre8asiteforums.com/forum...hl=ukwizz&st=0

    My user name in this thread is "AC"

    We had a nice discussion about this start-up spidering search engine, and he had it running. I looked at the site a week or so ago, but it seems to be down now. Maybe he's having server problems, or maybe he just shut it down!

  8. #8
    Website Developer
    Join Date
    Oct 2004
    Posts
    1,607
    This is what you need:
    http://websearch.alexa.com/

    I'm not much of a technical guy, but this would blow away anything you could do on a couple of spare machines.
    Make more money - Read my Web Publishing Blog

  9. #9

  10. #10
    Registered
    Join Date
    Mar 2006
    Posts
    75
    I've always wanted to do the same thing, Dan, and I did notice Nutch a little while back.

    I ended up asking myself several questions that always kept me from going forward.
    Without listing them all, what stopped me was the resources and time required. Additionally, I wasn't sure how I was going to offer something with a different twist that wasn't already out there. That would really be the first thing to figure out before jumping in.

    Do you plan to have a niche search engine like, say, boardtracker.com, or do you want to go with a full-on internet search index?

  11. #11
    Registered
    Join Date
    Mar 2006
    Posts
    330


    The bottom line, if you are going to use your own servers to spider sites and maintain an ever-growing index, is that you need a great data center with redundant servers in case one goes down, and great programmers to keep it all going!

    If you don't have the money to do the above there is no way to grow it!
    Last edited by FPU; 03-08-2006 at 07:45 PM.

  12. #12
    Registered
    Join Date
    Mar 2006
    Posts
    17
    Dan,

    I would imagine you would need a lot more hard drive space. I've got some sites with databases holding over 3 GB of data. This is raw data, no HTML. If you crawled some of my sites, you could fill up the 80 GB hard drive. I'd bet you could try crawling a large site (about.com, maybe) and fill up all your space.

    I don't know much about search engines, that's just my initial thoughts.

    Grant

  13. #13
    Registered Bleys's Avatar
    Join Date
    Feb 2006
    Location
    RI-USA
    Posts
    170
    There must be some sort of compression involved, right? I can't imagine that Google has local, uncompressed copies of however many billions of pages it indexes... that would take up... an insane amount of space.
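    Stored pages do compress well. A quick check with zlib on deliberately repetitive HTML (illustrative only; this says nothing about Google's actual scheme, and real pages compress less than a repeated snippet does):

    ```python
    import zlib

    # Typical HTML is repetitive (tags, attributes, boilerplate), so it
    # compresses well; this synthetic sample exaggerates the effect.
    html = ("<div class='result'><a href='http://example.com'>Example</a>"
            "<p>Some snippet text here.</p></div>\n") * 200

    compressed = zlib.compress(html.encode(), level=6)
    ratio = len(compressed) / len(html.encode())
    print(f"{len(html)} bytes -> {len(compressed)} bytes ({ratio:.0%})")
    ```

    Real-world HTML usually shrinks by a factor of a few rather than the extreme ratio this repeated sample shows, but either way compression makes a big dent in storage needs.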

  14. #14
    Registered Bleys's Avatar
    Join Date
    Feb 2006
    Location
    RI-USA
    Posts
    170
    Dan, you might be interested in this article: http://www.internetnews.com/xSP/article.php/3487041

    Excerpt:

    The key to the speed and reliability of Google search is cutting up data into chunks, its top engineer said.

    Urs Hoelzle, Google vice president of operations and vice president of engineering, offered a rare behind-the-scenes tour of Google's architecture on Wednesday. Hoelzle spoke here at EclipseCon 2005, a conference on the open source, extensible platform for software tools.

    To deal with the more than 10 billion Web pages and tens of terabytes of information on Google's servers, the company combines cheap machines with plenty of redundancy, Hoelzle said. Its commodity servers cost around $1,000 apiece, and Google's architecture places them into interconnected nodes.
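    The "chunks plus redundancy" idea from the article can be sketched in a few lines: split data into fixed-size chunks and store each chunk on several machines. This is a toy illustration, not Google's file system; the sizes and placement policy are made up:

    ```python
    # Toy sketch of chunking with replication, in the spirit of the article:
    # split data into fixed-size chunks, place each chunk on several nodes.

    CHUNK_SIZE = 64   # bytes here, for demonstration; real systems use megabytes
    REPLICAS = 3
    NUM_NODES = 10

    def chunk_and_place(data: bytes) -> dict[int, list[int]]:
        placements = {}
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk_id = offset // CHUNK_SIZE
            # Round-robin placement; real systems pick nodes by load and rack.
            nodes = [(chunk_id + r) % NUM_NODES for r in range(REPLICAS)]
            placements[chunk_id] = nodes
        return placements

    placements = chunk_and_place(b"x" * 300)
    print(placements)
    ```

    With every chunk on three nodes, any single machine can die and all data remains readable, which is how cheap $1,000 boxes add up to a reliable whole.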

