Running a search engine: What's important?



Dan Grossman
03-07-2006, 10:09 PM
I've always wanted to set up my own search engine. I don't mean a site search, I mean a full-blown search engine with its own index of the web. With Nutch (http://lucene.apache.org/nutch/) out there, it shouldn't be too hard to get started.

I have a bunch of ideas for interesting ways of ranking websites and displaying the results that I'd love to play with. I love complex software algorithms -- there's a patent pending for one I developed for a past employer.

I have access to Drexel U's network, which is largely unpoliced and unrestricted, and I've got about 25 Mbps of bandwidth with no limits.

I can't spare the hard drive space or CPU time to do this on the three servers I have now, but I have some spare parts in my big box o' junk:
- Athlon XP 2500+ processor and motherboard
- 80GB hard drive
- SB Live! sound card
- GeForce FX5200 graphics card
- 10/100 network adapter
- 500W power supply (picked it up free-after-rebate from RadioShack)

If I order a cheap case and some memory, that'll be a complete system.

So here's a question... how much can I realistically accomplish with this? Will 512MB of memory be enough to handle a small search engine, or will it die when a couple of people are using it at once? How much of the internet can I expect to spider with only 80GB of storage available? Will this index even be large enough to see whether any algorithms I develop have interesting or useful results?

What do you think?
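
A back-of-envelope sketch of the storage and bandwidth questions above. Every input here is an assumption to be replaced with measured values: the average page size, the index overhead ratio, the disk space reserved for the OS, and the share of the link that is actually usable are all made-up placeholders, and the function names are hypothetical.

```python
def pages_that_fit(disk_gb, avg_page_kb, overhead_ratio, reserve_gb=0.0):
    """Estimate how many fetched pages fit on a disk.

    disk_gb        -- raw disk size in GB (80 for the drive in the post)
    avg_page_kb    -- ASSUMED average size of one stored page, in KB
    overhead_ratio -- ASSUMED extra space for index and link data, as a
                      fraction of the page size (0.5 means +50%)
    reserve_gb     -- ASSUMED space kept back for the OS and temp files
    """
    usable_bytes = (disk_gb - reserve_gb) * 1000**3
    bytes_per_page = avg_page_kb * 1000 * (1 + overhead_ratio)
    return int(usable_bytes // bytes_per_page)


def crawl_hours(num_pages, avg_page_kb, link_mbps, utilization=1.0):
    """Estimate the hours needed to download num_pages over a link.

    utilization is the ASSUMED fraction of the link the crawler can keep
    busy; politeness delays and slow servers usually keep it well below 1.
    """
    total_bits = num_pages * avg_page_kb * 1000 * 8
    usable_bits_per_second = link_mbps * 1_000_000 * utilization
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    pages = pages_that_fit(disk_gb=80, avg_page_kb=20, overhead_ratio=0.5, reserve_gb=10)
    hours = crawl_hours(pages, avg_page_kb=20, link_mbps=25, utilization=0.5)
    print(f"roughly {pages:,} pages, about {hours:.1f} hours of downloading")
```

Under these guesses the disk, not the 25 Mbps link, is the limiting resource. Changing the assumed page size or overhead moves the answer a lot, so the numbers are only useful once they come from a small trial crawl.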

Bleys
03-07-2006, 10:51 PM
I really can't add anything of value to this, but from Nutch's site:

"Whole-web crawling is designed to handle very large crawls which may take weeks to complete, running on multiple machines."

(this compares to 'intranet crawling', which covers crawls of about one million pages or several servers)

Dan Grossman
03-07-2006, 11:15 PM
If crawling doesn't take much in terms of CPU usage, and can be easily distributed over machines somehow, maybe I can look into it. I've got 5 or 6 machines less than two years old each that could be put to use.
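
One simple way to split a crawl over several machines, sketched here as an illustration rather than as a description of how Nutch does it, is to assign each URL to a machine by hashing its hostname. All pages of one site then land on the same machine, which makes per-site politeness delays easy to enforce. The function names and the machine count are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def machine_for_url(url, num_machines):
    """Pick a machine index for a URL based on a stable hash of its host."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; unlike the built-in hash(),
    # this value is the same on every machine and every run.
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(urls, num_machines):
    """Group URLs into one fetch list per machine."""
    buckets = [[] for _ in range(num_machines)]
    for url in urls:
        buckets[machine_for_url(url, num_machines)].append(url)
    return buckets


if __name__ == "__main__":
    seeds = [
        "http://example.com/",
        "http://example.com/about",
        "http://example.org/index.html",
        "http://example.net/a/b",
    ]
    for number, fetch_list in enumerate(partition(seeds, num_machines=6)):
        print(number, fetch_list)
```

The drawback of this scheme is that one very large site can overload a single machine, and changing the number of machines reshuffles most hosts.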

I don't really want to make anything commercial out of this, but it'd be nice to have a big enough index to play with.

FPU
03-07-2006, 11:15 PM
Dan, it never hurts to try. I have seen a guy start a spidering search engine out of his home!

AndyH
03-07-2006, 11:20 PM
I think this is something that needs trial and error.

80GB? HDs are cheap, why not get a few more?

Dan Grossman
03-07-2006, 11:23 PM
"80GB? HDs are cheap, why not get a few more?"

Cheap is a relative word. When we're talking about a toy, spending any money is hard to justify. That's something that's easy to expand in the future at least.

FPU
03-07-2006, 11:27 PM
Here is the story of UKWIZZ. This guy started it out of his home!

http://www.cre8asiteforums.com/forums/index.php?showtopic=14151&hl=ukwizz&st=0

My user name in this thread is "AC" :(

We had a nice discussion about this start-up spidering search engine. He had it going, and I looked at the site just last week or so, but it seems to be down now. Maybe he is having server problems, or he just shut it down!

Cutter
03-08-2006, 12:39 AM
This is what you need:
http://websearch.alexa.com/

I'm not much of a technical guy, but this would blow away anything you could do on a couple of spare machines.

Blue Cat Buxton
03-08-2006, 03:31 AM
Yeah, but that sorta takes the fun out of it, doesn't it?

BGray
03-08-2006, 04:10 AM
I've always wanted to do the same thing, Dan, and I did notice Nutch a little while back.

I ended up asking myself several questions that always kept me from going forward.
Without listing them all, what stopped me was the resources and time required. Additionally, I wasn't sure how I was going to offer something with a different twist that wasn't already out there. That would really be the first thing to figure out before jumping in.

Do you plan to have a niche search engine like, say, boardtracker.com, or do you want to go with a full-on internet search index?

FPU
03-08-2006, 08:56 AM
The bottom line, if you are going to use your own servers to spider sites and maintain an ever-growing index, is that you need a great data center with redundant servers in case one goes down, and great programmers to keep it all going!

If you don't have the money to do the above, there is no way to grow it!

Grant29
03-10-2006, 12:01 PM
Dan,

I would imagine you would need a lot more hard drive space. I've got some sites that have databases with over 3GB of data. This is raw data, no HTML. If you crawled some of my sites you could fill up the 80GB hard drive. I'd bet you could try crawling a large site (about.com maybe) and fill up all your space.

I don't know much about search engines; those are just my initial thoughts.

Grant

Bleys
03-10-2006, 01:01 PM
There must be some sort of compression involved, right? I can't imagine that Google has local, uncompressed copies of however many billions of pages it indexes... that would take up... an insane amount of space.
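
A quick way to see how much compression could help, as a minimal sketch using Python's standard zlib module. The sample page is synthetic and far more repetitive than most real pages, so the ratio it prints is an upper-end illustration and says nothing about what Google or any other engine actually achieves. The helper names are hypothetical.

```python
import zlib


def sample_page(rows=500):
    """Build a synthetic, markup-heavy HTML page (illustrative data only)."""
    body = "".join(
        f"<tr><td class='row'>Item {i}</td><td>{i * 7}</td></tr>\n"
        for i in range(rows)
    )
    return f"<html><body><table>\n{body}</table></body></html>".encode("utf-8")


def pack(raw, level=6):
    """Compress one page for storage."""
    return zlib.compress(raw, level)


def unpack(blob):
    """Recover the original page bytes."""
    return zlib.decompress(blob)


def compression_ratio(raw, level=6):
    """Return original size divided by compressed size."""
    return len(raw) / len(pack(raw, level))


if __name__ == "__main__":
    page = sample_page()
    print(f"{len(page)} bytes raw, {len(pack(page))} bytes compressed, "
          f"ratio {compression_ratio(page):.1f}x")
```

Running the same functions over a few hundred pages from a trial crawl would give a realistic ratio to feed into any storage estimate.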

Bleys
03-12-2006, 05:20 PM
Dan, you might be interested in this article: http://www.internetnews.com/xSP/article.php/3487041

Excerpt:


The key to the speed and reliability of Google search is cutting up data into chunks, its top engineer said.

Urs Hoelzle, Google vice president of operations and vice president of engineering, offered a rare behind-the-scenes tour of Google's architecture on Wednesday. Hoelzle spoke here at EclipseCon 2005, a conference on the open source, extensible platform for software tools.

To deal with the more than 10 billion Web pages and tens of terabytes of information on Google's servers, the company combines cheap machines with plenty of redundancy, Hoelzle said. Its commodity servers cost around $1,000 apiece, and Google's architecture places them into interconnected nodes.