
Does Googlebot hit your AWS Sites Hard?



pas
08-26-2005, 02:40 PM
So far today, Googlebot has hit my AWS sites about 45,000 times. It just extensively crawled some of these sites a few days ago. I've checked the logs, and Googlebot is often requesting pages one or more times per second, which is out of line. When Googlebot is crawling (which always seems to be during peak hours), it definitely loads the CPU. I'd contact Google, but am somewhat afraid I might incur the "thin affiliate" penalty. I'm looking into using mod_bwshare or bw_mod. Anybody else getting hit hard by Googlebot with their AWS sites?
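A rough way to confirm the "one or more times per second" pattern from the logs is to count Googlebot requests per timestamp. This is only a sketch: it assumes an Apache-style access log with a bracketed one-second timestamp, and the sample lines, IP addresses, and the function name googlebot_hits_per_second are made up for illustration.

```python
import re
from collections import Counter

# Matches the bracketed timestamp of an Apache-style log line (assumed format).
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")


def googlebot_hits_per_second(lines):
    """Count requests per one-second timestamp for lines mentioning Googlebot."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        match = TIMESTAMP_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


# Made-up sample lines, not taken from any real log.
sample = [
    '192.0.2.1 - - [26/Aug/2005:14:40:01 -0400] "GET /a HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '192.0.2.1 - - [26/Aug/2005:14:40:01 -0400] "GET /b HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '192.0.2.7 - - [26/Aug/2005:14:40:01 -0400] "GET /c HTTP/1.1" 200 512 "-" "Mozilla/4.0"',
    '192.0.2.1 - - [26/Aug/2005:14:40:02 -0400] "GET /d HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
]

# Keep only the seconds in which Googlebot made more than one request.
busy = {ts: n for ts, n in googlebot_hits_per_second(sample).items() if n > 1}
print(busy)
```

Matching on the user-agent string alone can be fooled by spoofed agents, so treat the counts as an estimate.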

sandman
08-26-2005, 02:51 PM
I've heard that this is called "Google Bombing". I've had the "Become.com" robot do the same thing to me. I put a snippet in my robots.txt file that slows it down to one request every 10 seconds to ease the load. I bet you could do the same to Googlebot.

This is what I used:

User-agent: BecomeBot
Crawl-Delay: 10

This is straight off of Become.com's website about their bot.
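To check how a rule like this is read, Python's standard urllib.robotparser can parse the two lines above and report the delay per user agent. This only shows what that parser sees; whether a given crawler actually honors Crawl-Delay is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# The same two robots.txt lines quoted above.
robots_txt = """\
User-agent: BecomeBot
Crawl-Delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Delay in seconds for the named bot, or None if no rule applies to it.
become_delay = parser.crawl_delay("BecomeBot")
google_delay = parser.crawl_delay("Googlebot")

print("BecomeBot:", become_delay)
print("Googlebot:", google_delay)
```

Because the rule names only BecomeBot, any other agent gets no delay from this file.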

pas
08-26-2005, 03:09 PM
Hmm, I searched a bit and it doesn't appear that Googlebot recognizes "crawl-delay". Yahoo does (http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html), but it appears pretty well-behaved.

pas
08-26-2005, 03:14 PM
msnbot does too (http://search.msn.com/docs/siteowner.aspx?t=SEARCH_WEBMASTER_REF_RestrictAccessToSite.htm), but that's well-behaved too (aside from an isolated bombing of an RSS feed).

pas
08-26-2005, 03:27 PM
This might be something to try:

"Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell Google whether your content has changed since we last crawled your site. Supporting this feature saves you bandwidth and overhead."
http://www.google.com/webmasters/guidelines.html

That would avoid calling AWS again for whatever time period you choose.

Chris
08-26-2005, 03:46 PM
Google Bombing is something else entirely. It refers to a massive number of identical anchor text links (usually from blogs) pushing an unrelated page to the top of the search results.

James
08-26-2005, 11:06 PM
Chris is right. We all remember 'miserable failure' and that had nothing to do with Google crashing the President's biography page.

Currently, the two sites of mine that have lots of pages, and whose stats I have access to, are
http://buy-video-games.net
http://cheatfire.com

They get quite a bit of search engine crawls (or at least, what I consider quite a bit)
http://toolazytoblog.com/temp/buyvideogamesspiders.gif
http://toolazytoblog.com/temp/cheatfirespiders.gif

But I've never had it even slow down the server that I've ever noticed. Everything runs fine, and I've never seen it crash.

vatsia
08-27-2005, 12:39 AM
Have you also noticed a change in the number of pages that are indexed in Google? It visits every page, but in the end Google reduces the number of indexed pages for AWS sites :mad:

pas
08-27-2005, 10:46 AM
I switched from SOAP to REST calls - seems a bit quicker. I considered caching of various sorts, but just implemented checking for the If-Modified-Since header and sending "Not Modified" if it's within a certain time period. Should hopefully minimize the impact a great deal.
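The check described above could look roughly like the sketch below. It is not the actual code from the site: the function name should_send_not_modified and the one-hour window are assumptions for illustration. The idea is to answer 304 Not Modified when the crawler's If-Modified-Since date falls within the chosen window, and otherwise build the page (and call AWS) as usual.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def should_send_not_modified(if_modified_since, now, max_age_seconds=3600):
    """Return True if the client's copy is recent enough to answer with 304."""
    if not if_modified_since:
        return False  # no conditional header, serve the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return False  # unparseable date, serve the full page
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    age = now - since
    return timedelta(0) <= age <= timedelta(seconds=max_age_seconds)


# Hypothetical request time and header values for illustration.
now = datetime(2005, 8, 27, 10, 46, tzinfo=timezone.utc)
recent = should_send_not_modified("Sat, 27 Aug 2005 10:00:00 GMT", now)
stale = should_send_not_modified("Fri, 26 Aug 2005 14:40:00 GMT", now)
print(recent, stale)
```

A longer window saves more AWS calls but means crawlers see changes later, so the value is a trade-off rather than a fixed rule.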

chrispian
08-27-2005, 03:04 PM
Does Googlebot hit your AWS Sites Hard?

Like a pimp smacking his ho's.

James
08-27-2005, 11:42 PM
"Like a pimp smacking his ho's."

OH SNAP! :eek: DAS OFF DA HIZZEH!