AskJeeves Raped my AWS Site



Peter T Davis
01-18-2005, 07:34 AM
AskJeeves raped one of my AWS sites to the tune of 2.8 Gigs so far this month, and according to my stats has delivered no clicks. Anyone else seen anything like this? What would be the best way to stop it?

Chris
01-18-2005, 07:35 AM
If you really want to forgo being listed there, you can block it with robots.txt.

Peter T Davis
01-18-2005, 07:38 AM
Do you think that would be a good idea? I just put the site up a couple of weeks ago. If it takes a gig of transfer per week at this rate and delivers no traffic, I probably should block it. Has Jeeves done anything like that to your sites? Does it eventually produce traffic, thus making it worthwhile?

MarkB
01-18-2005, 07:48 AM
Jeeves sends SOME traffic my way, but all in all it's a waste of a search engine.

(IMO:))

Chris
01-18-2005, 08:01 AM
I would wait and see if you eventually get traffic.

Personally, I have 5 terabytes of bandwidth monthly and only use a fraction of that, so this isn't something I really pay attention to.

Westech
01-18-2005, 08:36 AM
You could try putting this in your robots.txt file:

User-Agent: teoma
crawl-delay: 60

This makes Ask Jeeves wait 60 seconds between page requests. You can experiment with changing this number until you slow it down enough.

More details about controlling the teoma spider can be found here: http://sp.ask.com/docs/about/tech_teoma.html
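
One way to sanity-check a rule like the one above before uploading it is to feed it to Python's standard-library robots.txt parser. This is only a sketch: the helper name `crawl_delay_for` is made up for illustration, and it shows how Python reads the directive, not how the Teoma spider itself interprets it.

```python
from urllib.robotparser import RobotFileParser

# The two lines suggested in the post, verbatim.
ROBOTS_TXT = """\
User-Agent: teoma
crawl-delay: 60
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the crawl delay (seconds) that applies to user_agent, or None.

    Hypothetical helper: it wraps the standard-library parser only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


teoma_delay = crawl_delay_for(ROBOTS_TXT, "teoma")
other_delay = crawl_delay_for(ROBOTS_TXT, "Googlebot")
print("teoma:", teoma_delay)
print("Googlebot:", other_delay)
```

The rule is scoped to the teoma user agent, so other crawlers see no delay at all. `crawl_delay()` needs Python 3.6 or later.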

GTech
01-18-2005, 11:04 AM
Jeeves made love to one of my sites last week. Spent two days on the site, shows almost 9k hits in my web log:

AskJeeves 8872+21 31.07 MB 14 Jan 2005 - 08:26

I have blocked almost all bots from my site, except Google, Yahoo, MSN, and (for now) Jeeves. I've been checking their index every few days to see if any results show up. So far none. I'm guessing it may take a while for them to appear.

I don't want every bot in creation indexing my sites. I've had to block a few by IP because they fail to comply with the robots.txt standard and continue right on indexing even though they are in my robots.txt file.

I'm waiting to see if any results from Jeeves show up over the next month or two. If not, they will be added to the list.
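
A rough sketch of the "allow a few, block the rest" approach described above, written as a small Python script that builds the robots.txt text and then checks it with the standard-library parser. The user-agent tokens in `ALLOWED_BOTS` and both helper names are assumptions for illustration, not taken from the post; check each engine's documentation for the token it actually matches on. As noted above, this only affects bots that choose to obey robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for Google, Yahoo, MSN and Ask Jeeves.
ALLOWED_BOTS = ["Googlebot", "Slurp", "msnbot", "Teoma"]


def build_allowlist_robots(allowed):
    """Build robots.txt text: named bots may crawl everything, all others nothing."""
    lines = []
    for name in allowed:
        # An empty Disallow value means "nothing is disallowed" for this bot.
        lines += [f"User-agent: {name}", "Disallow:", ""]
    # Catch-all record for every bot not named above.
    lines += ["User-agent: *", "Disallow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, path="/"):
    """Ask the standard-library parser whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


robots_txt = build_allowlist_robots(ALLOWED_BOTS)
print(robots_txt)
print(is_allowed(robots_txt, "Googlebot"))
print(is_allowed(robots_txt, "e-SocietyRobot", "/category-140-2.html"))
```

Bots that ignore the file still have to be stopped at the server, for example by IP.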

Xander
01-18-2005, 11:50 AM
I've had a similar problem with AskJeeves on my forum. It had about 10 or 20 concurrent sessions running, so one of the moderators decided to ban the IP, and we've not had trouble from it again (but we've never had much traffic from them, so it was no problem). But I have noticed recently, as the forum has grown a lot, that there are at least two of the major search engines' bots living at the site, which is OK as I have the spare capacity, but is strange.

davesplace1
01-18-2005, 12:16 PM
Hey, I got 3 visitors from AskJeeves yesterday, time to run out and buy a new SUV :). Not much traffic from these minor search engines, but it still is traffic. I have never had a bot eat up a lot of bandwidth, but my sites are mostly text anyway.

moonshield
01-18-2005, 06:01 PM
hey jeeves come on over. :)

Todd W
01-19-2005, 01:19 AM
Why ban something that may eventually pay for itself in the future? You won't know today, and most likely not next week, but a month or two down the road, "what if" Jeeves was giving you a sale a day and you were now missing it?

1 or 2 GB per week is NOTHING. If this costs a lot for you, it's possibly time to look into a new web hosting provider or, if you have the funds and wish, a dedicated server or VPS.

Some things to think about.

moonshield
01-19-2005, 01:57 PM
yea, search engines should be allowed to play as much as they please, for they are what help drive the sites.

GTech
01-19-2005, 06:33 PM
Here is why you want to control your site with robots.txt: keep the bots you want there, and get rid of the ones you don't.

If you have a small seven page site with some text and graphics, it would not be a problem. But if you have a site that generates dynamic content and has thousands of pages of categories that lead to an incredible amount of individual pages, then letting bots have a free run is not a good idea.


This is a small capture of hundreds of pages it was running through every 9-10 seconds:

Host: 133.9.238.77

/category-140-2.html
Http Code: 403 Date: Jan 19 15:05:35 Http Version: HTTP/1.1 Size in Bytes: -
Referer: -
Agent: e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)

/category-140-3.html
Http Code: 403 Date: Jan 19 15:05:45 Http Version: HTTP/1.1 Size in Bytes: -
Referer: -
Agent: e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)

/category-140-4.html
Http Code: 403 Date: Jan 19 15:05:56 Http Version: HTTP/1.1 Size in Bytes: -
Referer: -
Agent: e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)

/category-140-5.html
Http Code: 403 Date: Jan 19 15:06:06 Http Version: HTTP/1.1 Size in Bytes: -
Referer: -
Agent: e-SocietyRobot(http://www.yama.info.waseda.ac.jp/~yamana/es/)

I have no use for this bot. Never seen it, never heard of it. It came along while google and msn both were crawling my site early in the morning. This log is from this afternoon when it came back to keep on going.
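
For anyone who wants to measure how fast a bot is hitting the site, here is a minimal Python sketch that takes the four timestamps from the capture above and works out the gap between consecutive requests. The function name `request_gaps` is made up for illustration, the year is assumed to be 2005 (the log lines do not include one), and the month abbreviation is assumed to parse under an English locale.

```python
from datetime import datetime

# Timestamps copied from the four log entries in the capture above.
LOG_TIMES = [
    "Jan 19 15:05:35",
    "Jan 19 15:05:45",
    "Jan 19 15:05:56",
    "Jan 19 15:06:06",
]


def request_gaps(timestamps, year=2005):
    """Return the whole-second gaps between consecutive log timestamps.

    The year is an assumption; the log format shown carries only month and day.
    """
    parsed = [
        datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        for stamp in timestamps
    ]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(parsed, parsed[1:])]


gaps = request_gaps(LOG_TIMES)
print(gaps)  # [10, 11, 10]
```

A bot that honored a crawl-delay of 60 would show gaps of at least 60 seconds in the same calculation.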

moonshield
01-20-2005, 05:30 PM
yea, but sometimes the unknown bots don't follow the robots.txt.

Todd W
01-21-2005, 10:03 AM
Quote (moonshield): yea, but sometimes the unknown bots don't follow the robots.txt.

And sometimes they are for new, upcoming search engines. If you want to deny yourself a possible head start over others who are not being spidered yet, well, then you should try to block them. For me, the bandwidth is not noticeable, and it would be worth it in the long run even if I paid a few extra bucks a month.

Xander
01-21-2005, 10:49 AM
It's usually only unscrupulous robots/spiders that don't follow robots.txt.