Robots.txt Verification [Archive] - Website Publisher Forums

Todd W

06-07-2007, 11:21 AM

I want to prevent search engines from accessing mysite.com/rss/whatever
and
mysite.com/rss.php?blah=whatever

I believe this is the correct syntax for robots.txt, but I wanted to verify with everyone here to be 100% sure.

User-Agent: *
Disallow: /rss/
Disallow: /rss.php

KLB

06-07-2007, 11:34 AM

Looks good to me.

MaxS

06-07-2007, 12:28 PM

Google Sitemaps has a tool in its control panel that tests your robots.txt file for you.
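
For a quick local check alongside an online tester, the rules from the first post can be run through Python's standard-library urllib.robotparser. This is only a sketch: the helper name is_blocked and the user agent "ExampleBot" are made up for illustration, and real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules proposed in the first post of this thread.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /rss/
Disallow: /rss.php
"""


def is_blocked(url, robots_txt=ROBOTS_TXT, user_agent="ExampleBot"):
    """Return True if the given robots.txt text disallows `url`.

    `user_agent` is a hypothetical crawler name; the `*` group applies to it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://mysite.com/rss/whatever",
        "http://mysite.com/rss.php?blah=whatever",
        "http://mysite.com/index.php",
    ):
        print(url, "->", "blocked" if is_blocked(url) else "allowed")
```

With this parser, both RSS URLs come back as blocked while index.php stays allowed.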

agua

06-07-2007, 04:15 PM

Do rss feeds count as duplicate content?

Todd W

06-07-2007, 06:11 PM

Do rss feeds count as duplicate content?

YES! Watch out.

KLB

06-07-2007, 06:41 PM

YES! Watch out.

Especially if others use your RSS feeds on their sites. My advice is to only provide short descriptions in RSS feeds, not entire articles.
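
One way to follow that advice when generating a feed is to trim each item's description before writing it out. The snippet below is a minimal sketch using Python's textwrap.shorten; the function name short_description and the 200-character limit are arbitrary choices for illustration, not something from this thread.

```python
import textwrap


def short_description(article_text, max_chars=200):
    """Collapse whitespace and cut the text at a word boundary.

    Text that already fits within max_chars is returned unchanged
    (apart from whitespace normalisation); longer text ends in "...".
    """
    return textwrap.shorten(article_text, width=max_chars, placeholder="...")


if __name__ == "__main__":
    article = "This is the full text of a long article. " * 20
    print(short_description(article))
```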

agua

06-07-2007, 09:20 PM

Thanks - time to edit the robots.txt on a few sites :(

Xander

06-08-2007, 10:59 AM

YES! Watch out.

Do you have a link where Google confirm that? I'm surprised if their bots can't tell the difference.

Todd W

06-08-2007, 02:05 PM

Do you have a link where Google confirm that? I'm surprised if their bots can't tell the difference.

Content is content... Google doesn't care whether it's RSS, XML, or CSV. If it's the SAME content in duplicate places, it hurts you.

More information about duplicate content and supplemental results can be found on the Google blog.

agua

06-09-2007, 05:53 PM

On this subject again, do you have any concrete proof that feeds are classed as duplicate content?

The reason I ask is that I checked the robots.txt on a few authority sites (seobook.com/robots.txt) and they didn't have feeds blocked. I also watched this WordPress SEO (http://wolf-howl.com/video/make-wordpress-search-engine-friendly/) video from Michael Gray and didn't catch any mention of blocking feeds.

Also, if you blocked them, how would that work with Google Blog Search (http://blogsearch.google.com/)?

Xander

06-10-2007, 02:22 PM

Content is content... Google doesn't care whether it's RSS, XML, or CSV. If it's the SAME content in duplicate places, it hurts you.

More information about duplicate content and supplemental results can be found on the Google blog.

It depends on how you set it up. If done correctly, it isn't counted; you just have to look at Google's blog itself.

ozgression

06-10-2007, 06:51 PM

On this subject again, do you have any concrete proof that feeds are classed as duplicate content?

I do. Google was indexing my RSS feed, and my page with the same content on it was in the supplemental index. Once I stopped the search engines from being able to index the feed (WordPress allows search engines to spider the feeds easily, btw), the RSS feed disappeared from the search engines and my page came out of the supplemental index.

agua

06-10-2007, 09:01 PM

Thanks ozgression

Todd W

06-10-2007, 10:46 PM

I do. Google was indexing my RSS feed, and my page with the same content on it was in the supplemental index. Once I stopped the search engines from being able to index the feed (WordPress allows search engines to spider the feeds easily, btw), the RSS feed disappeared from the search engines and my page came out of the supplemental index.

Exactly.

Xander

06-11-2007, 12:14 AM

Todd W

06-11-2007, 08:13 AM

Thanks for the info, I'm just surprised Googlebot is not smart enough to tell the difference when there is a clear enough difference.

No, there is no difference in content.

If Google didn't count XML/RSS feed content as the same as normal HTML content, there would be even MORE RSS scraper sites. ;)

agua

06-11-2007, 05:22 PM

This is a really interesting topic.

I've just been shown this interview with Adam Lasnik on duplicate content (http://www.stonetemple.com/articles/interview-adam-lasnik.shtml):

Eric Enge: Another issue relating to duplicate content is that RSS feeds get indexed by default. I understand that there is a mechanism now in RSS feeds to Noindex your feed.

Adam Lasnik: That's interesting. I don't think that I am familiar with that; that's not to say that it's not there or it's not a standard, but it's not something that I have explored.

Eric Enge: I have heard actually that Google supports it, but perhaps that is not true. Regardless, the goal is to reduce unintentional duplicate content much like with the printer page example you gave before.

Adam Lasnik: I will be happy to check offline with some of my colleagues to see if they are familiar with the Noindex directive within RSS. But an equally efficient way of going about that would be to put your RSS feed within a directory that is itself not crawled because of robots.txt.

Eric Enge: So, if you make the access to the feed through a directory which robots.txt indicated should not be crawled, as you suggested, you might be set.

Adam Lasnik: Yes. It's my understanding that this would also work for Google blog search as well. But, let me do some checking on that, and also I know that in my time at Google, I have not seen this to be an issue, where RSS content has in a negative way affected sites in the area of duplication, largely because it is rare that we actually list RSS feeds within our core index. I am fairly sure this wouldn't really rank as a significant concern with regards to duplicate content.

Todd W

06-11-2007, 08:21 PM

His last statement made me LOL.

I've seen plenty of RSS feeds indexed, including my own... why risk it when you can put one simple line in your robots.txt file?

agua

06-11-2007, 11:01 PM

Yeah - the overall feel of the article doesn't instill confidence :)

Todd W

07-03-2007, 11:28 AM

Wanted to bring this up again to confirm:

Disallow: /pop.php

Will this in fact disallow site.com/pop.php?whatever-is=here?

-Todd
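
On the question above: in the basic robots.txt convention, a Disallow value is matched as a prefix of the requested URL (path plus query string), so anything that starts with /pop.php should be covered. The sketch below only illustrates that prefix rule; the function is_disallowed is a hypothetical helper, it ignores wildcards and Allow lines, and individual crawlers may add their own extensions.

```python
def is_disallowed(path, disallow_rules):
    """Return True if any non-empty Disallow value is a prefix of `path`.

    `path` is the URL path plus any query string, e.g. "/pop.php?a=b".
    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    rules = ["/pop.php"]
    for path in ("/pop.php", "/pop.php?whatever-is=here", "/popular.php"):
        print(path, "->", "blocked" if is_disallowed(path, rules) else "allowed")
```

Because this is pure prefix matching, the same rule would also catch a path such as /pop.phpx, which is worth keeping in mind when choosing Disallow values.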

Webnauts

08-01-2007, 02:46 PM

Do this:

User-Agent: *
Disallow: /rss.php

You can validate your robots.txt in your Google Webmaster Tools if you have an account already, or here: http://tool.motoricerca.info/robots-checker.phtml

Some advanced tips, which apply only to Google, can be found here:
http://www.google.com/support/webmasters/bin/answer.py?answer=40367