Stop the google madness!



Todd W
02-06-2007, 12:39 AM
OK... other than Disallow in robots.txt, what else can be done to BLOCK Google from spidering my site?

I want/need the AdSense bot on index.php, but not on index.php?ANYTHING. Is this possible?


This is not a joke.

kdb003
02-06-2007, 12:51 AM
I think this is what you are looking for:

User-agent: *
Disallow: /index.php?

Todd W
02-06-2007, 01:00 AM
I think this is what you are looking for:

User-agent: *
Disallow: /index.php?

Before I do that, I really want to make sure it will only block index.php?* and not index.php itself.

Also, if Google follows a link from, say, www.yoursite.com to www.mysite.com/index.php, does it STILL read robots.txt, or does it only read robots.txt if it starts on my site and spiders my pages?

Todd W
02-06-2007, 01:05 AM
To remove dynamically generated pages, you'd use this robots.txt entry:

User-agent: Googlebot
Disallow: /*?


Looks like it will work... hmm. I got the above from: http://www.google.com/support/webmasters/bin/answer.py?answer=35303

Yet when I go to

http://services.google.com:8882/urlconsole/controller (Google URL remover)

They then say:
URLs cannot have wild cards in them (e.g. "*"). The following line contains a wild card:
DISALLOW /*?


So in one place Google says "do this" and in another they say "it won't work".

:( Hmmm

kdb003
02-06-2007, 01:07 AM
I am very confident that this is the correct robots.txt. I suggest you use Google's robots.txt checker in Webmaster Tools to be absolutely sure.

http://www.google.com/webmasters/sitemaps/

And yes, Google will check robots.txt when it spiders your pages, no matter where it comes from.

Todd W
02-06-2007, 01:16 AM
I am very confident that this is the correct robots.txt. I suggest you use Google's robots.txt checker in Webmaster Tools to be absolutely sure.

http://www.google.com/webmasters/sitemaps/

And yes, Google will check robots.txt when it spiders your pages, no matter where it comes from.

Well, I used what Google suggested first for blocking dynamic content from being spidered, then used the robots.txt analysis tool, and I get:
URL Googlebot
http://www.mysite.com/ Allowed

:flare:

Todd W
02-06-2007, 01:19 AM
I tried:
User-agent: *
Disallow: /index.php?

Like you said, and it too said Googlebot allowed :eek:

kdb003
02-06-2007, 01:19 AM
Well, I used what Google suggested first for blocking dynamic content from being spidered, then used the robots.txt analysis tool, and I get:
URL Googlebot
http://www.mysite.com/ Allowed

:flare:

I am confused by your smiley. Aren't you trying to allow mysite.com/ and trying to disallow mysite.com/index.php?...

Try adding some extra lines of URLs you want to test.

Either robots.txt should work.

Todd W
02-06-2007, 01:22 AM
I am confused by your smiley. Aren't you trying to allow mysite.com/ and trying to disallow mysite.com/index.php?...

Try adding some extra lines of URLs you want to test.

Either robots.txt should work.

Yes, trying to block index.php?*

It's not saying anything is blocked...

Todd W
02-06-2007, 01:26 AM
I added:

User-agent: Googlebot-Image
Disallow: /

And now it says:

Googlebot-Image
Allowed Syntax not understood


Methinks it's not working.

kdb003
02-06-2007, 01:27 AM
Yes, trying to block index.php?*

It's not saying anything is blocked...

OK, in the robots.txt checker add a few extra lines under http://mysite.com

i.e.
http://mysite.com/index.php
http://mysite.com/index.php?q=3439345983457

The last one should be the only one that is blocked.

Todd W
02-06-2007, 01:29 AM
I got it to work now.

It's PICKY about the URLs you want to test :ladysman:

Todd W
02-06-2007, 01:30 AM
For the record this works perfectly:

User-agent: Googlebot
Disallow: /*?

User-agent: Googlebot-Image
Disallow: /

I was not typing a full valid URL to test.

index.php works.
index.php?anything is blocked.
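
For the AdSense part from my first post, I'm thinking a separate block for the AdSense crawler (it identifies itself as Mediapartners-Google) should let it keep reading everything. Just a sketch, not tested yet:

User-agent: Mediapartners-Google
Disallow:

(An empty Disallow line should mean "allow everything" for that user-agent.)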

Todd W
02-06-2007, 08:46 AM
For the record... the two sites I put this on had Googlebot re-attempt to download robots.txt this morning, one 30 minutes ago and one 2 hours ago.

Status 404 (Not found)

Yet if I go to /robots.txt it's clearly there... wow, Google!! Stop ignoring it!

Chris
02-06-2007, 09:27 AM
Have you tried putting the meta robots tag directly on the pages you do not want indexed? It will not stop Google from viewing the pages, but it should stop them from indexing the pages.
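
Something like this in the <head> of each page you want kept out of the index (just a sketch; adjust the content value to taste, e.g. "noindex, nofollow" if you don't want links followed either):

<meta name="robots" content="noindex, follow">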

Todd W
02-06-2007, 09:45 AM
Have you tried putting the meta robots tag directly on the pages you do not want indexed? It will not stop Google from viewing the pages, but it should stop them from indexing the pages.

They're dynamically created based on what the user wants, and I haven't figured out how to place stuff in the <head> tag of those pages yet, but that's the idea for the future too (proxy site).

Chris
02-06-2007, 10:33 AM
What about running the proxy off a directory instead of a file then? I've found it easier to ban based on directory than on file, and you might even be able to do some .htaccess mod_rewrite to reinforce the robots.txt.
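
Roughly like this, as a sketch (untested; "proxy" here is just a placeholder for whatever directory you end up using):

RewriteEngine On
# Return 403 to Googlebot for anything under /proxy/, as a backstop to robots.txt
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^proxy/ - [F]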

Todd W
02-06-2007, 06:30 PM
Google updated its copy of robots.txt again for one of my sites today and is now blocking what I need... they must load a cached copy of it and then re-verify later. Or it sure seems that's what they did :)

Now waiting for the others to catch up :lol: