Insecurity of robots.txt?



Nico
04-16-2007, 09:25 AM
I just realized that anybody can access your robots.txt file and see all the pages/directories that you are listing there.
It's not that robots.txt itself is insecure, but from a web security perspective it gives too much info to a potential attacker. The only thing I have to do to find, let's say, the "Admin Panel" of a site is check their robots.txt file and see if it's there.

It's not that someone will hack your site by using robots.txt (of course not!), but it's a quick and easy way for an outsider to start gathering info for their attack.
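
To illustrate how little effort this takes, here is a minimal Python sketch that pulls every Disallow path out of a robots.txt body. The sample file contents and the `disallowed_paths` helper are made up for illustration and do not come from any real site.

```python
from typing import List


def disallowed_paths(robots_txt: str) -> List[str]:
    """Return every non-empty Disallow path found in a robots.txt body."""
    paths: List[str] = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value and value not in paths:
                paths.append(value)
    return paths


# Hypothetical robots.txt contents, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/   # hypothetical admin panel
Disallow:
"""

print(disallowed_paths(SAMPLE))  # ['/cgi-bin/', '/admin/']
```

Anything a site lists this way is readable by any visitor, not just by well-behaved crawlers.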
I like the concept of "Security through obscurity (http://en.wikipedia.org/wiki/Security_through_obscurity)". It's not perfect, but I think it's better if outsiders don't even know where my admin panel is, right?

The question is: is it worth listing our admin panel URL (or other important directories) in robots.txt?

polspoel
04-16-2007, 09:34 AM
If you like security through obscurity, you would never even link to your admin URL anywhere on your site, giving you no reason to even list it in robots.txt (since Google wouldn't know about it).

Alternatively, you could block it with a robots <meta> tag on the page itself if you think Google might somehow still reach it.

Anyway, just make sure all your files are protected and whatnot and you'll be fine.

Selkirk
04-16-2007, 09:43 AM
Yeah, the admin panel should have authentication which would keep the robots out anyway. There is no need to list it in robots.txt.
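
As a rough sketch of what "should have authentication" can mean in practice, the Python below checks an HTTP Basic `Authorization` header and answers 401 to anyone without valid credentials, crawlers included. The credentials and the `is_authorized` and `admin_status` helpers are hypothetical; a real site would store hashed passwords and serve the panel over HTTPS.

```python
import base64
import hmac
from typing import Optional

# Hypothetical credentials, for illustration only.
EXPECTED_USER = "admin"
EXPECTED_PASSWORD = "change-me"


def is_authorized(authorization_header: Optional[str]) -> bool:
    """Return True only for a valid 'Basic <base64(user:password)>' header."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # Constant-time comparisons avoid leaking how much of the value matched.
    user_ok = hmac.compare_digest(user.encode(), EXPECTED_USER.encode())
    pass_ok = hmac.compare_digest(password.encode(), EXPECTED_PASSWORD.encode())
    return user_ok and pass_ok


def admin_status(authorization_header: Optional[str]) -> int:
    """HTTP status the admin panel would answer with."""
    return 200 if is_authorized(authorization_header) else 401


print(admin_status(None))  # a crawler sends no credentials and gets 401
```

With a check like this in front of the panel, it no longer matters much whether a robot or an attacker learns the URL.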

Chris
04-16-2007, 09:47 AM
Or, if you really wanted to list it, put it in a secondary directory:

/blocked/admin128/

Deny access to /blocked/ in your robots.txt file, and there is no need to mention admin128 at all.
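
A quick way to check that this works is Python's standard `urllib.robotparser`. The sketch below assumes a two-line robots.txt and a placeholder host (`example.com`). It shows that a single `Disallow: /blocked/` rule covers `/blocked/admin128/` by prefix match, without the name admin128 ever appearing in the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: only the parent directory is named.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /blocked/",
])


def allowed(path: str) -> bool:
    """Would a rule-abiding crawler be allowed to fetch this path?"""
    # example.com is a placeholder host; only the path matters for matching.
    return rules.can_fetch("*", "http://example.com" + path)


print(allowed("/blocked/admin128/"))  # False: covered by the /blocked/ prefix
print(allowed("/index.html"))         # True: no rule matches
```

This only affects crawlers that honor robots.txt. Anyone who guesses or learns the full path can still request it, so the panel still needs authentication.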

Nico
04-16-2007, 10:38 AM
[QUOTE=polspoel]If you like security through obscurity, you would never even link to your admin URL anywhere on your site, giving you no reason to even list it in robots.txt (since Google wouldn't know about it).[/QUOTE]


Sure, but I see lots of people doing it who maybe haven't realized that listing their admin panel or other secret directories in robots.txt is not the best idea.


That's a great solution Chris. I didn't think of that.


Also, is it possible for a page that's not in your robots.txt and not linked anywhere on your site to get indexed? I don't know how... maybe through some hosting logs or something? Because if that's the case (I don't think so), just not listing your admin panel is not enough. In that case, Chris's solution could work.

Westech
04-16-2007, 10:53 AM
[QUOTE=Nico]Also, is it possible for a page that's not in your robots.txt and not linked anywhere on your site to get indexed? I don't know how... maybe through some hosting logs or something? Because if that's the case (I don't think so), just not listing your admin panel is not enough. In that case, Chris's solution could work.[/QUOTE]

I think that Google also discovers new URLs when people visit them with the Google toolbar installed. I don't have any proof of this, just anecdotal evidence: new pages not known to anyone and not linked from anywhere on the web have shown up in search results. The only way I can think of for Google to find them is the toolbar reporting back when I visit the page.

MaxS
04-16-2007, 03:48 PM
[QUOTE=Westech]I think that Google also discovers new URLs when people visit them with the Google toolbar installed. I don't have any proof of this, just anecdotal evidence: new pages not known to anyone and not linked from anywhere on the web have shown up in search results. The only way I can think of for Google to find them is the toolbar reporting back when I visit the page.[/QUOTE]
That's certainly an interesting theory. I'm not going to refute it, simply because I don't have evidence against it -- I've just never experienced that nor have I read about something like that happening.

Regarding the initial question: I never block access to pages such as the admin panel. Considering it's not linked anywhere, I see no reason to do so. That said, any page that displays private information should have some sort of authentication system to begin with. I think enough visits to the page with Alexa's toolbar installed may cause Alexa to pick up on it and list it on the site's Alexa listing.

Kyle
04-16-2007, 05:54 PM
[QUOTE=MaxS]That's certainly an interesting theory. I'm not going to refute it, simply because I don't have evidence against it -- I've just never experienced that nor have I read about something like that happening.

Regarding the initial question: I never block access to pages such as the admin panel. Considering it's not linked anywhere, I see no reason to do so. That said, any page that displays private information should have some sort of authentication system to begin with. I think enough visits to the page with Alexa's toolbar installed may cause Alexa to pick up on it and list it on the site's Alexa listing.[/QUOTE]

Alexa will index the page from your toolbar. I learned this the hard way with a non-password-protected admin area (for an insignificant site of mine).

But I agree, I don't see a reason to put your admin area in your robots file.

How would Google find it?

Jenita
04-24-2007, 05:26 AM
Well-behaved search engine programs, called robots ("bots") or spiders, are supposed to fetch the robots.txt file in your Web site root and follow the rules in it when they crawl your site to index its content. You should probably have one, if only to reduce the 404 errors in your log files, even if it is blank or just contains a rule saying “go to town, here’s my home page index.asp”. However, be careful: once you start using robots.txt, there are some considerations you may not have thought of at first.

Inspection of most robots.txt files indicates that many admins try to keep spiders out of certain areas. Take a look at ibm.com/robots.txt. In this case, IBM doesn’t want you in the images, cgi-bin, scripts, etc. However, they also list out /Admin, /webmaster, etc. If you blindly put in all the places you don't want people to go in a robots.txt file, it is just a broadcast to someone who has bad intentions as to where they should direct their attention. Hey, here’s my control panel, NixGuy666. Caveat admintor.

Now, that is not to say IBM made such a basic mistake; more likely they have a tripwire here. For example, say your robots.txt lists a directory called /controlpanel, which suggests some site backend. We might instead make /controlpanel a tripwire that sets a cookie on the visitor and logs them as a potential threat. The only people looking at this ‘hidden’ directory are those who learned about it by sniffing robots.txt. You might even consider putting a honeypot-style login page there to keep the potential snoop busy. In this case, false entries serve as the wire with some cans on it: the intruder comes stumbling in and alerts you to their potential bad intentions.
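
The tripwire idea can be sketched in a few lines of Python. This assumes requests have already been parsed into (client address, path) pairs, for example from an access log. The decoy prefix, the sample addresses, and the `tripwire_hits` helper are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical decoy paths: listed in robots.txt but never linked on the site,
# so a request for one suggests the visitor read robots.txt looking for targets.
TRIPWIRE_PREFIXES = ("/controlpanel",)


def tripwire_hits(requests: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each client address to the decoy paths it requested."""
    hits: Dict[str, List[str]] = defaultdict(list)
    for client_ip, path in requests:
        if path.startswith(TRIPWIRE_PREFIXES):
            hits[client_ip].append(path)
    return dict(hits)


# Made-up log entries using documentation-reserved addresses.
sample = [
    ("203.0.113.7", "/index.html"),
    ("203.0.113.7", "/robots.txt"),
    ("203.0.113.7", "/controlpanel/login"),
    ("198.51.100.4", "/about.html"),
]

print(tripwire_hits(sample))  # {'203.0.113.7': ['/controlpanel/login']}
```

Any address that shows up in the result requested a path it could only have learned about from robots.txt, which makes it worth a closer look in the logs.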

Consider adding a robots.txt file, if only to clean your logs. If you do, be smart and avoid revealing extra information and even consider adding tripwire URLs to monitor for. Someday, we hope to see true bot control. For now, we'll use robots.txt for its intended use… and other uses as well.