WebReaper & other programs



chelle
12-03-2003, 06:18 PM
Hello,

I am new here and hope to find some help. I recently re-designed my site, added a ton of new information, plus added a forum. This morning while checking my logs I discovered that a user downloaded my entire site including the forum using WebReaper. According to the logs, the user did not obtain admin or user information (as far as I can tell).

Is there a way to prevent people from downloading my entire site using programs like this?

Any help is appreciated.

TIA,
chelle

Chris
12-03-2003, 07:38 PM
There most certainly is. It's not foolproof, since many of these programs can pretend to be legit browsers, but there are things you can do.

What type of server do you have and what technologies are you using to generate your site? (HTML, ASP, PHP, etc?)

chelle
12-03-2003, 08:14 PM
Hello Chris,

Thank you so much for replying.

We have an Apache server, using PHP and MySQL. The forum is generated using PHP 4.0.6 and MySQL, and the main site itself is plain .htm pages.

If you need to look at the site you can find it here: http://finarama.com/home.htm and there is a link to the forum from that page.

If I were to convert the main site to php and make it so users had to log in to view the material, would that prevent these programs from downloading the entire site? I was also wondering, if I made the forum available only to logged-in users, would that also help?

Thank you again for your response.

chelle

Chris
12-03-2003, 09:21 PM
Here is what I do on one site.



<?php
// Check the visitor's user agent for substrings used by known site
// rippers / offline browsers. Note: $HTTP_USER_AGENT relies on
// register_globals being on; with it off, use
// $_SERVER['HTTP_USER_AGENT'] instead.
$agent = strtolower($HTTP_USER_AGENT);
if ((strstr($agent, "rip")) ||
    (strstr($agent, "get")) ||
    (strstr($agent, "icab")) ||
    (strstr($agent, "wget")) ||
    (strstr($agent, "ninja")) ||
    (strstr($agent, "reap")) ||
    (strstr($agent, "subtract")) ||
    (strstr($agent, "offline")) ||
    (strstr($agent, "xaldon")) ||
    (strstr($agent, "ecatch")) ||
    (strstr($agent, "msiecrawler")) ||
    (strstr($agent, "rocketwriter")) ||
    (strstr($agent, "httrack")) ||
    (strstr($agent, "track")) ||
    (strstr($agent, "teleport")) ||
    (strstr($agent, "webzip")) ||
    (strstr($agent, "extractor")) ||
    (strstr($agent, "lepor")) ||
    (strstr($agent, "copier")) ||
    (strstr($agent, "disco")) ||
    (strstr($agent, "capture")) ||
    (strstr($agent, "anarch")) ||
    (strstr($agent, "snagger")) ||
    (strstr($agent, "superbot")) ||
    (strstr($agent, "strip")) ||
    (strstr($agent, "block")) ||
    (strstr($agent, "saver")) ||
    (strstr($agent, "webdup")) ||
    (strstr($agent, "webhook")) ||
    (strstr($agent, "pavuk")) ||
    (strstr($agent, "interarchy")) ||
    (strstr($agent, "blackwidow")) ||
    (strstr($agent, "w3mir")) ||
    (strstr($agent, "plucker")) ||
    (strstr($agent, "cherry"))){
    // Looks like a ripper: redirect it away instead of serving the page.
    header("Location: http://www.example.com/banned/banned.php");
    exit();
}
?>


Including that on every page should help.

Another thing you can do, after you catch someone doing this, is ban their IP address using .htaccess, something like the sketch below.
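For example, a minimal .htaccess sketch (this assumes Apache's standard access control directives are allowed in .htaccess; the IP is just a placeholder for whatever address shows up in your logs):

Order Allow,Deny
Allow from all
Deny from 1.2.3.4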

GCT13
12-03-2003, 10:28 PM
That's a nice little snippet of code.

Do you run into this problem with regularity Chris?

Chris
12-04-2003, 06:21 AM
Yes, mostly on my literature site but also on other sites. I hate "offline browsers."

chelle
12-04-2003, 05:49 PM
Hi Chris,

Thank you so much. Can you explain how I insert this into my htm pages? Also, below is a normal line from my log. I notice it uses the "GET" command. Is that the same as the "get" in the code you just posted? How will it affect my pages when loading?

Again,
Thank you so much,

chelle


207.213.164.43 - - [04/Dec/2003:14:14:46 -0800] "GET /tba/identification.htm HTTP/1.1" 200 41468 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; yie6_SBC; .NET CLR 1.1.4322)"

When previewing, the pasted log line makes the page run off the screen. Sorry about this. :(

Chris
12-04-2003, 09:23 PM
That only works on PHP pages. To install it just include it at the top (and register_globals needs to be turned on for $HTTP_USER_AGENT to be populated).

a "Get" method is the normal method for pulling up webpages. Nothing abnormal about that.

chromate
12-05-2003, 04:49 AM
My site's just been ripped as well.

This is one to add to the list...

WebCopier v3.6

Chris
12-05-2003, 06:57 AM
It's on the list.

Anything with the word "copier" is.

chelle
12-05-2003, 08:21 AM
I'm wondering if it is possible to make this into an external file and link to it from all pages?

chromate
12-05-2003, 08:25 AM
Originally posted by Chris
It's on the list.

Anything with the word "copier" is.

Oh yeah, I missed it.

chromate
12-05-2003, 08:27 AM
Originally posted by chelle
I'm wondering if it is possible to make this into an external file and link to it from all pages?

Yes, you can do that and just include(...) it on each page. Make sure it's the very first thing on the page though, before anything gets sent to the browser. Otherwise the header() redirect will fail, because headers can't be sent once output has started.

chelle
12-05-2003, 08:49 AM
I took one of my htm pages into Dreamweaver and resaved it as a php page rather than htm. I know absolutely nothing about php, and when I looked at the source it looks the same as any regular htm/html page. I'm assuming that this is correct.

To insert this code correctly using an external file, it would look something like this? If not, I'm sorry, but I'll need to be told exactly how to do this correctly.


<link rel="$agent" type="text/txt" href="anydirectory/nameoffile.txt">
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

Thanks for your help.

chelle

chromate
12-05-2003, 09:23 AM
No, you're getting really confused here :)

PHP is a scripting language. It's what we call a "server side" language. This means that it runs on the server, before the results are sent to the user's browser, as standard html. This is why when you look at the source code it looks like regular html.

To illustrate this by referring to the PHP code we're dealing with here, here's what happens:

The user arrives at your site. The user's browser makes a request for one of your site's pages. The web server then receives that request. If it's a .php file (which it is in this case), any php code within the file is run. In our case, the user's "user agent" (browser) is checked to make sure it's not a site ripper. If it is a site ripper, the server forwards the user to another page informing them that they can't access the site. If not, the normal page content is sent to the browser.

Hope that makes how it works a little clearer.

Now, we want to include the "anti site ripper" php code in every page of the site. So use this php include code at the top of each of your pages:



<?php include("anti_rip.php"); ?>


This assumes that the anti_rip.php file contains the exact code Chris pasted above. It also assumes that the anti_rip.php file is in the same directory as the php file that's trying to include it. This will most likely not be convenient, so just include the path to the anti_rip.php file in the quoted section of the function above.
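For example (the path here is just a placeholder for wherever you actually put the file):

<?php include("/path/to/anti_rip.php"); ?>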

What's happening here is the page is being constructed by "including" other files, and parsing (running) them before being sent to the browser as HTML.

Mike
12-05-2003, 09:32 AM
Not that it matters that much, but could anyone tell me what the 'strtolower' part means please?

chromate
12-05-2003, 09:38 AM
It returns a lowercase version of the string.

For example, if it wasn't used, "copier" would not match "Copier". So to keep things simple, since we don't know exactly how the user agent string will be capitalized, it's easier just to make everything lower case first. Then if it contains the substring, the check will match regardless of upper or lower case.
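A quick sketch of the difference (the agent string is made up for illustration):

<?php
$raw = "WebCopier v3.6";
// Case-sensitive check fails: "copier" is not in "WebCopier"
echo strstr($raw, "copier") ? "match" : "no match";   // prints "no match"
// Lowercase first, then check
$agent = strtolower($raw);                            // "webcopier v3.6"
echo strstr($agent, "copier") ? "match" : "no match"; // prints "match"
?>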

chelle
12-05-2003, 09:54 AM
I see. Thank you so much for the explanation. I think I'm going to have to take a crash course in converting all my pages to php. Gosh... I feel overwhelmed at the moment. PHP seems to have all these different pages that work together. I notice with my forum pages, I have a cfg file, the includes page, a header, body, and footer page, and then the view forum, view topic, etc. How it changes the body page I have no idea. I hear that php is fairly simple and once I understand it... I'm sure I'll love it.

Well... off to php school. ;)

Thanks so much for all the help. If I have a question about php... you'll find me posting to one of the other sections of this forum. ;)

chelle

Mike
12-05-2003, 12:46 PM
Originally posted by chromate
It returns a lowercase version of the string.

For example, if it wasn't used, "copier" would not equal "Copier". So to make things simple, and as we don't know specifically what we're searching for in this case, it's easier just to make everything lower case. Then if it contains the sub string, it will always be true regardless of being upper / lower case.

ta Chromate :)

chelle
12-05-2003, 02:29 PM
Originally posted by Chris
(and register_globals needs to be turned on).

Hi Chris,

We're running into a problem trying to get this to work. We think it might be the above. Could you elaborate on this a bit for me please? Where do I find this?

Thanks,
chelle

GCT13
12-05-2003, 02:34 PM
Try replacing this: $HTTP_USER_AGENT

With this: $_SERVER['HTTP_USER_AGENT']
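With that change, the top of anti_rip.php starts like this (note that $_SERVER needs PHP 4.1.0 or later; on older versions like 4.0.6 the equivalent is $HTTP_SERVER_VARS['HTTP_USER_AGENT']):

<?php
// Works whether register_globals is on or off (PHP 4.1.0+):
$agent = strtolower($_SERVER['HTTP_USER_AGENT']);
// ... the strstr() checks continue exactly as in Chris's snippet ...
?>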

chelle
12-05-2003, 07:59 PM
Hi...

I've tried creating the anti_rip.php page with both $HTTP_USER_AGENT and $_SERVER['HTTP_USER_AGENT']

I continue to get a parse error on line 2. Here is the code placed in a php page.


<? php
$agent = strtolower($HTTP_USER_AGENT);
if ((strstr($agent, "rip")) ||
(strstr($agent, "get")) ||
(strstr($agent, "icab")) ||
(strstr($agent, "wget")) ||
(strstr($agent, "ninja")) ||
(strstr($agent, "reap")) ||
(strstr($agent, "subtract")) ||
(strstr($agent, "offline")) ||
(strstr($agent, "xaldon")) ||
(strstr($agent, "ecatch")) ||
(strstr($agent, "msiecrawler")) ||
(strstr($agent, "rocketwriter")) ||
(strstr($agent, "httrack")) ||
(strstr($agent, "track")) ||
(strstr($agent, "teleport")) ||
(strstr($agent, "webzip")) ||
(strstr($agent, "extractor")) ||
(strstr($agent, "lepor")) ||
(strstr($agent, "copier")) ||
(strstr($agent, "disco")) ||
(strstr($agent, "capture")) ||
(strstr($agent, "anarch")) ||
(strstr($agent, "snagger")) ||
(strstr($agent, "superbot")) ||
(strstr($agent, "strip")) ||
(strstr($agent, "block")) ||
(strstr($agent, "saver")) ||
(strstr($agent, "webdup")) ||
(strstr($agent, "webhook")) ||
(strstr($agent, "pavuk")) ||
(strstr($agent, "interarchy")) ||
(strstr($agent, "blackwidow")) ||
(strstr($agent, "w3mir")) ||
(strstr($agent, "plucker")) ||
(strstr($agent, "cherry"))){
header("Location: http://www.finarama.com/banned.php");
exit();
}
?>

I've created a test page using my home page. It shows up in the browser just fine, but I get this space at the top of the page. Below is what the top of my page looks like.


<?php include('anti_rip.php');?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>

Then I have the rest of the html code. Anyhow... what am I doing wrong? Here is the url to the test page: http://finarama.com/phptest.php


TIA,
chelle

chelle
12-05-2003, 08:04 PM
Oh my! I tried taking out the <? php at the top and the ?> at the bottom of the anti_rip.php page, and now I don't get a parse error, but I do have a huge space at the top of my test page.

Help...

chelle

Oh... and when I view the source of the test php page, it shows the anti_rip.php page. Is this right?

GCT13
12-05-2003, 11:37 PM
That is wild. Okay chelle, you're so very close.

You do need the <?php ...... ?> in the anti_rip.php include, otherwise the server won't know that there is php code to process. Without the php tags, it'll just spit out the code as plain text, as is happening now.

From the code up top, it looks like you had a space in "<?php", which would explain the parse error on line 2. There is no space in the opening php tag: <?php

I think that should do it. Also just in case, in phptest.php, put a space before "?>" like this:

<?php include('anti_rip.php'); ?>

Let us know how it goes. Good luck!

:D

chelle
12-06-2003, 12:54 AM
Thank you so much! The test page looks perfect now. But I have one other problem. I downloaded one of the ripper programs so that I could test it. I am using Teleport Pro, and I can still download every bit of content on this page. Am I testing correctly?

TIA,

Chelle

Chris
12-06-2003, 06:25 AM
You don't even need <?php; <? by itself works fine (as long as PHP's short_open_tag setting is on, which it usually is).

Put some echos in the program for a test.



<?
$agent = strtolower($HTTP_USER_AGENT);
echo $agent; // shows the user agent the server actually sees
if ((strstr($agent, "rip")) ||
(strstr($agent, "get")) ||
(strstr($agent, "icab")) ||
(strstr($agent, "wget")) ||
(strstr($agent, "ninja")) ||
(strstr($agent, "reap")) ||
(strstr($agent, "subtract")) ||
(strstr($agent, "offline")) ||
(strstr($agent, "xaldon")) ||
(strstr($agent, "ecatch")) ||
(strstr($agent, "msiecrawler")) ||
(strstr($agent, "rocketwriter")) ||
(strstr($agent, "httrack")) ||
(strstr($agent, "track")) ||
(strstr($agent, "teleport")) ||
(strstr($agent, "webzip")) ||
(strstr($agent, "extractor")) ||
(strstr($agent, "lepor")) ||
(strstr($agent, "copier")) ||
(strstr($agent, "disco")) ||
(strstr($agent, "capture")) ||
(strstr($agent, "anarch")) ||
(strstr($agent, "snagger")) ||
(strstr($agent, "superbot")) ||
(strstr($agent, "strip")) ||
(strstr($agent, "block")) ||
(strstr($agent, "saver")) ||
(strstr($agent, "webdup")) ||
(strstr($agent, "webhook")) ||
(strstr($agent, "pavuk")) ||
(strstr($agent, "interarchy")) ||
(strstr($agent, "blackwidow")) ||
(strstr($agent, "w3mir")) ||
(strstr($agent, "plucker")) ||
(strstr($agent, "cherry"))){
// Seeing a "2" means the agent matched; the redirect below will then
// complain that headers were already sent, which is fine for a test.
echo "2";
header("Location: http://www.finarama.com/banned.php");
exit();
}
?>

chromate
12-06-2003, 07:32 AM
Technically, you should use <?php though ;)

chelle
12-06-2003, 11:49 AM
o.k., I've added the echos and tested it. It still downloaded the page. I swapped <?php for <? and tested again. That didn't work either. Then I changed ($_SERVER['HTTP_USER_AGENT']); to ($HTTP_USER_AGENT); and it put one line at the top of my page. I tested it this way also and it did not work. I checked for any spaces and there were none.

I'm sorry to keep bothering everyone about this, but I'm nearly there.... I think.

Thank you all so much for helping me figure this out.

chelle

Chris
12-06-2003, 12:50 PM
When did things echo out?

Did you ever see "2" echo out?

When something was printed at the top of your page, what was it?

Your copy of Teleport might be configured to spoof (lie about) its user agent, and that could be why your test isn't working.

chelle
12-06-2003, 01:51 PM
hmmm... I'm not sure what you mean by "echo out" or what was printed. Remember... php is all new to me, and all I did was download the page with all its contents.

Perhaps I'm not using the program correctly, or maybe it is recognizing my IP address and allowing me to do this. I really don't know. I think I'll go ahead and add the code to all the pages of the main directory and have another computer try to grab those files.

If this doesn't work, then I'll have to live with the fact that my site might be downloaded again. That's the chance we take when we put information on the net, isn't it?

I really do appreciate everyone's help here. Thank you all so much!

Chelle

chelle
12-06-2003, 02:12 PM
I just looked at the logs to see what the server was recording. It does seem that it is spoofing something, for the program's real name doesn't even come up.

Here is what it says for my IP address:

207.213.165.114 - - [06/Dec/2003:10:40:18 -0800] "GET /phptest.php HTTP/1.0" 200 39238 "-" "Mozilla/5.0 (compatible; MSIE 5.0)"
207.213.165.114 - - [06/Dec/2003:10:39:51 -0800] "GET /phptest.php HTTP/1.1" 200 39374 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; AT&T WNS5.0)"


I'm using an IE browser 6.0. I have other browsers also... perhaps I should close all browsers and then see what my log says.

Chelle

chelle
12-06-2003, 02:29 PM
I just closed everything and just ran Teleport Pro. Here is what it records on the server log:

207.213.165.114 - - [06/Dec/2003:13:01:53 -0800] "GET /phptest.php HTTP/1.0" 200 39238 "-" "Mozilla/5.0 (compatible; MSIE 5.0)"

It does look like it is spoofing, although I have no idea how it does that. :confused:

Chelle

chelle
12-06-2003, 02:35 PM
I'm sorry... I know I'm starting to jibber jabber, but I realized one thing that might make a difference.

I have not created a robots.txt file yet. I'm new to the server side of things and didn't realize I needed one; it's on my "to do" list. The Teleport program tries to read this non-existent file first, before it gathers the other files. Could this be why it is allowed to download files?

Here is the server information:
207.213.165.114 - - [06/Dec/2003:13:01:52 -0800] "GET /robots.txt HTTP/1.0" 404 273 "-" "-"


TIA,
chelle

Chris
12-06-2003, 03:05 PM
No. Teleport is a malicious program. Any program that allows a user to mask its user agent is not legitimate. It's like using a fake ID.

Teleport will not honor a robots.txt file.

Many of the programs listed in that little script can spoof the user agent, and if they do that, there is nothing the script can do about it.

Well, actually there are a couple of things you can do.

1. Monitor your stats, and if you see a large number of requests from a single IP address, investigate. If the user agent looks normal but the user is behaving like a spider (viewing each page once), then ban it.

2. Configure some kind of bottlenecking feature with Apache where a single IP is not allowed to request more than x files a minute. (I forget what this is called.) A rough sketch of the same idea in PHP follows below.
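For illustration, here's a minimal, hypothetical PHP version of that bottleneck; the counter file location and the 60-requests-a-minute limit are made-up values, and a proper Apache module is the better tool for this:

<?php
// Hypothetical per-IP rate limiter: remembers recent request times in
// a temp file and refuses service past a per-minute limit. No file
// locking, so this is a sketch of the idea, not production code.
$ip    = $_SERVER['REMOTE_ADDR'];
$file  = "/tmp/hits_" . md5($ip);  // one counter file per IP
$limit = 60;                       // max requests per 60 seconds
$now   = time();

$recent = array();
if (file_exists($file)) {
    $lines = file($file);          // one timestamp per line
    foreach ($lines as $line) {
        $t = (int) trim($line);
        if ($now - $t < 60) {
            $recent[] = $t;        // keep only the last minute's hits
        }
    }
}
$recent[] = $now;

$fp = fopen($file, "w");
fwrite($fp, implode("\n", $recent));
fclose($fp);

if (count($recent) > $limit) {
    header("HTTP/1.0 503 Service Unavailable");
    exit("Too many requests.");
}
?>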

chelle
12-06-2003, 03:40 PM
Thanks again. I'm understanding more and more as I go. I decided to do a search on "fake browsers" just to see what it would turn up. I found something interesting and quite funny at the same time. I can certainly relate to what this person did to those who tried to "grab" stuff off his server. But ultimately he states there was a better way.


There was a time when I was using a browser detection in my Server Side Includes that would basically spill about 200K of garbage down the throat of any spambot that came our way. Okay, I confess that revenge felt good, but when I thought it over I realized that I was placing more strain on our server, and by providing a huge list of bogus e-mail addresses, was placing a strain on the SMTP server that the spammer would eventually hijack. It was then that I decided to start using the RewriteEngine module.

Anyhow... I thought perhaps everyone who has tried to help me with this, and those who have been reading this thread, might find this url interesting. I'm going to read up on it; maybe this is what Chris was referring to about the Apache server.

http://bignosebird.com/apache/a9.shtml

Thanks again everyone. You've all been so kind! :D

Chelle

Chris
12-06-2003, 06:33 PM
That basically does the same thing my PHP script does (but it uses regular expressions, which probably carry more server overhead). It still won't work if someone is spoofing their user agent.
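For anyone curious, the mod_rewrite approach from that article boils down to something like this in .htaccess (a sketch with a shortened agent list, assuming mod_rewrite is enabled; [NC] makes the match case-insensitive and [F] returns 403 Forbidden):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (httrack|teleport|webzip|copier|wget) [NC]
RewriteRule .* - [F]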

I think the Apache module I was talking about is called mod_throttle.