Looking for a list of bad bots scraping sites and harvesting emails...



ASP-Hosting.ca
06-01-2005, 08:34 AM
Hi,

I'm looking for a list of bad bots that scrape sites and harvest emails.
I just want to put those in my .htaccess file with a rewrite condition
and get rid of them once and for all.

If you have such a list (or the .htaccess rewrite conditions), please post it here.

Thanks,

Peter

Chris
06-02-2005, 10:51 AM
Here is my list.

Will you post your .htaccess file when done?



// Lower-case the user-agent once, then check it against substrings that
// match known site rippers, offline downloaders, and email harvesters.
$agent = isset($_SERVER['HTTP_USER_AGENT'])   // $_SERVER avoids relying on register_globals
    ? strtolower($_SERVER['HTTP_USER_AGENT'])
    : '';

$bad_agents = array(
    'rip', 'get', 'icab', 'wget', 'lwp-request', 'ninja', 'reap',
    'subtract', 'offline', 'xaldon', 'ecatch', 'msiecrawler',
    'rocketwriter', 'httrack', 'track', 'teleport', 'webzip',
    'extractor', 'lepor', 'copier', 'disco', 'capture', 'anarch',
    'snagger', 'downloader', 'superbot', 'strip', 'block', 'saver',
    'webdup', 'webhook', 'pavuk', 'interarchy', 'blackwidow',
    'w3mir', 'plucker', 'naver', 'cherry',
);

foreach ($bad_agents as $bad) {
    if (strpos($agent, $bad) !== false) {
        // One simple response: refuse the request outright.
        header('HTTP/1.0 403 Forbidden');
        exit;
    }
}

Chris
06-02-2005, 08:27 PM
Two more for the list: aipbot and noxtrumbot.

Todd W
06-02-2005, 11:53 PM
Hey Chris, do you include that code on each page of your site, or...?

ASP-Hosting.ca
06-03-2005, 07:05 AM
Thanks Chris,

Here is what I found on WebmasterWorld, and I'm planning to use it.
If you see something wrong with the .htaccess below, please let me know.

RewriteEngine On
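# Send 403 Forbidden (via the RewriteRule at the end) to any request whose
# User-Agent matches one of the patterns below. Note that most patterns are
# anchored (^) and case-sensitive, so renamed or re-cased agents can slip through.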
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
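Once it's in place, I plan to sanity-check it with a rough PHP sketch like this (the URL is a placeholder; a blocked agent should get 403 Forbidden back):

// Rough test: fetch a page while claiming to be a banned agent.
// The URL is a placeholder; the first line printed should be a 403 status.
$ctx = stream_context_create(array(
    'http' => array('user_agent' => 'WebZIP/4.0'),
));
@file_get_contents('http://www.example.com/', false, $ctx);
print_r($http_response_header);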

moonshield
06-03-2005, 08:35 AM
Hey Chris, do you include that code on each page of your site, or...?

Just include it at the top of each page.
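For example, save the check as something like badbots.php (the name is just an example) and pull it in before any output:

<?php
// 'badbots.php' is an example name for the user-agent check posted
// earlier. Include it before any output so the 403 header can be sent.
include 'badbots.php';
?>

If you don't want to touch every page, the auto_prepend_file setting in php.ini can pull it in automatically.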

James
06-03-2005, 05:11 PM
What would be the difference between using Chris' method and using .htaccess to keep the bad bots out?

Chris
06-03-2005, 06:55 PM
Mine was done with PHP; it wouldn't work for an ASP page or straight HTML. The .htaccess method works with any Apache-served document.

However, PHP allows you to get more creative. In the past, when I found a bad robot, I've had PHP edit my .htaccess on the fly to perma-ban that IP. One time some guy was doing particularly malicious things, so I set up an auto-redirect to a gay porn site. You could also put parts of it under database control.
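The on-the-fly ban looks roughly like this (the path here is made up, and flock() keeps two requests from writing at once):

// Sketch of the on-the-fly ban: append a Deny line for the offending
// IP to .htaccess. The path is made up for the example.
function ban_ip($ip)
{
    $fp = fopen('/home/site/public_html/.htaccess', 'a');
    if ($fp) {
        if (flock($fp, LOCK_EX)) {
            fwrite($fp, "Deny from " . $ip . "\n");
            flock($fp, LOCK_UN);
        }
        fclose($fp);
    }
}

ban_ip($_SERVER['REMOTE_ADDR']);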

I actually plan to switch to the mod_rewrite approach for my literature site, though. I'm going to change it so that it generates static .html pages daily or weekly, as simply caching PHP queries no longer cuts it. As such, I'll need a non-PHP option.
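The generation step can be as simple as a cron script that captures a page's output (the file names here are made up):

// Hypothetical nightly cron script: render the dynamic page once and
// save the output as static HTML that Apache can serve directly.
ob_start();
include '/home/site/public_html/booklist.php'; // made-up page name
file_put_contents('/home/site/public_html/booklist.html', ob_get_clean());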

ASP-Hosting.ca
06-06-2005, 12:32 PM
I actually plan to switch to the mod_rewrite approach for my literature site, though. I'm going to change it so that it generates static .html pages daily or weekly, as simply caching PHP queries no longer cuts it. As such, I'll need a non-PHP option.

Sounds like you are doing very well with your literature site :). I remember you said that the site is on a server of its own. It's time to move to a server farm :).