
Thread: Looking for list of bad bots scraping sites and harvesting emails...

  1. #1

    Looking for list of bad bots scraping sites and harvesting emails...

    Hi,

I'm looking for a list of bad bots that scrape sites and harvest emails.
I just want to put them in my .htaccess file with rewrite conditions
and get rid of them once and for all.

If you have such a list (or the .htaccess rewrite conditions), please post it here.

    Thanks,

    Peter

  2. #2
    Administrator Chris's Avatar
    Join Date
    Feb 2003
    Location
    East Lansing, MI USA
    Posts
    7,055
    Here is my list.

    Will you post your .htaccess file when done?

    $agent = strtolower($_SERVER['HTTP_USER_AGENT']); // the old $HTTP_USER_AGENT relied on register_globals
    // NOTE: $agent is lowercased, so every pattern must be lowercase too
    // (the original "Wg" and "Wget/" entries could never match). Redundant
    // entries ("wget" is covered by "get", "teleport pro" by "teleport",
    // and "webdup" appeared twice) have been dropped.
    if ((strstr($agent, "rip")) ||
    (strstr($agent, "get")) ||
    (strstr($agent, "icab")) ||
    (strstr($agent, "lwp-request")) ||
    (strstr($agent, "ninja")) ||
    (strstr($agent, "reap")) ||
    (strstr($agent, "subtract")) ||
    (strstr($agent, "offline")) ||
    (strstr($agent, "xaldon")) ||
    (strstr($agent, "ecatch")) ||
    (strstr($agent, "msiecrawler")) ||
    (strstr($agent, "rocketwriter")) ||
    (strstr($agent, "httrack")) ||
    (strstr($agent, "track")) ||
    (strstr($agent, "teleport")) ||
    (strstr($agent, "webzip")) ||
    (strstr($agent, "extractor")) ||
    (strstr($agent, "lepor")) ||
    (strstr($agent, "copier")) ||
    (strstr($agent, "disco")) ||
    (strstr($agent, "capture")) ||
    (strstr($agent, "anarch")) ||
    (strstr($agent, "snagger")) ||
    (strstr($agent, "downloader")) ||
    (strstr($agent, "superbot")) ||
    (strstr($agent, "strip")) ||
    (strstr($agent, "block")) ||
    (strstr($agent, "saver")) ||
    (strstr($agent, "webdup")) ||
    (strstr($agent, "webhook")) ||
    (strstr($agent, "pavuk")) ||
    (strstr($agent, "interarchy")) ||
    (strstr($agent, "blackwidow")) ||
    (strstr($agent, "w3mir")) ||
    (strstr($agent, "plucker")) ||
    (strstr($agent, "naver")) ||
    (strstr($agent, "cherry"))) {
        // The original post was cut off here; refusing the request
        // with a 403 is one reasonable action.
        header("HTTP/1.0 403 Forbidden");
        exit;
    }
    Chris Beasley - My Guide to Building a Successful Website[size=1]
    Content Sites: ABCDFGHIJKLMNOP|Forums: ABCD EF|Ecommerce: Swords Knives

  3. #3
    Administrator Chris's Avatar
    Join Date
    Feb 2003
    Location
    East Lansing, MI USA
    Posts
    7,055
    Two more to add: aipbot, noxtrumbot

  4. #4
    4x4
    Join Date
    Oct 2004
    Posts
    1,043
    Hey Chris do you include that code on each page of your site or ???

  5. #5
    Thanks Chris,

    Here is what I found on webmasterworld and I'm planning to use it.
    If you see something wrong with the .htaccess below, please let me know.

    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
    RewriteRule ^.* - [F,L]
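    For what it's worth, those conditions are ordinary case-sensitive regular expressions, and nearly all are anchored with ^ so they only match when the user agent *starts* with the pattern. The two without a leading ^ (HTTrack and Indy Library) match anywhere in the string, and [NC] makes them case-insensitive. A rough PHP equivalent of the matching logic, purely for illustration:

    ```php
    <?php
    // Illustrative only: how one RewriteCond pattern behaves in regex terms.
    // $nc mirrors Apache's [NC] (case-insensitive) flag.
    function condMatches(string $userAgent, string $pattern, bool $nc = false): bool
    {
        return (bool) preg_match('{' . $pattern . '}' . ($nc ? 'i' : ''), $userAgent);
    }

    // "^Wget" only matches when the UA string begins with Wget:
    var_dump(condMatches('Wget/1.12', '^Wget'));                        // true
    var_dump(condMatches('GNU Wget 1.12', '^Wget'));                    // false
    // "HTTrack" with [NC] matches anywhere in the string, any case:
    var_dump(condMatches('Mozilla (WinHTTrack 3.0)', 'HTTrack', true)); // true
    ```

    The anchoring matters: a bot that prepends "Mozilla/4.0 (compatible; ...)" to its real name will slip past every ^-anchored rule, which is presumably why HTTrack gets the unanchored [NC] treatment.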

  6. #6
    Registered Member moonshield's Avatar
    Join Date
    Aug 2004
    Location
    Charlotte
    Posts
    1,281
    Quote Originally Posted by 4x4
    Hey Chris do you include that code on each page of your site or ???
    Just include it.
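    In other words, the check lives in one file that every page pulls in. A minimal sketch of that include pattern (the file name blockbots.php, the abbreviated pattern list, and the 403 response are assumptions, not Chris's exact setup):

    ```php
    <?php
    // blockbots.php -- include this at the top of every PHP page.
    function isBadBot(string $userAgent): bool
    {
        $agent = strtolower($userAgent);
        // Abbreviated list for illustration; see post #2 for the full set.
        foreach (['wget', 'httrack', 'teleport', 'blackwidow', 'webzip'] as $needle) {
            if (strstr($agent, $needle) !== false) {
                return true;
            }
        }
        return false;
    }

    if (isBadBot($_SERVER['HTTP_USER_AGENT'] ?? '')) {
        header('HTTP/1.0 403 Forbidden');
        exit;
    }
    ```

    Then each page just starts with `include 'blockbots.php';` and the check runs before any content is sent.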

  7. #7
    I'm the oogie boogie man! James's Avatar
    Join Date
    Aug 2004
    Location
    Canada
    Posts
    1,566
    What would be the difference between using Chris' method and using .htaccess to keep the bad bots out?

  8. #8
    Administrator Chris's Avatar
    Join Date
    Feb 2003
    Location
    East Lansing, MI USA
    Posts
    7,055
    Mine was done with PHP, so it wouldn't work for an ASP page or straight HTML. The .htaccess method works with any Apache-served document.

    However, PHP lets you get more creative. In the past, when I found a bad robot, I've had PHP edit my .htaccess on the fly to perma-ban that IP. One time some guy was doing particularly malicious things, so I set up an auto-redirect to a gay porn site. You could also put parts of it under database control.
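    That on-the-fly perma-ban might look roughly like this (the .htaccess path and the Deny syntax are assumptions; Apache 2.4 uses "Require not ip" instead of mod_access's "Deny from"):

    ```php
    <?php
    // Append a permanent IP ban to .htaccess once a request is judged malicious.
    // LOCK_EX guards against two requests appending at the same moment.
    function banIp(string $ip, string $htaccessPath): void
    {
        $rule = "Deny from {$ip}\n"; // Apache 2.2 mod_access syntax
        file_put_contents($htaccessPath, $rule, FILE_APPEND | LOCK_EX);
    }

    // e.g. banIp($_SERVER['REMOTE_ADDR'], __DIR__ . '/.htaccess');
    ```

    The appeal is that the ban then happens at the web-server level, so the banned client never reaches PHP again.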

    I actually plan to switch to mod_rewrite for my literature site, though. I'm going to change it so that it generates .html pages daily or weekly, since simply caching PHP queries no longer cuts it. As such I'll need a non-PHP option.

  9. #9
    Quote Originally Posted by Chris
    I actually plan to switch to mod_rewrite for my literature site, though. I'm going to change it so that it generates .html pages daily or weekly, since simply caching PHP queries no longer cuts it. As such I'll need a non-PHP option.
    Sounds like you are doing very well with your literature site. I remember you said that the site is on a server of its own. It's time to move to a server farm.
