SysAdmin Blog

web crawler hell

bochmann Friday 25 of February, 2005
This evening I finally got fed up with the various web crawlers indexing crap on this server, such as the Wiki history pages and the cached pages.

I remembered seeing a post on the NANOG list about rewriting requests from crawlers, which pointed to this old article on the WebmasterWorld forums (cache). Starting from that, I came up with the following mod_rewrite rule set, which is probably suboptimal but seems to work in general:
 RewriteCond %{HTTP_USER_AGENT} Ask.Jeeves [OR]
 RewriteCond %{HTTP_USER_AGENT} ^FAST-WebCrawl [OR]
 RewriteCond %{HTTP_USER_AGENT} ^Googlebot [OR]
 RewriteCond %{HTTP_USER_AGENT} ^Gigabot [OR]
 RewriteCond %{HTTP_USER_AGENT} Yahoo\!\ Slurp [OR]
 RewriteCond %{HTTP_USER_AGENT} ^MMCrawler [OR]
 RewriteCond %{HTTP_USER_AGENT} ^msnbot [OR]
 RewriteCond %{HTTP_USER_AGENT} ^ia\_archiver [OR]
 RewriteCond %{HTTP_USER_AGENT} InfoSeek [OR]
 RewriteCond %{HTTP_USER_AGENT} Inktomi [NC,OR]
 RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
 RewriteCond %{HTTP_USER_AGENT} ^Teoma [OR]
 RewriteCond %{HTTP_USER_AGENT} VoilaBot
 RewriteCond %{REQUEST_URI} ^/gallery/(do_command|slideshow|poll_results|add_comment|view_photo)\.php [OR]
 # REQUEST_URI never contains the query string, so append it to the test string
 RewriteCond %{REQUEST_URI}?%{QUERY_STRING} ^/gallery/.*\?full\=1$ [OR]
 RewriteCond %{REQUEST_URI} ^/tiki/tiki-(pagehistory|view_cache|export_wiki_pages|lastchanges|wiki_rankings|orphan_pages)\.php
 RewriteRule ^/(.*) /dead-end.html? [R=seeother,L]

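The user-agent conditions are ORed together and the result is then ANDed with the ORed URI conditions, so only the listed crawlers requesting one of those URLs get redirected (with a 303) to /dead-end.html; the trailing ? drops the original query string. The URI conditions catch the most annoying requests to the image gallery and to the Wiki part of TikiWiki.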
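
To check which requests such a rule set would hit without touching the live server, the grouping can be mimicked in a few lines of Python. This is only a rough sketch and not part of the setup above: the function name is_redirected is made up for illustration, Python's re module is close to but not identical to the regex engine Apache uses, and only the two conditions that look at the path alone are modelled (the ?full=1 condition depends on the query string and is left out).

```python
import re

# User-agent patterns copied from the RewriteCond lines, with the
# unnecessary backslash escapes dropped. Inktomi is the only condition
# carrying the NC (case-insensitive) flag.
UA_PATTERNS = [
    (r"Ask.Jeeves", 0),
    (r"^FAST-WebCrawl", 0),
    (r"^Googlebot", 0),
    (r"^Gigabot", 0),
    (r"Yahoo! Slurp", 0),
    (r"^MMCrawler", 0),
    (r"^msnbot", 0),
    (r"^ia_archiver", 0),
    (r"InfoSeek", 0),
    (r"Inktomi", re.IGNORECASE),
    (r"^Scooter", 0),
    (r"^Teoma", 0),
    (r"VoilaBot", 0),
]

# Path-only URI conditions (gallery scripts and TikiWiki pages).
URI_PATTERNS = [
    r"^/gallery/(do_command|slideshow|poll_results|add_comment|view_photo)\.php",
    r"^/tiki/tiki-(pagehistory|view_cache|export_wiki_pages|lastchanges"
    r"|wiki_rankings|orphan_pages)\.php",
]


def is_redirected(user_agent, path):
    """Return True if (any UA pattern matches) AND (any URI pattern matches)."""
    ua_hit = any(re.search(p, user_agent, flags) for p, flags in UA_PATTERNS)
    uri_hit = any(re.search(p, path) for p in URI_PATTERNS)
    return ua_hit and uri_hit


if __name__ == "__main__":
    samples = [
        ("Googlebot/2.1 (+http://www.google.com/bot.html)", "/tiki/tiki-pagehistory.php"),
        ("Googlebot/2.1 (+http://www.google.com/bot.html)", "/tiki/tiki-index.php"),
        ("Mozilla/5.0 (X11; Linux)", "/gallery/slideshow.php"),
    ]
    for ua, path in samples:
        print(is_redirected(ua, path), ua, path)
```

A crawler only ends up at the dead end when both groups match; a listed crawler fetching a normal page, or a browser fetching a history page, passes through untouched.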