Buggy Diggbot Breaks the Rules
A robot from Digg.com has been rapidly running through everything at LGF, including images, with multiple hits per second. It’s doing this despite the following lines in our robots.txt file:
User-agent: *
Crawl-delay: 600
This rule is supposed to limit all robots to no more than one request every 600 seconds (ten minutes), and the Digg crawler is blatantly ignoring it. From the files it's requesting, it looks as if it's out of control and misreading something.
Now blocked. The IP address of the ill-behaved bot: 64.191.203.34.
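For anyone who wants to do the same, blocking a single IP only takes a couple of lines in .htaccess. A minimal sketch, assuming Apache's 2.2-era access directives (the post doesn't show the exact method used here):

# Block the misbehaving crawler by IP
Order Allow,Deny
Allow from all
Deny from 64.191.203.34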
UPDATE at 4/28/08 1:47:15 pm:
Analyzed the logs some more and figured it out: there's nothing nefarious going on. The crawler seems to have a bug; it doesn't correctly read the BASE tag in our pages, and as a result it was trying to find images in a nonexistent directory. I fixed the problem by adding a couple of mod_rewrite rules to our .htaccess file, and the Digg crawler is now unblocked.
(It was breaking the robots.txt rule, though, as it thrashed around trying to find files that didn’t exist.)
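For reference, the workaround is the kind of thing mod_rewrite handles in a few lines. A hedged sketch with hypothetical directory names, since the actual rules from our .htaccess aren't reproduced here:

# Hypothetical example only; the real paths differ.
RewriteEngine On
# When the buggy crawler asks for an image under a directory that
# doesn't exist, redirect it to the real location instead of letting
# it thrash through an endless series of 404s.
RewriteRule ^wrong-dir/images/(.+)$ /images/$1 [R=301,L]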