LGF

Buggy Diggbot Breaks the Rules

Mon, Apr 28, 2008 at 12:36:30 pm PDT

A robot from Digg.com has been rapidly running through everything at LGF, including images, with multiple hits per second. It’s doing this despite the following lines in our robots.txt file:

User-agent: *
Crawl-delay: 600

This rule is supposed to limit the number of hits from any robot to no more than one every ten minutes (600 seconds), and the Digg crawler is blatantly ignoring it. From the files it’s requesting, it looks as if it’s out of control and misreading something.
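For anyone who wants to check a crawler against a rule like this, here is a minimal sketch in Python. It reads the Crawl-delay value with the standard library's `urllib.robotparser` and flags consecutive hits that arrive closer together than the delay allows. The user-agent string `ExampleBot` and the hit timestamps are made up for illustration; they are not taken from our logs.

```python
from urllib.robotparser import RobotFileParser

# The two robots.txt lines quoted in the post.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 600
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def too_fast(hit_times, delay):
    """Return pairs of consecutive hit times that are less than delay seconds apart."""
    return [(a, b) for a, b in zip(hit_times, hit_times[1:]) if b - a < delay]


# "ExampleBot" is a hypothetical user-agent; the wildcard rule applies to it.
delay = crawl_delay_for(ROBOTS_TXT, "ExampleBot")

# Hypothetical hit times in seconds: three requests within one second, then a long gap.
hits = [0.0, 0.4, 0.9, 700.0]
violations = too_fast(hits, delay)

print(f"Crawl-delay: {delay} seconds")
print(f"Hits that came too soon after the previous one: {len(violations)}")
```

A well-behaved crawler would run a check like `crawl_delay_for` before fetching and sleep for that long between requests.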

Now blocked. The IP address of the ill-behaved bot: 64.191.203.34.

UPDATE at 4/28/08 1:47:15 pm:

Analyzed the logs some more and figured it out: there’s nothing nefarious going on. The crawler seems to have a bug; it is not correctly reading the BASE tag in our pages, and it was trying to find images in a nonexistent directory as a result. I fixed the problem by adding a couple of mod_rewrite rules to our .htaccess file, and the Digg crawler is now unblocked.

(It was breaking the robots.txt rule, though, as it thrashed around trying to find files that didn’t exist.)
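To illustrate the kind of bug described above, here is a small Python sketch of how a relative image path resolves with and without the BASE tag. The URLs are hypothetical placeholders, not our real paths, and this is only a guess at the general failure mode, not the crawler's actual code.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration only.
PAGE_URL = "http://example.com/weblog/archive/2008/post.html"  # where the page lives
BASE_HREF = "http://example.com/weblog/"                       # <base href="..."> in the page
IMG_SRC = "images/photo.jpg"                                   # relative <img src="...">

# Correct behavior: resolve the BASE href first, then resolve the image against it.
correct = urljoin(urljoin(PAGE_URL, BASE_HREF), IMG_SRC)

# Buggy behavior: ignore BASE and resolve the image against the page's own URL,
# which points into a directory that does not exist on the server.
buggy = urljoin(PAGE_URL, IMG_SRC)

print("with BASE:   ", correct)
print("without BASE:", buggy)
```

A crawler that skips the BASE tag ends up requesting the second URL, gets an error, and (if it retries aggressively) hammers the server looking for files that were never there.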
