LGF Lousy with Robots Again
One of the nice things about having your web server logs stored in a database is that you can easily see where the traffic is coming from, in real time. For example, by running this query:
SELECT ip, COUNT(*) AS count, referrer, useragent
FROM `log`
WHERE created >= '2007-05-28 18:00:00'
GROUP BY ip
ORDER BY count DESC
LIMIT 0 , 300
…I can see the IP addresses that have hit our site the most frequently during the past hour or so. By using this technique, I’ve identified several ill-behaved robots/web crawlers that have been going through every page on our site as fast as possible, sucking up bandwidth like deranged vacuum cleaners, for possibly nefarious purposes.
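For anyone who wants to try the same idea without a MySQL server handy, here is a minimal sketch in Python using the standard library's sqlite3 module. The assumptions are mine: the table layout is only inferred from the column names in the query above, the sample rows are invented (using reserved documentation IP ranges), and top_clients is a hypothetical helper name. The sketch selects only the IP and the hit count, because with GROUP BY ip the referrer and useragent columns would come from an arbitrary row in each group.

```python
import sqlite3


def top_clients(conn, since, limit=300):
    """Return (ip, hits) rows for requests logged at or after `since`, busiest first."""
    return conn.execute(
        "SELECT ip, COUNT(*) AS hits FROM log "
        "WHERE created >= ? "
        "GROUP BY ip "
        "ORDER BY hits DESC, ip "
        "LIMIT ?",
        (since, limit),
    ).fetchall()


# Hypothetical log table; the columns are assumed from the query in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log (ip TEXT, created TEXT, referrer TEXT, useragent TEXT)"
)

# Invented sample rows: one client hammering the site, one ordinary visitor.
sample_rows = [
    ("192.0.2.10", "2007-05-28 18:00:05", "", "Mozilla/4.0 (compatible; MSIE 6.0)"),
    ("192.0.2.10", "2007-05-28 18:00:08", "", "Mozilla/4.0 (compatible; MSIE 6.0)"),
    ("192.0.2.10", "2007-05-28 18:00:11", "", "Mozilla/4.0 (compatible; MSIE 6.0)"),
    ("198.51.100.7", "2007-05-28 17:45:00", "", "Mozilla/5.0"),  # before the cutoff
    ("198.51.100.7", "2007-05-28 18:03:40", "", "Mozilla/5.0"),
]
conn.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", sample_rows)

# Timestamps are stored as ISO-style text, so string comparison orders them correctly.
result = top_clients(conn, "2007-05-28 18:00:00")
for ip, hits in result:
    print(ip, hits)
```

A client that requests a page every few seconds floats straight to the top of this list, which is what makes a fast crawler easy to spot.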
So far, I’ve blocked two bots from China, one from Taiwan, one from the Netherlands, and one from … digg.com. The Digg robot was crawling every page on our site with no identifying information, masquerading as Internet Explorer. I only discovered the Digg connection by looking up the IP address. I don’t know what it was doing, but it was hitting our site every few seconds all day long, and that’s simply bad form. Now blocked.
UPDATE at 5/28/07 6:25:53 pm:
Interesting. Blocking that Digg robot also seems to prevent the Digg link from working. Apparently they’ve got a robot that does some kind of checking as soon as you click their link, probably to verify that it’s not a bogus submission. I removed the block on that one for now.
UPDATE at 5/28/07 6:44:41 pm:
And as soon as I removed the block, it started running through every page on our site again.