On the dangers of spidering badly

Somebody who lives at 208.101.36.2 decided late yesterday to run the l33t script they cobbled together for their sixth-grade class. If that person actually comes by to read this: U R such a n00b. Scat.

When writing a spidering bot, do not attempt to serially follow every link on a content page without a substantial time delay between requests. On modern websites especially, such links often represent JavaScript-driven disclosure effects, not actual hyperlinks. By hitting 5760 stupid links in a few seconds, your dumb bot is not merely going to collect a lot of redundant data; it will also seriously piss off the admins.
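For the script-cobblers in the audience, here is a minimal sketch of what the better-behaved version might look like, in Python. Everything named here is made up for illustration: real_links, crawl_politely, and the fetch callable are hypothetical, and the 10-second default delay is an arbitrary placeholder, not a tested recommendation. The idea is just to throw away pseudo-links (javascript:, same-page anchors, repeats) before following anything, and to pause between requests.

```python
import time
from urllib.parse import urldefrag, urljoin, urlparse


def real_links(base_url, hrefs):
    """Keep only hrefs that lead to a different HTTP(S) document."""
    seen = set()
    result = []
    base_doc = urldefrag(base_url)[0]
    for href in hrefs:
        # Resolve relative links and drop the #fragment part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # javascript:, mailto:, and similar pseudo-links
        if absolute == base_doc or absolute in seen:
            continue  # same-page anchors (disclosure toggles) and repeats
        seen.add(absolute)
        result.append(absolute)
    return result


def crawl_politely(urls, fetch, delay_seconds=10.0, sleep=time.sleep):
    """Fetch urls one at a time, pausing delay_seconds between requests.

    fetch is any callable taking a URL and returning the page body;
    delay_seconds is an assumed value, so tune it to the site in question.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # never fire the next request immediately
        pages[url] = fetch(url)
    return pages
```

Passing sleep in as a parameter is only there so the pacing can be checked without actually waiting; in real use you would leave the default and hand in whatever HTTP fetching function you already have.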

If you apply this stupid fast spidering technique to a dynamically-generated site, you will find that something bad will almost certainly happen.

Eventually these bad things will happen to you.

Thus beginneth the lesson.

Alex said,

October 16, 2006 @ 10:06 pm

So … what -is- a good backoff interval (assuming the links you want to follow actually make sense to follow)? Or, for that matter, why does fast spidering piss off the admin? Because it looks like a DoS attack?

Tozier said,

October 16, 2006 @ 10:39 pm

I’m not sure what an appropriate interval is, but in this case the pair.com admin told me there were 315 concurrent blog processes running. Every time a link is clicked, one launches, and so many launched in such a short time that each one hadn’t finished before the next started.

Nemo said,

December 29, 2006 @ 4:38 am

No kidding! They’re still in action, hitting a server I run for ~900 pages in less than three minutes; I wondered why the load was over 2…

server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 4 idle, and 103 total children

It’s like a one-person /.ing…

# apf -d 208.101.36.2
