Somebody who lives at 208.101.36.2 decided late yesterday to run their l33t script they cobbled together for their sixth-grade class. If that person actually comes by to read this: U R such a n00b. Scat.
When writing a spidering bot, do not attempt to serially follow every link on a content page without a substantial time delay between requests. On modern websites especially, such links often represent JavaScript-driven disclosure effects, not actual hyperlinks. By requesting 5760 stupid links in a few seconds, your dumb bot is not merely going to collect a lot of redundant data; it will also seriously piss off the admins.
If you apply this stupid fast spidering technique to a dynamically-generated site, you will find that something bad will almost certainly happen.
Eventually these bad things will happen to you.
Thus beginneth the lesson.
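For anyone who wants the lesson in code form, here is a rough sketch of the "substantial time delay" idea in Python. Everything named here (PoliteFetcher, is_real_link, crawl, the user-agent string) is made up for illustration, and the 5-second default is an arbitrary placeholder, not a recommendation. The point is only that each request waits until the previous one has finished and a minimum gap has passed, and that "#" and "javascript:" links get skipped, on the assumption that those are page effects rather than real pages.

```python
import time
import urllib.request


def is_real_link(href):
    """Heuristic (an assumption): skip in-page anchors and javascript: pseudo-links."""
    href = href.strip().lower()
    return bool(href) and not href.startswith(("#", "javascript:"))


class PoliteFetcher:
    """Fetch URLs one at a time, leaving at least `min_delay` seconds
    between the end of one request and the start of the next."""

    def __init__(self, min_delay=5.0, fetch=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay          # placeholder value, tune per site
        self.fetch = fetch or self._http_get
        self.clock = clock
        self.sleep = sleep
        self._last_finished = None          # when the previous request ended

    @staticmethod
    def _http_get(url):
        # Hypothetical user-agent; identify your own bot honestly.
        req = urllib.request.Request(
            url, headers={"User-Agent": "example-polite-bot/0.1"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read()

    def get(self, url):
        if self._last_finished is not None:
            wait = self.min_delay - (self.clock() - self._last_finished)
            if wait > 0:
                self.sleep(wait)
        try:
            return self.fetch(url)
        finally:
            # Record the finish time even if the request failed.
            self._last_finished = self.clock()


def crawl(hrefs, fetcher):
    """Fetch each distinct, real-looking link once, strictly in series."""
    seen = set()
    pages = {}
    for href in hrefs:
        if href in seen or not is_real_link(href):
            continue
        seen.add(href)
        pages[href] = fetcher.get(href)
    return pages
```

Because the delay is counted from when the previous request finished, a slow server automatically gets hit less often, and two requests from this bot never overlap.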
So … what -is- a good backoff interval (assuming the links you want to follow actually make sense to follow)? Or, for that matter, why does fast spidering piss off the admin: because it looks like a DoS attack?
I’m not sure what an appropriate interval is, but in this case the pair.com admin told me there were 315 concurrent blog processes running. One launches every time a link is clicked, and so many were launched in such a short time that one didn’t finish before the next started.
No kidding! They’re still in action, hitting a server I run for ~900 pages in less than three minutes; I wondered why the load was over 2…
It’s like a one-person /.ing…
# apf -d 208.101.36.2