On the dangers of spidering badly

Somebody who lives at 208.101.36.2 decided late yesterday to run the l33t script they cobbled together for their sixth-grade class. If that person actually comes by to read this: U R such a n00b. Scat.

When writing a spidering bot, do not attempt to serially follow every link on a content page without a substantial time delay between requests. On modern websites especially, such links often represent javascript-driven disclosure effects, not actual hyperlinks. By requesting 5760 stupid links in a few seconds, your dumb bot will not merely collect a lot of redundant data; it will also seriously piss off the admins.

If you apply this stupid fast spidering technique to a dynamically-generated site, something bad will almost certainly happen: every request makes the server do real work, and the requests pile up faster than they can finish.

Eventually these bad things will happen to you.

Thus beginneth the lesson.
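
For the record, here is roughly what a less obnoxious spider might look like. This is a minimal sketch, not anybody's production crawler: the names `real_links` and `polite_crawl` are made up for illustration, `fetch` is a hypothetical callable you supply (it takes a URL and returns the page's HTML), and the 10-second default delay is a placeholder assumption, not a recommended value. It skips `javascript:` and bare `#` links, stays on the starting host, caps the page count, and sleeps between requests.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def real_links(base_url, html):
    """Return absolute http(s) links, skipping javascript: and '#' toggles."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        # Disclosure widgets usually look like href="#..." or href="javascript:..."
        if href.startswith("#") or href.lower().startswith("javascript:"):
            continue
        # Resolve relative links and drop the fragment so /a and /a#top match.
        url = urljoin(base_url, href).split("#", 1)[0]
        if urlparse(url).scheme in ("http", "https") and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def polite_crawl(start_url, fetch, delay_seconds=10.0, max_pages=50,
                 sleep=time.sleep):
    """Visit pages on the start host one at a time, sleeping between requests."""
    host = urlparse(start_url).netloc
    queue, visited = [start_url], []
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        if visited:
            sleep(delay_seconds)  # assumed placeholder delay; tune per site
        html = fetch(url)
        visited.append(url)
        for link in real_links(url, html):
            same_host = urlparse(link).netloc == host
            if same_host and link not in visited and link not in queue:
                queue.append(link)
    return visited
```

To point it at a real site you could pass something like `fetch=lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`. Keeping `fetch` and `sleep` as parameters means the loop can be exercised against canned pages without touching anybody's server.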

3 thoughts on “On the dangers of spidering badly”

  1. So … what -is- a good backoff interval [assuming the links you want to follow actually make sense to follow]? Or, for that matter, why does fast spidering piss off the admin? Because it looks like a DoS attack?

  2. I’m not sure what an appropriate interval is, but in this case the pair.com admin told me there were 315 concurrent blog processes running. Every time a link is clicked, one launches, and so many were launched in such a short time that one didn’t finish before the next started.

  3. No kidding! They’re still in action, hitting a server I run for ~900 pages in less than three minutes; I wondered why the load was over 2…

    server seems busy, (you may need to increase StartServers, or Min/MaxSpareServers), spawning 8 children, there are 4 idle, and 103 total children

    It’s like a one-person /.ing…

    # apf -d 208.101.36.2
