Bot Traffic Identification Strategies


Bots. Hate 'em.

What I need to accomplish

Like everyone else, I want to count how many times pieces of content get featured or displayed as links, without those numbers being artificially inflated by web bots.

Why we can't just rely on Google Analytics

GA does a nice job of validating the numbers it reports, but it only reports the main URL, not "related items". Since "related items" differ per page view and per user, we have to track those ourselves.

GA is a good standard against which to check our numbers, but that's it.

What I've done so far

  • Authenticated users are never hassled
  • The firewall maintains an IP address blacklist
  • Applications keep track of known bots
  • Nightly roll-up jobs trawl our logs looking for the following signals (see the sketch after this list):
      • Sustained bursts of requests (high pages per second for more than x seconds)
      • Requests from contiguous blocks of IP addresses (x.y.z.245, .246, .247, .248, etc. cannot coincidentally be traipsing through our content at the same time)
      • A landing page followed by requests for every page, in order, in rapid succession (humans rarely read every article, and not that quickly)
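To make the nightly job concrete, here is roughly its shape as a Python sketch. The log regex (combined log format), the catalog_order list, and every threshold constant are placeholders for illustration, not our production values.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Illustrative thresholds -- placeholders, not tuned values.
BURST_RPS = 5          # sustained requests/second no human produces...
BURST_SECONDS = 10     # ...kept up for at least this long
SUBNET_MIN_IPS = 4     # distinct IPs from one /24 active in the same log
SEQ_MIN_PAGES = 20     # pages fetched in exact site order

# Apache/nginx combined-log-format prefix: ip - - [timestamp] "GET /path ..."
LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+)')

def parse(line):
    m = LOG_RE.match(line)
    if not m:
        return None
    ip, ts, path = m.groups()
    return ip, datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z"), path

def flag_bursts(times_by_ip):
    """Signal 1: sustained high request rate from a single IP."""
    flagged = set()
    for ip, times in times_by_ip.items():
        times.sort()
        lo = 0
        for hi, t in enumerate(times):
            # Shrink the window until it spans at most BURST_SECONDS.
            while t - times[lo] > timedelta(seconds=BURST_SECONDS):
                lo += 1
            if hi - lo + 1 >= BURST_RPS * BURST_SECONDS:
                flagged.add(ip)
                break
    return flagged

def flag_subnets(times_by_ip):
    """Signal 2: a block of neighboring IPv4 addresses crawling together."""
    by_subnet = defaultdict(set)
    for ip in times_by_ip:
        by_subnet[ip.rsplit(".", 1)[0]].add(ip)   # group by /24
    return {ip for ips in by_subnet.values() if len(ips) >= SUBNET_MIN_IPS
            for ip in ips}

def flag_sequential(paths_by_ip, catalog_order):
    """Signal 3: landing page, then every page, in order."""
    rank = {p: i for i, p in enumerate(catalog_order)}
    flagged = set()
    for ip, paths in paths_by_ip.items():
        ranks = [rank[p] for p in paths if p in rank]
        if len(ranks) >= SEQ_MIN_PAGES and \
           all(a < b for a, b in zip(ranks, ranks[1:])):
            flagged.add(ip)
    return flagged

def trawl(log_lines, catalog_order):
    times_by_ip, paths_by_ip = defaultdict(list), defaultdict(list)
    for line in log_lines:
        parsed = parse(line)
        if parsed:
            ip, ts, path = parsed
            times_by_ip[ip].append(ts)
            paths_by_ip[ip].append(path)
    return (flag_bursts(times_by_ip) | flag_subnets(times_by_ip)
            | flag_sequential(paths_by_ip, catalog_order))
```

The set of IPs that trawl() returns is what gets subtracted from that night's counts and fed back into the firewall blacklist.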

What I am looking for

Not vague advice, but actionable algorithms, best practices, or articles that describe how approaches were actually implemented, with at least some pseudo-code snippets. I don't expect a silver bullet, but I know there are ways to approach this problem that I haven't seen. I just need a good white paper or something.

What I have seen a million times

  • "our company implements a multi-tiered approach with browser challenges and backend analytics, and blah blah blah" Sounds great, i'm sure the investors loved it, how about an actual example?
  • So tired of CIO-Speak. "we put hidden fields on our forms, it's called a honeypot!"
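(And yes, I know what the honeypot trick is. The whole thing fits in a few lines, which is exactly why hearing it announced as a strategy wears thin. A sketch, with a made-up field name:)

```python
# The entire "honeypot" technique: one extra form field that real
# users never see (hidden via CSS), so any submission that fills it
# in came from a naive form-stuffing bot. Field name is arbitrary.
#
#   <input type="text" name="website_url" class="hp-hidden"
#          tabindex="-1" autocomplete="off">

def is_probably_bot(form_data: dict) -> bool:
    # Humans leave the hidden field empty; bots fill every input.
    return bool(form_data.get("website_url", "").strip())
```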

Anything actually actionable would be most gratefully appreciated!!!
