How to validate GoogleBot


I want to prevent data harvesting on my site (except by Googlebot, of course). I am guessing that relying on Googlebot's User-Agent string is not strong enough, since any bot can fake it.

How can I authenticate Googlebot reliably and avoid fakes?

There are 4 answers

0
Cody Gray - on strike On BEST ANSWER

The official way is to use a combination of forward and reverse DNS lookups; they can't fake that!

More information is available on Google's Webmaster blog, in the post "How to verify Googlebot":

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

However, I recommend caching the results of this per-IP lookup and only performing it periodically so as not to introduce too much overhead through your validation process.

0
Rich Adams On

There's a post on the official Google Webmaster Blog which explains the "official way to authenticate Googlebot".

Telling webmasters to use DNS to verify on a case-by-case basis seems like the best way to go. I think the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name; eg:

> host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

> host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

I don't think just doing a reverse DNS lookup is sufficient, because a spoofer could set up reverse DNS to point to crawl-a-b-c-d.googlebot.com.

0
Igal Zeifman On

Our company (Incapsula) recently did a study of Googlebot activity that showed an average of 21% of Googlebot impersonation attempts (75% of these were directly harmful).

http://www.incapsula.com/the-incapsula-blog/item/369-was-that-really-a-google-bot-crawling-my-site

Having said that, the vulnerability continues to exist only due to carelessness, as the above-mentioned verification method is 100% foolproof.

0
user2253402 On

Googlebot uses the following ranges:

203.208.60.0/24, 66.249.64.0/20, 2001:4860:4801:2:6b00:6006:1300:b075, 2001:4860:4801:5:1000:6006:1300:b075, 2001:4860:4801:6:e300:6006:1300:b075, 2001:4860:4801:2001::6006:1300:b075, 2001:4860:4801:2002::6006:1300:b075

Bingbot IP ranges:

65.52.104.0/24, 65.52.108.0/22, 65.55.24.0/24, 65.55.52.0/24, 65.55.55.0/24, 65.55.213.0/24, 131.253.24.0/22, 131.253.46.0/23, 157.55.16.0/23, 157.55.18.0/24, 157.55.32.0/22, 157.55.36.0/24, 157.55.48.0/24, 157.55.109.0/24, 157.55.110.40/29, 157.55.110.48/28, 157.56.92.0/24, 157.56.93.0/24, 157.56.94.0/23, 157.56.229.0/24, 199.30.16.0/24, 207.46.12.0/23, 207.46.192.0/24, 207.46.195.0/24, 207.46.199.0/24, 207.46.204.0/24

Use the link below for more information:

http://myip.ms/info/bots/Google_Bing_Yahoo_Facebook_etc_Bot_IP_Addresses.html
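
Checking an address against the Googlebot ranges listed above might look like the sketch below. The list is copied verbatim from this answer and may well be out of date, so treat it as sample data; the names `GOOGLEBOT_RANGES` and `in_listed_ranges` are made up for the example. Entries without a prefix length are treated as single addresses.

```python
import ipaddress

# Copied from the answer above; these values may be stale.
GOOGLEBOT_RANGES = [
    "203.208.60.0/24",
    "66.249.64.0/20",
    "2001:4860:4801:2:6b00:6006:1300:b075",
    "2001:4860:4801:5:1000:6006:1300:b075",
    "2001:4860:4801:6:e300:6006:1300:b075",
    "2001:4860:4801:2001::6006:1300:b075",
    "2001:4860:4801:2002::6006:1300:b075",
]

# An entry with no "/prefix" becomes a single-address network (/32 or /128).
NETWORKS = [ipaddress.ip_network(entry) for entry in GOOGLEBOT_RANGES]


def in_listed_ranges(ip, networks=NETWORKS):
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr.version == net.version and addr in net
        for net in networks
    )
```

For example, `in_listed_ranges("66.249.66.1")` is true because that address lies inside `66.249.64.0/20`. A hard-coded list needs regular maintenance, so it is probably better used as a quick pre-filter alongside the DNS check than as a replacement for it.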
