Accuracy of the geolite_city_bq_b2 dataset

956 views Asked by At

I believe there are inaccuracies in the BigQuery fh-bigquery.geocode.geolite_city_bq_b2 dataset, and am curious if others have noticed this too.

Background: I have the BigQuery code from Ramtin M. Seraj running, and his/my logic appear to be sound. However there are IP addresses known to represent certain places, e.g. Tokyo @ 150.249.199.17, but which are indicated by Ramtin's query to be in Rochester NY-USA or Ottawa ON-CA. If the query logic is sound then the only conclusion is that the underlying Geolite dataset is not.

To verify, look at results of this query:

SELECT *
FROM `fh-bigquery.geocode.geolite_city_bq_b2b`
WHERE classB = 38649

Note from these results that startIp = 150.245.0.0 and endIp = 150.249.255.255, therefore address 150.249.199.17 is within this IP range.

Now compare with the results from https://ipinfo.io/150.249.199.17, and also with the results from the following BigQuery. Notice that all computed values, such as IPV4_TO_INT64() of the IP address, fall within the ranges returned by the query above.

SELECT '150.249.199.17' as ipAddress
  , NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17')) AS clientIpNum_int
  , TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17'))/(256*256)) AS classB
  , CAST(TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17'))/(256*256)) as INT64) as client_classB_int

p.s. I would upvote the first answer, or add a comment, but I don't have enough Reputons yet!

1

There are 1 answers

1
Felipe Hoffa On

2019, much improved answer:

#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT null  
  GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`  
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC

I'm about to publish a much improved version of Geolite in BigQuery. Stay tuned to https://twitter.com/felipehoffa and https://medium.com/@hoffa. And I'll update this answer then too.

With that said, to answer the accuracy part which titles this question, Maxmind says:

GeoLite2 databases are free IP geolocation databases comparable to, but less accurate than, MaxMind’s GeoIP2 databases