Google Chrome usage of bloom filter

3.4k views Asked by At

I was reading an wikipedia article on the usage of Bloom Filters . It was mentioned in the article that Bloom filters are used by Google Chrome to detect whether a URL entered is malicious. Because of the presence of false positive

Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter and only upon a hit ,a full check of the URL is performed

I am guessing full check means that Google stores a harsh table of the list of malicious URL and the URL is hashed to checked if it is present in the table. If this is the case , isent it better to just have the hash table instead of hash table + bloom filter??

Please enlightened me on this , is my version of full check correct ???

5

There are 5 answers

0
adrian.budau On

A bloom filter for all malicous URL's is small enough to be kept on your computer and even in memory. Because almost all sites you enter are not malicous it would be better if you wouldn't do an extra request for them, that's where the bloom filter comes in. You might not feel it but for slow internet connections it's very useful.

0
Robin On

Not only is the Bloom filter much smaller and faster than a web query, it also protects Google's malicious URL API from what would otherwise be a tremendous workload.

0
Aditya On

From my understanding, bloom filters can store data efficiently in a limited amount of space. The contract of bloom filter is that it does not return false negative, however based on the vector size of your bloom filter it might return you some false positives.

Just to make sure of the false positives, Google is either using Hashing or sending that urls to their servers to recheck the url there by eliminating the load of sending all the urls to their servers.

1
testgoofy On

Google stores a filter of listed malicous URLs in a Bloom-Filter.

See here:

0
Aditya On

A Bloom filter is a Probabilistic data structure tells us that the element either definitely is not in the set or may be in the set. Bloom filter takes less space (depends on hash functions configured and error rate) compared to the Hashmap. Hashmap can determine whether an element exists or not, whereas bloomfilter can only deterministically check for non existence of an element.

Let us look at the Google Chrome use case. When a user enters a URL it should validate whether the URL is a safe or not. To validate the URL, chrome can make a call to the google serve (internally google can maintain any data structure to find this out). However, the challenge with this approach is multiple fold.For every URL request served on chrome the URL validation happens through Google Server which adds additional dependency on google servers, network roundtrip time, and requirement to maintain high availability to validate the URLs for all the URLs fired from chrome browsers across the world.

Since this data does not change very often (may be updated in an hour or so), chrome might have considered bundling all the malicious sites data as a bloom filter and google might sync this data periodically with the clients(malicious sites are less compared to the full blown websites. ). When user opens a url in chrome it checks the bloom filter if the URL does not exist it is safe. If it exists then bloom filter is not sure about it so the traffic goes to the google server for validating (this traffic will be way less compared to routing all the traffic).