Finding possibility of occurrence of words in a document using bloom filter

1.3k views Asked by At

I have a list of words like - list1 = [boy, apple, mango, car] and I have two documents with following content:

document1= The boy driving a car ate apple and mango.
document2= The boy ate an apple.

I just need to find out if the given list of words exist in the document.

In order to check if the words in list1 exist in document I can create a bloom filter for list1 (say bloomlist1) and a bloom filter for document1 (say bloomdocument1). Then I can perform bitwise anding and check if the result is same as bloomlist1. If its same, I can say that all words in list1 exist in document1. So, it will return True.

If I do the same approach for document2 i.e. do bitwise anding, then the result will be False. But I need to get True as a result even if a single word on the list is contained in the document.

Is this possible with bloom filter or do I need any other data structure. If no, what can be the best data structure which fulfills both the criteria.

3

There are 3 answers

1
NealB On BEST ANSWER

I don't think your use of Bloom filters is appropriate.

You state that two filters need to be constructed, one for the word list and another one for the document being searched. Then you bitwise AND the filters. If the result is the same as the original word list filter the document is accepted, otherwise you reject it.

If this is a correct understanding of your process it is quite clear that a document can only pass if it contains all of the words from the list (or by means of hash collisions a few extra bits were set in the document Bloom filter which might cause its acceptance - possibly resulting in a false positive).

If you want to select a document as soon as one word from the list is matched, you only need one Bloom filter for the word list (none for the document being tested). Hash each word from the document, one by one, using the Bloom filter hash functions. Accept the document as soon as the resulting hash matches on all corresponding bits in the word list Bloom filter (this is a hit). You then need to verify that the hit was not due to a false positive.

The "beauty" of a Bloom filter is that it does not suffer from false negatives. That is, if the none of the words from your document have a "hit" in the Bloom filter you can be 100 percent sure that none of the words in the document occur in the associated word list.

One problem you are facing is that Bloom filters are prone to false positives. A false positive is when there is a hit on the Bloom filter for a word not in the associated list. Because of this, you must explicitly verify the "truth" of every hit using the actual word list whenever the Bloom filter indicates a "hit". There is no way around it.

The "art" in constructing a good Bloom filter is to find a set of hashing functions that are cheap to execute and lead a low false positive hit rate (generally this equates to independence and good distribution among the hash functions). A quick google search on Bloom filters will give you plenty of guidance with respect to how many hash functions and how large the filter needs to be to achieve a given false positive rate (assuming good hashing functions).

If you do the calculations you will find that for a word list of any significant size you will need to use at least 7 hash functions to achieve an acceptable false-positive rate. Running 7 hash functions against every word in a large document is going to be "expensive" no matter how you do it. Furthermore, if your documents contain a large number of different words the probablilty of suffering a false positive hit on the Bloom filter may become significant - decreasing its usefulness.

Other answers here suggest that a better approach would be to construct a simple hash table for words in the list then hash each word from the document to see if it has a hit in the list. If so, select the document. Simple. This technique will most likely out perform the Bloom filter approach for this type of application unless there are some very special circumstances you have not outlined in your question.

EDIT - What is the best approach?

There are many ways of doing multi string searches on a given text. It is difficult to say which one is going to be the best solution for your application because most algorithms and their implementations are sensitive to complicating factors (eg. average word size, number of distinct words, document size, number of words in the search list, available memory, probability of having a "hit", etc.). There is no one correct answer here.

Using a Bloom filter may very well be a reasonable approach, however, you should at least look at other alternatives. A few examples might be:

The bottom line is you should look at a wider range of strategies before settling on any specific solution.

8
Bernhard Barker On

To check whether any of the words in the list exist in the document:

  • Create a bloom filter.
  • Insert all the words in the list into it.
  • Then check whether each word in the document is contained in the bloom filter. As soon as you find one, you can return true. If you don't find one, return false.

A bloom filter of course has a possibility of false positives - it can return true even if the word doesn't exist. To avoid this, you can use a hash table instead (in the same way as described above) - this would use a bit more memory though.

3
Jim Mischel On

If you're searching millions of documents for occurrences of a known set of words, then a Bloom filter is not the best choice. You end up doing an insert into the Bloom filter for every word in each document, and for each document you end up having to check the Bloom filter for each word to determine if it existed in the document.

If all you need to know is whether any one of your words exists in a document, you can build a hash table of the words you want to check, and test each word in the document. So, for example:

hashTable = {"boy", "mango", "car", "apple"}
for each document
{
    found = false
    for each word in document
    {
        if word in hash table
        {
            found = true
            break  // found a word. Skip the rest of the document.
        }
        if found then output success else output failure
    }
}

This will be better than the Bloom filter approach for several reasons:

  • A hash table lookup will in general be faster than setting multiple bits in a Bloom filter.
  • With the hash table you can skip much of a document if you find that it contains one of the words.
  • You only have to initialize the hash table once. You'd have to clear the Bloom filter for each document.
  • The size of your required Bloom filter will depend on the number of unique words in a document. If your Bloom filter is too small then the rate of false positives can be unreasonably high.