How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:
term | document_id
---------------------
program | 1, 2, 3, 5...
python | 1, 4
code | 4
c++ | 4, 5
the term program
is present in several documents meaning a query of program AND code
would require performing an intersection upon a very large set of documents.
Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?
Yes. Say you have the following query:
You first need to compute the document frequency of each positive term. You pick the word with the lowest count.
You retrieve the documents that contains the least common term from the query. Those are the candidates. Then you filter and score those candidates with the query with a finite state machine.
So the database has several subspace:
Then the filter + score step can happen in parallel