Efficient low-cardinality ANDs in a search engine

Question

Efficient low-cardinality ANDs in a search engine

183 views Asked by wispi At 06 October 2020 at 19:16

How do search engines such as Lucene, etc. perform AND queries where a term is common to many documents in the dataset? For example, in an inverted index of:

term    | document_id
---------------------
program | 1, 2, 3, 5...
python  | 1, 4
code    | 4
c++     | 4, 5

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

Is there a way to perform AND queries without having to take the intersection of terms contained by potentially billions of documents?

Original Q&A

There are 1 answers

**amirouche** · Accepted Answer · 2020-10-16T21:29:31+00:00

the term program is present in several documents meaning a query of program AND code would require performing an intersection upon a very large set of documents.

Yes. Say you have the following query:

term1 AND term2 AND term3

You first need to compute the document frequency of each positive term. You pick the word with the lowest count.

You retrieve the documents that contains the least common term from the query. Those are the candidates. Then you filter and score those candidates with the query with a finite state machine.

So the database has several subspace:

A mapping from lemma or stem or term to document frequency (as in tfidf)
The actual inverted-index that allows to retrieve the documents that contains a given lemma
A mapping between the document id and the full text representation of the document or just bag of words depending on how advanced is your query logic.

Then the filter + score step can happen in parallel

TechQA.

Efficient low-cardinality ANDs in a search engine

There are 1 answers

Related Questions in SEARCH

Related Questions in INDEXING

Related Questions in LUCENE

Related Questions in SEARCH-ENGINE

Related Questions in INVERTED-INDEX

Popular Questions

Popular Tags

Trending Questions