Given a query, how does Google determine which documents to display?

I'm curious about the intricacies of search. I understand that tf-idf is used to evaluate the importance of a word in a document within a corpus. I also understand that the PageRank algorithm ranks the relative importance of a web page, using the probability that the page is viewed as a heuristic. However, I'm not sure how the two interact for a specific query.

Intuitively, I would think that a language model would be used to rank documents, and that this relates to tf-idf. But how does the PageRank algorithm relate to document retrieval?

Answer by bdean20:

Ranking and retrieval are separate functions of a search engine.

The purpose of the retrieval component is to decide which documents are worth ranking. The purpose of the ranking component is to decide which of those documents are most relevant to the query. PageRank is applied in the ranking phase as one of the factors that determine how relevant a document is to the query. This works for a web search engine because you typically want to find web pages that other people have also found useful.
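The split between the two phases can be sketched in a few lines. This is only a toy illustration, not how Google does it: the corpus, the `LINK_SCORE` values, the `weight` parameter, and the linear blend of tf-idf with a link score are all assumptions made up for the example.

```python
import math
from collections import Counter

# Toy corpus: doc id -> text. All names and numbers here are made up.
DOCS = {
    "a": "cheap flights to paris",
    "b": "paris travel guide and paris hotels",
    "c": "python programming tutorial",
}
# Hypothetical precomputed, query-independent link scores (e.g. from PageRank).
LINK_SCORE = {"a": 0.2, "b": 0.5, "c": 0.3}


def tokenize(text):
    return text.lower().split()


def retrieve(query, docs):
    """Retrieval: keep only documents that contain at least one query term."""
    terms = set(tokenize(query))
    return [d for d, text in docs.items() if terms & set(tokenize(text))]


def tf_idf(query, doc_id, docs):
    """Sum of tf * idf over the query terms for one document."""
    tokens = tokenize(docs[doc_id])
    counts = Counter(tokens)
    n = len(docs)
    score = 0.0
    for term in set(tokenize(query)):
        df = sum(1 for text in docs.values() if term in tokenize(text))
        if df == 0:
            continue  # term appears nowhere in the corpus
        score += (counts[term] / len(tokens)) * math.log(n / df)
    return score


def rank(query, candidates, docs, link_score, weight=0.5):
    """Ranking: blend content relevance with the link score (assumed linear mix)."""
    scored = [
        (weight * tf_idf(query, d, docs) + (1 - weight) * link_score[d], d)
        for d in candidates
    ]
    return [d for _, d in sorted(scored, reverse=True)]


candidates = retrieve("paris hotels", DOCS)
results = rank("paris hotels", candidates, DOCS, LINK_SCORE)
```

Document "c" never reaches the ranking step because it shares no term with the query, while its link score plays no part in that decision. The link score only reorders the documents that retrieval let through.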

You can also use PageRank to decide whether to rank a document at all, but I believe Google's approach centers on assigning stronger or weaker PageRank scores (based on incoming and outgoing links and the strengths of those links) rather than on filtering.
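To make the "incoming and outgoing links" idea concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. The graph, the damping factor of 0.85, and the fixed iteration count are assumptions for the example, and the production algorithm is certainly more involved.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power iteration.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline from the "random jump".
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site.
GRAPH = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(GRAPH)
```

In this graph "home" ends up with the highest score because every other page links to it, and "orphan" ends up with the lowest because nothing links to it.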

In terms of answering the title question...
It's very complicated, and I don't work for them, so this is mostly just speculation, but I believe their system is built around a few fundamental concepts.

  1. Is the query correct?
    spell-checking, query-suggestion
  2. Is the content on this page relevant to the query?
    tf-idf and others**, phrase/proximity search
  3. Does this page have a high reputation?
    PageRank, feedback from Google's analytics
  4. Do the links to this page match the content in the query?
    link analysis
  5. Does this person (or people like them) want to see the content on this page?
    personalisation, localisation, etc
  6. Are there already too many results from the one website?
    diversification, deduplication
  7. What does the user mean by this query?
    relevance feedback, stemming, query expansion

I'm sure there are more, but that's just off the top of my head.
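As one small illustration from the list above, item 6 (diversification) can be approximated by capping the number of results per host. This is a guess at the general idea rather than a known implementation; the `diversify` function, the `max_per_site` parameter, and the URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def diversify(ranked_urls, max_per_site=2):
    """Keep the ranking order but allow at most max_per_site results per host."""
    seen = defaultdict(int)
    kept = []
    for url in ranked_urls:
        host = urlparse(url).netloc
        if seen[host] < max_per_site:
            kept.append(url)
            seen[host] += 1
    return kept


ranked = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
    "https://example.org/x",
]
top = diversify(ranked)
```

Here the third `example.com` result is dropped, which lets the `example.org` page move up into the visible results.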

** There are a lot of different methods that have been used for information retrieval. If you already know TF-IDF, BM25 would be a good one to look at next.
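For readers coming from tf-idf, a minimal sketch of Okapi BM25 scoring may help. It assumes whitespace tokenization, the commonly used defaults k1 = 1.5 and b = 0.75, and the "plus one inside the log" idf variant that keeps idf non-negative. The corpus is made up.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n  # average document length
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply today".split(),
]
scores = bm25_scores("cat mat".split(), corpus)
```

Compared with plain tf-idf, the two main differences are that repeated occurrences of a term give diminishing returns and that long documents are penalized relative to the average length.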

Note: If you have a different search context, these methods may not work very well. Some types of search are better suited to different models. For example, if your data is structured according to a schema, then your best bet is to use a database.