The basics of Search and TF-IDF vector

Using a search engine shields an engineer from understanding how search works. In this article, we will explore how it works.

What is search?

Search is showing documents that match a query. Document is anything that has information. Such as web page or PDF document. Query is what the user types in the search box.

There is lot of work done in the background. Consider Google. Google crawls all the web pages out there. It parses each web page and extracts keywords. Depending on the keywords, Google updates the index.

An example index looks like:

  • React: Doc1, Doc7, Doc9
  • jQuery: Doc2, Doc8
  • React native: Doc9
  • C#: Doc2, Doc3, Doc4
  • Python: Doc4, Doc5, Doc6
  • JavaScript: Doc7, Doc8, Doc9, Doc 10

Let’s say we have a page with a search box. The user types in jQuery.  We perform a reverse lookup. And return Doc2 and Doc8.  We return the document in the decreasing order of relevance. Each document has a score. The score indicates the relevance of the document to the query.

Search relevance

TF-IDF stands for Term Frequency – Inverse Document Frequency.

A term or token or word can occur multiple times within a document. Term frequency (TF) measures the frequency of the term within the document. If the term appears multiple times within a document, the term frequency is higher.

Inverse document frequency (IDF) measures the importance of the term. Consider the word, This. The word, This, occurs in almost all documents. However, a word like jQuery occurs in a fewer documents. Inverse document frequency measures the number of documents which contain the term. For example, the term, This, occurs in 100 documents. And the term, jQuery, occurs in 10 documents. Inverse Document Frequency is the inverse of the document frequency. The rarer the term, the higher the IDF, the more relevant are the documents containing the term.

Both term frequency and inverse document frequency is calculated at the time of indexing. The product is called TF-IDF. Every document has a TF-IDF vector for all the terms in the document. The higher the TF-IDF, the more relevant the document is.

 

Consider the above index as an example. The IDF for each term is shown below.

  1. React – 0.33
  2. jQuery – 0.5
  3. React native – 1
  4. C# – 0.33
  5. Python – 0.33
  6. JavaScript – 0.25

Consider the query, C# and jQuery.  Based on the reverse index lookup, Doc2, Doc3, Doc4 and Doc8 is returned. Assume that the terms occur approximately the same number of times within the document (~ TF). We will return the documents in the decreasing order of IDF.

  1. Doc2 (0.83)
  2. Doc8 (0.5)
  3. Doc3 (0.33)
  4. Doc4 (0.33)

I hope you found the article useful.

Related Posts

Leave a Reply

Your email address will not be published.