Search is a complex subject. Search engines simplify the solution, but they also shield an engineer from having to understand how search works. The objective of this article is to provide a simplified explanation of how search works.
What is search?
A query is a set of words or tokens used to retrieve relevant information. Information is retrieved from a set of documents. For example, if we use Google to search the world-wide web, every web page is a document. If we type “What is search?” in the search box, “What is search?” is the query.
When the search engine executes the query, it returns the relevant documents in order of decreasing score. The score indicates how relevant the document is to the query.
Central to the concept of search relevance is TF-IDF, which stands for Term Frequency-Inverse Document Frequency.
A term (also called a token or word) can occur multiple times within a document. Term frequency (TF) is proportional to the number of times the term occurs within the document. The higher the term frequency, the more relevant the document is to a query that contains the term.
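Term frequency can be illustrated with a few lines of Python. This is only a minimal sketch: the function name term_frequency, the whitespace tokenizer, and the choice to divide each count by the document length are assumptions made for illustration, and real search engines use more refined variants.

```python
from collections import Counter


def term_frequency(document: str) -> dict[str, float]:
    """Return each term's share of the tokens in one document."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokens = document.lower().split()
    counts = Counter(tokens)
    # Dividing by the document length is one common normalization.
    return {term: count / len(tokens) for term, count in counts.items()}


tf = term_frequency("search engines make search simple")
print(tf["search"])  # "search" accounts for 2 of the 5 tokens
```

Here "search" gets a higher term frequency than "engines" because it occurs twice in the same document.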
Inverse document frequency (IDF) measures the importance of a term across the document set. Some words, called stop words, appear in almost every document. In our query, “What is search?”, the word “is” appears in most web pages, so it is not very useful for telling documents apart. The word “search” appears in fewer web pages, so it is a more useful term in the query. The usefulness of a term in the document set is denoted by its inverse document frequency. If a specific word or token is rare in the document set, its inverse document frequency is higher, and the presence of the word in a document makes that document more relevant.
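The sketch below shows one way to compute inverse document frequency. It assumes the textbook formula log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name and the three tiny sample documents are hypothetical, and production engines typically use smoothed variants of this formula.

```python
import math


def inverse_document_frequency(documents: list[str]) -> dict[str, float]:
    """Return log(N / df) for every term found in the document set."""
    n = len(documents)
    doc_freq: dict[str, int] = {}
    for doc in documents:
        # set() ensures a term is counted once per document.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n / df) for term, df in doc_freq.items()}


docs = ["what is search", "search is hard", "this is a page"]
idf = inverse_document_frequency(docs)
for term in ("is", "search", "hard"):
    print(term, round(idf[term], 3))
```

With this formula, "is" occurs in every sample document and gets an IDF of zero, while "hard" occurs in only one document and gets the highest value of the three.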
Both term frequency and inverse document frequency are calculated when documents are indexed. The product of term frequency and inverse document frequency is called TF-IDF. Every document has a TF-IDF vector with one value for each term in the document.
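To make the idea of a per-document TF-IDF vector concrete, here is a small sketch that builds one vector per document at "index time". It assumes a whitespace tokenizer, length-normalized term frequency, and log(N / df) for inverse document frequency. The name tfidf_vectors and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["what is search", "search is hard", "this is a page"]
vectors = tfidf_vectors(docs)
print(vectors[1])  # weights for "search is hard"
```

Each vector is stored as a dictionary so that only the terms present in the document take up space.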
A document is more relevant to a query when the product of the TF-IDF values of the query and the document, summed over the query terms, is higher. This happens when query terms with high inverse document frequency (rarer terms) appear in the document with high frequency.
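The scoring step can be sketched as a sum of products between the query weights and the document weights. This is an illustration under simplifying assumptions, not how any particular search engine works: the function name rank_documents, the tokenizer, the weighting formulas, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def rank_documents(query: str, documents: list[str]) -> list[tuple[int, float]]:
    """Return (document index, score) pairs, highest score first."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / df) for term, df in doc_freq.items()}

    def weights(tokens: list[str]) -> dict[str, float]:
        counts = Counter(tokens)
        # Terms never seen in the document set get a weight of zero.
        return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in counts.items()}

    query_vec = weights(query.lower().split())
    scored = []
    for index, tokens in enumerate(tokenized):
        doc_vec = weights(tokens)
        # Sum of products over the terms shared by query and document.
        score = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        scored.append((index, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [
    "search engines rank web pages",
    "search is about ranking and search is about relevance",
    "this page is about cats",
    "the weather is nice today",
]
ranking = rank_documents("what is search", docs)
for index, score in ranking:
    print(round(score, 4), docs[index])
```

In this sample, the second document ranks first because the rarer term "search" occurs in it twice. The two documents that match only on the common word "is" receive much lower scores.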
TF-IDF vectors can also be used to check whether two documents are related. Two documents are related if their TF-IDF vectors are similar.
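The article does not say how similarity between two vectors is measured. A common choice is cosine similarity, which is assumed in the sketch below. The function name and the hand-written vectors are hypothetical and only serve to show the calculation.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hand-written example vectors, not computed from real documents.
doc_a = {"search": 0.5, "ranking": 0.3}
doc_b = {"search": 0.4, "ranking": 0.2, "index": 0.1}
doc_c = {"cats": 0.6, "weather": 0.2}

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(round(sim_ab, 3), sim_ac)
```

Documents that share heavily weighted terms score close to 1, and documents with no terms in common score 0.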
To learn more about search, see the video lectures of a course on Coursera: Week 7: Information retrieval and Ranked Information retrieval.