In this post, I will share my experience of building a “Wikipedia Search Engine from scratch”, a mini-project for the Information Retrieval and Extraction course at IIIT-H.
First, I should mention that it is a very basic search engine, so the architecture is easy to understand and a working system like this is easy to set up. By basic I mean that no semantic or meaning-based search is involved: it works purely by exact word matching.
Fig. 1 above shows the architecture of the system. It consists of two parts:
- Offline task: All the processing in this part is done offline (beforehand), and its output is used by the online part. The offline part consists of two main tasks:
- Crawling: The webpages are crawled from the web. Since I had already downloaded the 46GB wiki dump, I have left the crawling phase out of the architecture and out of the rest of this post. We will assume all the webpages are already available.
- Indexing: In this part all the documents are read and an index is created and kept on disk. I will explain how indexing is done in detail in the next post. For a basic understanding, an index here is just like the index we see in a book. The locations of the webpages are stored in the index in such a way that, given a query term q, we can retrieve all the documents containing q as quickly as possible.
- Online task: In this part all the processing is done in real time. The processing happens step by step as follows:
- The user gives some query in natural language using a user interface.
- Query Parser: The query parser parses and processes the query. In our case the processing includes converting the query to lowercase, removing stopwords, and stemming. The engine then uses the resulting query tokens to look up pages in the index that was created and stored on disk in the offline task.
- Ranker: After getting the list of documents containing the query terms, the engine needs to rank them so that the most relevant page appears as the top result. There are many ranking techniques available, and research on new ones is ongoing. I have used one popular technique, Okapi BM25, to rank the web pages in our engine.
- The Ranker ranks the documents and gives the page titles of the ranked documents to the user interface as query results.
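
To make the offline and online parts concrete, here is a minimal in-memory sketch of the whole flow in Python. It is not the code of the actual engine: the function names (`tokenize`, `build_index`, `bm25_search`), the tiny stopword list, the crude suffix-stripping `stem`, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for illustration. The real system keeps its index on disk and would use a proper stemmer. The IDF term uses the common "+1 inside the log" variant of BM25 so that scores stay non-negative.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption; a real engine uses a fuller one).
STOPWORDS = {"a", "an", "the", "is", "of", "in", "and", "to", "on"}


def stem(token):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def build_index(docs):
    """Offline task: map each term to {title: term frequency in that page}."""
    index = defaultdict(dict)
    doc_len = {}
    for title, text in docs.items():
        tokens = tokenize(text)
        doc_len[title] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][title] = tf
    return index, doc_len


def bm25_search(query, index, doc_len, k1=1.2, b=0.75):
    """Online task: parse the query, look up postings, rank with Okapi BM25."""
    if not doc_len:
        return []
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any page
        df = len(postings)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        for title, tf in postings.items():
            # Length normalisation: longer pages need more hits to score as well.
            norm = k1 * (1 - b + b * doc_len[title] / avg_len)
            scores[title] += idf * tf * (k1 + 1) / (tf + norm)
    # Return page titles, best match first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy corpus keyed by page title (made-up text, not taken from the wiki dump).
docs = {
    "Cat": "The cat is a small domesticated animal",
    "Dog": "The dog is a domesticated animal and a loyal companion",
    "Search engine": "A search engine retrieves documents matching a query",
}
index, doc_len = build_index(docs)
print(bm25_search("search engines", index, doc_len))
print(bm25_search("domesticated animals", index, doc_len))
```

Because the query goes through the same `tokenize` step as the pages, "engines" in the query still matches "engine" in the page. In the second query both "Cat" and "Dog" contain the query terms equally often, so the length normalisation in BM25 decides the order and the shorter page ranks first.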