
This repo is a simple study of Information Retrieval (IR): a small search engine tool implemented in Python 3. It uses the Cranfield experiments dataset to evaluate the retrieved results.



git clone
cd ./simple-search-engine
pip install -r requirements.txt
# copy the files to site-packages, or create your own files in the same directory


Using the Indexer

Creating the Indexer

from indexer import Indexer
indexer = Indexer('/path/to/cranfield_data.json')

The Cranfield collection must first be converted to a JSON file.

Creating schema file


Printing statistics of the indexed documents


This prints information such as:

Number of tokens: 140979
The number of unique words: 4500
The number of words that occur only once: 1488
The 10 most frequent words: flow, pressur, number, boundari ...
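As a rough illustration of where such figures come from, the sketch below computes the same four statistics from a flat list of already-processed tokens. The function name `token_statistics` and its return format are hypothetical and are not part of the Indexer's API.

```python
from collections import Counter


def token_statistics(tokens, top_n=10):
    """Summarise a flat list of already-processed tokens (hypothetical helper)."""
    counts = Counter(tokens)
    return {
        "num_tokens": len(tokens),                       # total tokens
        "num_unique": len(counts),                       # distinct words
        "num_singletons": sum(1 for c in counts.values() if c == 1),
        "most_frequent": [term for term, _ in counts.most_common(top_n)],
    }


sample_tokens = ["flow", "pressur", "flow", "number", "flow", "pressur"]
stats = token_statistics(sample_tokens, top_n=2)
print(stats)
```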

Creating and using the Retriever

retriever = Retriever("schema.json")

Once the Retriever is created, we can use it to search the documents (make queries).

Make a query

This shows how to get the top 100 documents for a query.

docs, = retriever.query("how to hack NASA using HTML",100)

Getting benchmarks

docs, = retriever.query("how to hack NASA using HTML", 100,
                        get_bench=True, relevance_docs=[<doc IDs>])

relevance_docs takes a list of the IDs of the relevant documents, which is used to evaluate the Accuracy, F1, Precision and Recall of the retrieved results.
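For reference, these four measures can be computed from the retrieved IDs and the relevant IDs as in the minimal sketch below. The function `evaluate` and its `collection_size` parameter are assumptions made for illustration; the Retriever may compute its benchmarks differently.

```python
def evaluate(retrieved, relevant, collection_size):
    """Set-based retrieval metrics (hypothetical helper, not the Retriever's API)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)            # relevant and retrieved
    fp = len(retrieved - relevant)            # retrieved but not relevant
    fn = len(relevant - retrieved)            # relevant but missed
    tn = collection_size - tp - fp - fn       # correctly left out

    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / collection_size if collection_size else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


metrics = evaluate(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5], collection_size=10)
print(metrics)
```

Accuracy needs the size of the whole collection, because it also counts the documents that were correctly not retrieved.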

How it works


raise NotImplementedError # :(

This project currently uses the Cranfield dataset instead of crawling the web.


This module is used to create the index schema.


This includes cleaning the documents and tokenizing them as follows:

  • stemming all words using PorterStemmer
  • removing stop words using nltk.corpus.stopwords
  • removing words with fewer than 2 letters
  • tokenizing words using regex \w+
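The steps above can be sketched roughly as follows. This is only an illustration: the order of the steps, the lowercasing, the tiny `DEMO_STOP_WORDS` set and the pluggable `stem` argument are assumptions made to keep the example self-contained, and the function name `preprocess` is hypothetical.

```python
import re

# Tiny stand-in stop-word set so the example runs without downloading NLTK data.
DEMO_STOP_WORDS = frozenset({"the", "of", "a", "in", "is", "and", "to", "on"})


def preprocess(text, stop_words=DEMO_STOP_WORDS, stem=lambda word: word, min_len=2):
    """Tokenize, drop stop words and short words, then stem (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())           # tokenize with \w+
    tokens = [t for t in tokens if t not in stop_words]  # remove stop words
    tokens = [t for t in tokens if len(t) >= min_len]    # remove 1-letter words
    return [stem(t) for t in tokens]                     # stem what is left


print(preprocess("The flow of a gas in x boundary layers"))
# → ['flow', 'gas', 'boundary', 'layers']
```

To get closer to the list above, pass `stem=PorterStemmer().stem` (from `nltk.stem`) and `stop_words=set(stopwords.words("english"))` (from `nltk.corpus`), which requires the NLTK stop-word corpus to be downloaded.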


The schema created by the Indexer has the following structure

        "doc2" :<tf_value>,

As you can see, the structure is kept very compact to favour space efficiency over readability.

Note: "idf" is not allowed as a document id, as it is used to indicate the term's idf.
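To make the note concrete, here is a hedged sketch of how such a schema could be built. It assumes that each term maps to a dictionary holding the per-document tf values plus one reserved "idf" key, and that idf is computed as log10(N / df). Both are assumptions inferred from the note above, and `build_schema` is a hypothetical name, not the Indexer's actual method.

```python
import math
from collections import Counter


def build_schema(docs):
    """docs: mapping of document id -> list of processed tokens (assumed input)."""
    if "idf" in docs:
        raise ValueError('"idf" is reserved and cannot be used as a document id')

    schema = {}
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            schema.setdefault(term, {})[doc_id] = tf     # raw term frequency

    n_docs = len(docs)
    for postings in schema.values():
        df = len(postings)                               # docs containing the term
        postings["idf"] = math.log10(n_docs / df)        # assumed idf formula
    return schema


schema = build_schema({
    "doc1": ["flow", "pressur", "flow"],
    "doc2": ["flow", "boundari"],
})
print(schema["flow"])
```

Storing "idf" next to the document ids keeps one flat dictionary per term, which is why a document cannot itself be called "idf".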


This module is used to retrieve and search documents from the schema.

create vectors

When a query is made, the Retriever creates a normalized tf-idf weight vector for both the query tokens and all documents. Retriever uses the SMART notation ltc.lnc.

  • ltc: l (log) term frequency (tf), t (idf) document frequency (df), c (cosine) normalization
  • lnc: l (log) term frequency (tf), n (none) document frequency (df), c (cosine) normalization
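The two weighting schemes above differ only in whether the idf factor is applied. A minimal sketch, assuming base-10 logarithms and sparse vectors stored as dictionaries (`weight_vector` is a hypothetical helper, not the Retriever's own code):

```python
import math


def weight_vector(term_counts, idf=None):
    """Log-tf weights, optionally multiplied by idf, then cosine-normalised.

    With `idf` given this follows the ltc scheme; without it, lnc.
    """
    weights = {}
    for term, tf in term_counts.items():
        if tf <= 0:
            continue
        w = 1 + math.log10(tf)                # l: logarithmic term frequency
        if idf is not None:
            w *= idf.get(term, 0.0)           # t: idf (skipped for n: none)
        weights[term] = w

    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {term: 0.0 for term in weights}
    return {term: w / norm for term, w in weights.items()}  # c: cosine


counts = {"flow": 10, "pressur": 1}
idf = {"flow": 1.0, "pressur": 2.0}
ltc = weight_vector(counts, idf)
lnc = weight_vector(counts)
print(ltc, lnc)
```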

cosine similarity

To measure how relevant a document is to the query, we use the cosine similarity between the query vector and the document vector. Since both vectors are already cosine-normalized, this amounts to their dot product.


        doc_scores = {}
        for doc in doc_vectors:
            doc_scores[doc] = \
                self.__cos_similarity(query_vector, doc_vectors[doc])

        # sort docs by their scores (ascending)
        sorted_docs = sorted(doc_scores.items(), key=operator.itemgetter(1))
        # get only top `k` docs ids
        top_docs = [i[0] for i in sorted_docs[-k:][::-1]]

The cosine similarities of all documents are stored in the doc_scores dictionary, with the document id as the key. The dictionary items are then sorted by their values (scores), and the top k document ids are returned.
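For readers who want to try the ranking step in isolation, here is a self-contained sketch. It is not the Retriever's implementation: `cos_similarity` and `top_k` are hypothetical stand-ins, vectors are assumed to be sparse dictionaries, and `heapq.nlargest` is swapped in for the full sort as one way to avoid ordering every document when only k are needed.

```python
import heapq
import math


def cos_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vector, doc_vectors, k):
    """Return the ids of the k highest-scoring documents, best first."""
    scores = {doc: cos_similarity(query_vector, vec)
              for doc, vec in doc_vectors.items()}
    return heapq.nlargest(k, scores, key=scores.get)


doc_vectors = {
    "d1": {"flow": 1.0},
    "d2": {"flow": 1.0, "pressur": 1.0},
    "d3": {"pressur": 1.0},
}
print(top_k({"flow": 1.0}, doc_vectors, k=2))
# → ['d1', 'd2']
```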


  • tqdm for the progress bar
  • Dr. Malak Abdullah for the course project
  • Cranfield University for the dataset
  • me @_@ as the creator of the repo