This repo is a simple study of Information Retrieval (IR): a small search engine implemented in Python 3. It uses the Cranfield experiments dataset to evaluate the retrieved results.
git clone https://github.com/eLMoMaNi/simple-search-engine
cd ./simple-search-engine
pip install -r requirements.txt
# copy the files to site-packages, or create your own files in the same directory
from indexer import Indexer
indexer = Indexer('/path/to/cranfield_data.json')
The Cranfield collection must first be converted to a JSON file.
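If you are starting from the raw Cranfield distribution (`cran.all.1400`, which uses `.I/.T/.A/.B/.W` field markers), a minimal conversion sketch could look like this; the exact JSON layout expected by `Indexer` is an assumption here:

```python
import json

def cranfield_to_json(src_path, dst_path):
    """Convert the raw Cranfield file (.I/.T/.A/.B/.W markers) to a JSON list."""
    docs, current, field = [], None, None
    with open(src_path) as f:
        for line in f:
            if line.startswith(".I"):
                if current:
                    docs.append(current)
                # assumed document layout: id plus free-text fields
                current = {"id": line.split()[1], "title": "", "author": "",
                           "bibliography": "", "body": ""}
                field = None
            elif line.startswith(".T"):
                field = "title"
            elif line.startswith(".A"):
                field = "author"
            elif line.startswith(".B"):
                field = "bibliography"
            elif line.startswith(".W"):
                field = "body"
            elif current is not None and field:
                current[field] += line
    if current:
        docs.append(current)
    with open(dst_path, "w") as f:
        json.dump(docs, f)

# cranfield_to_json("cran.all.1400", "cranfield_data.json")
```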
indexer.create_schema_file("schema.json")
indexer.print()
This will print some statistics, for example:
Number of tokens: 140979
The number of unique words: 4500
The number of words that occur only once: 1488
The 10 most frequent words: flow, pressur, number, boundari ...
retriever = Retriever("schema.json")
Once the Retriever is created, we can use it to search documents (make queries).
This shows how to get the top 100 documents for a query:
docs, = retriever.query("how to hack NASA using HTML", 100)
docs, = retriever.query("how to hack NASA using HTML", 100,
                        get_bench=True, relevance_docs=[<doc IDs>])
relevance_docs takes a list of IDs of the documents relevant to the query, used to evaluate the Accuracy, F1, Precision, and Recall.
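For reference, these metrics can be computed from the retrieved and relevant document IDs as sketched below; this is illustrative rather than the Retriever's actual code, and `total_docs` (the collection size, needed for Accuracy) is a hypothetical parameter:

```python
def evaluate(retrieved, relevant, total_docs):
    """Compute Accuracy, Precision, Recall, and F1 from two collections of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)          # relevant docs that were retrieved
    fp = len(retrieved - relevant)          # retrieved but not relevant
    fn = len(relevant - retrieved)          # relevant but missed
    tn = total_docs - tp - fp - fn          # correctly ignored docs
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / total_docs
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```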
raise NotImplementedError # :(
This project currently uses the Cranfield dataset instead of crawling the web.
This module is used to create the index schema.
This includes cleaning the documents and tokenizing them as follows (a sketch is shown after the list):
- stemming all words using `PorterStemmer`
- removing stop words using `nltk.corpus.stopwords`
- removing words with less than 2 letters
- tokenizing words using the regex `\w+`
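A minimal sketch of this preprocessing, assuming the NLTK stop-word corpus is installed (the `preprocess` name is illustrative; the actual Indexer method may differ):

```python
import re
from nltk.corpus import stopwords     # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Tokenize with \\w+, drop stop words and 1-letter words, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [STEMMER.stem(t) for t in tokens
            if t not in STOP_WORDS and len(t) >= 2]
```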
The schema created by the Indexer has the following structure:
{
    "token1":
    {
        "doc1": <tf_value>,
        "doc2": <tf_value>,
        "doc3": <tf_value>,
        "idf": <idf_value>
    },
    "token2": {...}...
}
You can see that the structure is quite compact, favouring space efficiency over readability.
Note: `"idf"` is not allowed as a document ID, as it is used to indicate the term's idf.
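A hedged sketch of how such a schema could be built from preprocessed documents (it reuses the `preprocess` helper sketched above; storing raw term counts as tf and log10(N/df) as idf are assumptions about the actual implementation):

```python
import math

def build_schema(docs):
    """docs: {doc_id: text}. Returns {token: {doc_id: tf, ..., "idf": idf}}."""
    schema = {}
    for doc_id, text in docs.items():
        for token in preprocess(text):
            postings = schema.setdefault(token, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1   # raw term frequency
    n_docs = len(docs)
    for postings in schema.values():
        df = len(postings)                        # docs containing the token
        postings["idf"] = math.log10(n_docs / df)
    return schema
```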
This module is used to retrieve and search documents using the schema.
When a query is made, the Retriever creates a normalized tf-idf
weight vector for both the query and every document. The Retriever uses the SMART notation lnc.ltc (documents weighted lnc, queries ltc), summarized in the table below:
| | Term Frequency (tf) | Document Frequency (df) | Normalization |
|---|---|---|---|
| Query | l (log) | t (idf) | c (cosine) |
| Document | l (log) | n (none) | c (cosine) |
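A sketch of these two weighting schemes over the schema's `{token: tf}` maps (illustrative; the actual Retriever methods are private and may differ):

```python
import math

def _cosine_normalize(weights):
    """Divide every weight by the vector's Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

def lnc_weights(doc_tfs):
    """Document side (lnc): logarithmic tf, no idf, cosine normalization."""
    return _cosine_normalize({t: 1 + math.log10(tf)
                              for t, tf in doc_tfs.items() if tf > 0})

def ltc_weights(query_tfs, idf):
    """Query side (ltc): logarithmic tf, times idf, cosine normalization."""
    return _cosine_normalize({t: (1 + math.log10(tf)) * idf.get(t, 0.0)
                              for t, tf in query_tfs.items() if tf > 0})
```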
To calculate how relevant a document is to the query,
we use the cosine similarity (dot product) between the query vector and the document vector.
doc_scores = {}
for doc in doc_vectors:
    doc_scores[doc] = \
        self.__cos_similarity(query_vector, doc_vectors[doc])
# sort docs by their scores
sorted_docs = sorted(doc_scores.items(), key=operator.itemgetter(1))
# get only top `k` docs ids
top_docs = [i[0] for i in sorted_docs[-k:][::-1]]
Cosine similarities for all documents are stored in the `doc_scores`
dictionary, keyed by document ID.
The dictionary is then sorted by its values (scores), and the top `k`
documents are returned.
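Because both vectors are already cosine-normalized, the similarity reduces to a dot product over the tokens they share; a sketch of such a helper (the real `__cos_similarity` is private to the Retriever, so this is illustrative):

```python
def cos_similarity(query_vector, doc_vector):
    """Dot product of two sparse {token: weight} vectors."""
    # iterate over the smaller vector and look up matches in the larger one
    shorter, longer = sorted((query_vector, doc_vector), key=len)
    return sum(w * longer.get(t, 0.0) for t, w in shorter.items())
```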
- tqdm for the progress bar
- Dr. Malak Abdullah for the course project
- Cranfield University for the dataset
- me @_@ as the creator of the repo