lal-s/text-retrieval

text-retrieval

This is a simple text retrieval system that parses Wikipedia documents and builds an inverted index using the Lucene library.
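To illustrate what an inverted index holds, here is a minimal sketch in plain Python. It is not the Lucene implementation used by this project: the `tokenize` and `build_inverted_index` helpers and the two toy documents are hypothetical, and the whitespace tokenizer is a deliberate simplification of what a Lucene analyzer does.

```python
from collections import defaultdict


def tokenize(text):
    # Naive lowercase/whitespace tokenizer (assumption: no stemming or stop words).
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


# Toy documents standing in for parsed Wikipedia pages (hypothetical data).
docs = {
    "Paris": "paris is the capital of france",
    "Berlin": "berlin is the capital of germany",
}

index = build_inverted_index(docs)
print(sorted(index["capital"]))  # document ids that contain the term "capital"
```

Looking up a query term in `index` returns only the documents that contain it, so a search does not have to scan every document.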

The aim is to reconstruct parts of IBM's Watson, a question-answering system developed for the quiz show 'Jeopardy'.

The system takes a query as input and processes it using Lucene's Query Parser. The scoring algorithm ranks documents using tf-idf term weighting, where:

tf(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
idf(t) = log_e(Total number of documents / Number of documents with term t in it)
score(t) = tf(t) * idf(t)
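The formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: it follows the formulas as written here rather than Lucene's internal similarity implementation, the function names and toy corpus are hypothetical, and summing the per-term scores over all query terms is an assumption, since the formulas only define the score for a single term.

```python
import math


def tf(term, doc_terms):
    # Term frequency: occurrences of the term divided by the document length.
    return doc_terms.count(term) / len(doc_terms)


def idf(term, corpus):
    # Inverse document frequency with a natural logarithm.
    n_with_term = sum(1 for terms in corpus.values() if term in terms)
    if n_with_term == 0:
        return 0.0  # assumption: a term that appears nowhere contributes nothing
    return math.log(len(corpus) / n_with_term)


def score(query_terms, doc_terms, corpus):
    # Assumption: the document score is the sum of tf * idf over the query terms.
    return sum(tf(t, doc_terms) * idf(t, corpus) for t in query_terms)


def top_k(query, corpus, k=10):
    # Rank all documents by score and keep the k best.
    query_terms = query.lower().split()
    ranked = sorted(
        corpus,
        key=lambda doc_id: score(query_terms, corpus[doc_id], corpus),
        reverse=True,
    )
    return ranked[:k]


# Toy corpus of pre-tokenized documents (hypothetical data).
corpus = {
    "Paris": "paris is the capital of france".split(),
    "Berlin": "berlin is the capital of germany".split(),
    "Seine": "the seine flows through paris".split(),
}

print(top_k("capital france", corpus, k=2))
```

In this toy corpus the term "the" occurs in every document, so its idf is log_e(3/3) = 0 and it has no effect on the ranking, while the rarer term "france" carries the most weight.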

The system then returns the top 10 matching documents. On the evaluated data set, the system achieves:

precision@1 = 62%
precision@10 = 79%
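The exact definition of precision@k is not given here. One reading that fits a setting with a single correct answer per question is the fraction of queries whose correct document appears among the top k results. The sketch below implements that reading as an assumption; the function name `hit_rate_at_k` and the sample data are hypothetical and are not the project's evaluation code or results.

```python
def hit_rate_at_k(ranked_results, gold_answers, k):
    """Fraction of queries whose gold answer appears in the top k results.

    Assumption: each query has exactly one correct document title.
    """
    hits = sum(
        1
        for ranked, gold in zip(ranked_results, gold_answers)
        if gold in ranked[:k]
    )
    return hits / len(gold_answers)


# Hypothetical ranked result lists for three queries, best match first.
ranked_results = [
    ["Paris", "Lyon"],
    ["Rome", "Berlin"],
    ["Oslo", "Bergen"],
]
gold_answers = ["Paris", "Berlin", "Madrid"]

print(hit_rate_at_k(ranked_results, gold_answers, 1))
print(hit_rate_at_k(ranked_results, gold_answers, 2))
```

Under this reading the value can only stay the same or grow as k increases, which is consistent with the figure at k = 10 being higher than the figure at k = 1.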
