This paper considers the problem of answering factoid questions in an open-domain setting using Wikipedia as the unique knowledge source. Having a single knowledge source forces the model to be very precise while searching for an answer.
In order to answer any question, the system must first retrieve the relevant articles and then scan them to identify the answer.
The Document Retriever is an efficient (non-machine-learning) retrieval system that first narrows the search space and focuses on relevant articles. It uses a simple inverted index lookup followed by term-vector-model scoring.
Articles and questions are compared as TF-IDF (term frequency-inverse document frequency) weighted bag-of-words vectors. Retrieval is further improved by taking local word order into account with n-gram features (bigrams work best).
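As an illustration, here is a minimal sketch of this kind of retrieval using scikit-learn's TF-IDF vectorizer with unigram + bigram features. The actual Document Retriever uses its own hashed-bigram, sparse-matrix implementation, so treat this only as a toy version of the idea (the corpus and question below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy "Wikipedia" corpus; in the paper each document is a full article.
articles = [
    "Warsaw is the capital and largest city of Poland.",
    "The mitochondrion is the powerhouse of the cell.",
    "Python is a widely used high-level programming language.",
]

# TF-IDF weighted bag-of-words vectors with bigram features (ngram_range=(1, 2)).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vectorizer.fit_transform(articles)

def retrieve(question, k=2):
    """Return the indices of the k articles closest to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    return scores.argsort()[::-1][:k]

print(retrieve("What is the capital of Poland?"))  # article 0 ranks first
```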
Given a question $q$ consisting of $l$ tokens and a document of $n$ paragraphs, where a single paragraph $p$ consists of $m$ tokens, an RNN model is developed which is applied to each paragraph and whose outputs are finally aggregated to predict the answer.
The tokens in a paragraph are represented as a sequence of feature vectors $\tilde{\mathbf{p}}_i \in \mathbb{R}^d$, which are passed as input to a multi-layer bidirectional long short-term memory (LSTM) network: $\{\mathbf{p}_1, \dots, \mathbf{p}_m\} = \text{RNN}(\{\tilde{\mathbf{p}}_1, \dots, \tilde{\mathbf{p}}_m\})$, taking $\mathbf{p}_i$ as the concatenation of each layer's hidden units in the end.
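A minimal PyTorch sketch of this paragraph encoding (PyTorch, the class name, and the sizes are assumptions for illustration; the paper reports 3-layer BiLSTMs with 128 hidden units):

```python
import torch
import torch.nn as nn

class StackedBiLSTMEncoder(nn.Module):
    """Multi-layer BiLSTM whose per-token output is the concatenation of the
    hidden units of every layer (both directions)."""

    def __init__(self, input_dim, hidden_size=128, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            in_dim = input_dim if i == 0 else 2 * hidden_size
            self.layers.append(
                nn.LSTM(in_dim, hidden_size, bidirectional=True, batch_first=True)
            )

    def forward(self, x):                      # x: (batch, seq_len, input_dim)
        outputs = []
        for lstm in self.layers:
            x, _ = lstm(x)                     # (batch, seq_len, 2 * hidden_size)
            outputs.append(x)
        return torch.cat(outputs, dim=-1)      # (batch, seq_len, num_layers * 2 * hidden_size)

# Example: encode 2 paragraphs of 30 tokens with 300-d feature vectors.
encoder = StackedBiLSTMEncoder(input_dim=300)
p = encoder(torch.randn(2, 30, 300))
print(p.shape)  # torch.Size([2, 30, 768])
```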
The feature vector $\tilde{\mathbf{p}}_i$ comprises:
- Word Embeddings: $f_{emb}(p_i) = \mathbf{E}(p_i)$, using the 300-dimensional GloVe embeddings. Most embeddings are kept fixed; only the 1000 most frequent question words (e.g. what, how, which) are fine-tuned, as such key words can be crucial to QA systems.
- Exact Match: three simple binary features indicating whether $p_i$ can be exactly matched to one question word in $q$, in its original, lowercase, or lemma form.
- Token Features: manual features reflecting properties of the token, namely its part-of-speech (POS) tag, named-entity-recognition (NER) tag, and (normalized) term frequency (TF).
- Aligned Question Embedding: $f_{align}(p_i) = \sum_j a_{i,j} \mathbf{E}(q_j)$, where the attention score $a_{i,j}$ captures the similarity between $p_i$ and each question word $q_j$, computed from dot products of nonlinear mappings of the word embeddings (see the sketch after this list).
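A minimal sketch of the aligned question embedding, assuming the nonlinear mapping is a single dense layer with ReLU and the dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedQuestionEmbedding(nn.Module):
    """f_align(p_i) = sum_j a_ij * E(q_j), with a_ij a softmax over
    dot products of nonlinear mappings of the word embeddings."""

    def __init__(self, embed_dim=300, proj_dim=128):
        super().__init__()
        self.alpha = nn.Sequential(nn.Linear(embed_dim, proj_dim), nn.ReLU())

    def forward(self, p_emb, q_emb):
        # p_emb: (batch, m, embed_dim) paragraph word embeddings E(p_i)
        # q_emb: (batch, l, embed_dim) question word embeddings  E(q_j)
        scores = self.alpha(p_emb) @ self.alpha(q_emb).transpose(1, 2)  # (batch, m, l)
        a = F.softmax(scores, dim=-1)   # attention over question words
        return a @ q_emb                # (batch, m, embed_dim)

# Example: a paragraph of 20 tokens, a question of 8 tokens, 300-d embeddings.
f_align = AlignedQuestionEmbedding()
out = f_align(torch.randn(1, 20, 300), torch.randn(1, 8, 300))
print(out.shape)  # torch.Size([1, 20, 300])
```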
Another RNN is applied on the word embeddings of the question tokens, and the resulting hidden units $\{\mathbf{q}_1, \dots, \mathbf{q}_l\}$ are combined into one single vector $\mathbf{q} = \sum_j b_j \mathbf{q}_j$, where $b_j = \frac{\exp(\mathbf{w} \cdot \mathbf{q}_j)}{\sum_{j'} \exp(\mathbf{w} \cdot \mathbf{q}_{j'})}$ encodes the importance of each question word ($\mathbf{w}$ is a learned weight vector).
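A minimal sketch of this question pooling, assuming the importance weights $b_j$ come from a softmax over dot products of a learned vector $\mathbf{w}$ with each hidden state (hidden size is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentivePooling(nn.Module):
    """Collapse question hidden states {q_1..q_l} into a single vector
    q = sum_j b_j q_j, with b_j = softmax_j(w . q_j)."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.w = nn.Linear(hidden_dim, 1, bias=False)   # learned weight vector w

    def forward(self, q_hidden):                         # (batch, l, hidden_dim)
        b = F.softmax(self.w(q_hidden).squeeze(-1), dim=-1)   # (batch, l)
        return torch.einsum("bl,bld->bd", b, q_hidden)        # (batch, hidden_dim)

pool = SelfAttentivePooling()
q = pool(torch.randn(2, 8, 768))
print(q.shape)  # torch.Size([2, 768])
```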
At the paragraph level, the goal is to predict the span of tokens that is most likely the correct answer.
Two classifiers are trained independently over the paragraph vectors $\{\mathbf{p}_1, \dots, \mathbf{p}_m\}$ and the question vector $\mathbf{q}$ to predict the two ends of the span.
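A minimal sketch of the two span classifiers as bilinear scorers, $P_{start}(i) \propto \exp(\mathbf{p}_i \mathbf{W}_s \mathbf{q})$ and $P_{end}(i) \propto \exp(\mathbf{p}_i \mathbf{W}_e \mathbf{q})$; training, masking, and batching details are omitted and the dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanPredictor(nn.Module):
    """Two independent bilinear classifiers:
    P_start(i) ~ exp(p_i W_s q),  P_end(i) ~ exp(p_i W_e q)."""

    def __init__(self, p_dim=768, q_dim=768):
        super().__init__()
        self.W_s = nn.Linear(q_dim, p_dim, bias=False)   # start-of-span weights
        self.W_e = nn.Linear(q_dim, p_dim, bias=False)   # end-of-span weights

    def forward(self, p_vecs, q_vec):
        # p_vecs: (batch, m, p_dim) paragraph token vectors {p_1..p_m}
        # q_vec:  (batch, q_dim)    pooled question vector q
        start_scores = torch.bmm(p_vecs, self.W_s(q_vec).unsqueeze(-1)).squeeze(-1)
        end_scores = torch.bmm(p_vecs, self.W_e(q_vec).unsqueeze(-1)).squeeze(-1)
        return F.softmax(start_scores, dim=-1), F.softmax(end_scores, dim=-1)

predictor = SpanPredictor()
p_start, p_end = predictor(torch.randn(2, 30, 768), torch.randn(2, 768))
print(p_start.shape, p_end.shape)  # torch.Size([2, 30]) torch.Size([2, 30])
```

During prediction, the best span is the pair $(i, i')$ with $i \le i' \le i + 15$ that maximizes $P_{start}(i) \times P_{end}(i')$.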
- Wikipedia (Knowledge Source) - Uses the 2016-12-21 dump of English Wikipedia as the knowledge source.
- SQuAD (The Stanford Question Answering Dataset) - Uses SQuAD for training and evaluating the Document Reader.