# NLP Assignment 1

Author: Davide Sut

ID: VR505441

## The Assignment

Five different Wikipedia pages were chosen for each of the two categories: Geographic and Non-Geographic. The text of each page was extracted using the Wikipedia API and can be saved to a .txt file.
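
A minimal sketch of this extraction step, assuming the `wikipedia` Python package (the original code may use a different client); the page titles are placeholders:

```python
# Sketch of the extraction step (assumes the `wikipedia` package; titles are examples).
import wikipedia

geo_pages = ["Italy", "Danube", "Alps", "Sahara", "Lake Garda"]             # hypothetical titles
non_geo_pages = ["Algorithm", "Mozart", "Photosynthesis", "Chess", "Jazz"]  # hypothetical titles

for title in geo_pages + non_geo_pages:
    text = wikipedia.page(title).content                  # plain-text article body
    with open(f"{title}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```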

The texts are preprocessed with the following steps, applied in order (a sketch of the full pipeline is shown after the list):

- A regular expression to remove all non-alphabetic characters;
- A function to convert each word to lowercase;
- Tokenization using nltk `word_tokenize`;
- Stopword removal using the nltk English stopwords corpus;
- Stemming using Porter's algorithm;
- Spell correction using the autocorrect library.
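
A minimal sketch of the pipeline above; the function name `preprocess` and the exact regular expression are illustrative, and the `Speller` interface is the one exposed by recent versions of autocorrect:

```python
# Sketch of the preprocessing pipeline (names and regex are illustrative).
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from autocorrect import Speller

# nltk.download("punkt"); nltk.download("stopwords")  # may be needed once
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
spell = Speller(lang="en")

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # keep only alphabetic characters
    text = text.lower()                                   # lowercase every word
    tokens = word_tokenize(text)                          # nltk tokenization
    tokens = [t for t in tokens if t not in stop_words]   # drop English stopwords
    tokens = [stemmer.stem(t) for t in tokens]            # Porter stemming
    tokens = [spell(t) for t in tokens]                   # autocorrect spell correction
    return tokens
```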

After the preprocessing step, a bag-of-words (BoW) is created for each text and can be saved to a .txt file. Finally, a label is attached to each BoW and a training set is assembled.
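
A minimal sketch of how the labelled training set could be assembled; `geo_texts`, `non_geo_texts`, and the label strings are hypothetical names, and word-frequency counts are just one possible BoW representation:

```python
# Sketch of building a labelled training set of bag-of-words features.
from collections import Counter

def bag_of_words(tokens):
    return dict(Counter(tokens))          # word -> frequency in the document

training_set = []
for tokens in geo_texts:                  # preprocessed Geographic pages
    training_set.append((bag_of_words(tokens), "geographic"))
for tokens in non_geo_texts:              # preprocessed Non-Geographic pages
    training_set.append((bag_of_words(tokens), "non-geographic"))
```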

The classifier is built and trained using the nltk NaiveBayesClassifier.
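
A minimal sketch of the training step, reusing the `training_set` of (features, label) pairs sketched above:

```python
# Sketch of training nltk's Naive Bayes classifier on (features, label) pairs.
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(training_set)
classifier.show_most_informative_features(10)   # optional: inspect the learned features
```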

## Instructions

Put the text you want to classify into the input_text.txt file.

Then run the main.py script; the predicted class for that text is printed to the console.
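
For reference, a minimal sketch of what the classification step inside main.py could look like, reusing the hypothetical `preprocess` and `bag_of_words` helpers sketched above (the actual script may differ):

```python
# Sketch of the classification step (assumes the helpers sketched above).
with open("input_text.txt", encoding="utf-8") as f:
    tokens = preprocess(f.read())

print("Predicted class:", classifier.classify(bag_of_words(tokens)))
```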