
NLP Assignment 1

Author: Davide Sut

ID: VR505441

The Assignment

Five Wikipedia pages were chosen for each of two categories: Geographic and Non-Geographic. The text of each page was extracted using the Wikipedia API and can be stored in a .txt file.
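
As a rough sketch, the extraction step could look like the following, assuming the `wikipedia` pip package and placeholder page titles (the actual pages used are not listed here):

```python
# Sketch of the extraction step, assuming the `wikipedia` pip package.
# The page titles below are hypothetical placeholders.
import wikipedia

geo_pages = ["Italy", "Alps"]           # hypothetical Geographic examples
non_geo_pages = ["Algorithm", "Music"]  # hypothetical Non-Geographic examples

for title in geo_pages + non_geo_pages:
    text = wikipedia.page(title).content  # plain text of the article
    with open(f"{title}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```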

The texts are preprocessed with the following steps, applied in order (a sketch follows the list):

  • A regular expression to remove all non-alphabetic characters;
  • A function to transform each word to lowercase;
  • Tokenization with nltk's word_tokenize;
  • Stopword removal using the nltk English stopwords corpus;
  • Stemming with the Porter algorithm;
  • Spell correction with the autocorrect library.
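
A minimal sketch of this pipeline, assuming nltk's punkt and stopwords data have been downloaded; the function name `preprocess` is illustrative:

```python
# Requires nltk.download("punkt") and nltk.download("stopwords") beforehand.
import re
from autocorrect import Speller
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

spell = Speller(lang="en")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[^a-zA-Z\s]", " ", text)             # keep alphabetic characters only
    text = text.lower()                                  # lowercase everything
    tokens = word_tokenize(text)                         # nltk tokenization
    tokens = [t for t in tokens if t not in stop_words]  # drop English stopwords
    tokens = [stemmer.stem(t) for t in tokens]           # Porter stemming
    tokens = [spell(t) for t in tokens]                  # autocorrect spell correction
    return tokens
```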

After preprocessing, a bag-of-words is created for each text and can be stored in a .txt file. Finally, a label is attached to each BoW and the training set is assembled.
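
A sketch of how the training set might be assembled, reusing `preprocess()` from the previous snippet; nltk classifiers expect (feature dict, label) pairs, and the variable names below are hypothetical:

```python
def bag_of_words(tokens):
    """Encode a token list as presence-based BoW features."""
    return {token: True for token in tokens}

# geo_texts / non_geo_texts: preprocessed token lists for each category
# (hypothetical names -- produced by preprocess() on the extracted pages).
train_set = (
    [(bag_of_words(tokens), "Geographic") for tokens in geo_texts]
    + [(bag_of_words(tokens), "Non-Geographic") for tokens in non_geo_texts]
)
```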

The classifier is built and trained with nltk's NaiveBayesClassifier.
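
With the training set in that shape, training reduces to a single nltk call, for example:

```python
import nltk

# Train the Naive Bayes model on the labelled feature sets built above.
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)  # optional: inspect the top features
```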

Instructions

Put the text to classify into the input_text.txt file.

Then run the main.py script; the console will print the predicted class for that text.
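
The prediction step presumably reuses the same preprocessing and BoW encoding, along the lines of the following sketch (an assumption, not the repository's actual code):

```python
# Plausible shape of the prediction step in main.py, reusing preprocess()
# and bag_of_words() from the sketches above.
with open("input_text.txt", encoding="utf-8") as f:
    input_text = f.read()

predicted = classifier.classify(bag_of_words(preprocess(input_text)))
print(f"Predicted class: {predicted}")
```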
