Five different Wikipedia pages were chosen for each category: Geographic and Non-Geographic. The text of each page was extracted using the Wikipedia API and can be stored in a .txt file.
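A minimal sketch of the extraction step, assuming the `wikipedia` Python package is used to query the Wikipedia API; the page titles and output file names here are purely illustrative.

```python
import wikipedia

# Hypothetical page titles for each category (not the ones actually used).
geo_titles = ["Alps", "Danube"]
non_geo_titles = ["Jazz", "Chess"]

for title in geo_titles + non_geo_titles:
    content = wikipedia.page(title).content  # plain text of the article
    with open(f"{title}.txt", "w", encoding="utf-8") as f:
        f.write(content)
```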
The texts are preprocessed with the following steps, applied in order (see the sketch after this list):
- A regular expression to remove all non-alphabetic characters;
- A function to convert each word to lowercase;
- Tokenization with NLTK's word_tokenize;
- Stopword removal using NLTK's English stopwords corpus;
- Stemming with Porter's algorithm;
- Spell correction with the autocorrect library.
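A rough sketch of this pipeline in the order listed above; the function and variable names are assumptions for illustration, not the project's actual identifiers.

```python
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from autocorrect import Speller

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
spell = Speller(lang="en")

def preprocess(text):
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # keep alphabetic characters only
    text = text.lower()                                    # lowercase everything
    tokens = word_tokenize(text)                           # NLTK tokenization
    tokens = [t for t in tokens if t not in stop_words]    # drop English stopwords
    tokens = [stemmer.stem(t) for t in tokens]             # Porter stemming
    tokens = [spell(t) for t in tokens]                    # autocorrect spell correction
    return tokens
```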
After the preprocessing step, a bag-of-words is created for each text and can be stored in a .txt file. Finally, a label is attached to each BoW and the training set is built.
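An illustrative construction of the labeled training set in the form NLTK classifiers expect (a list of `(featureset, label)` pairs); the helper and variable names are assumptions.

```python
def bag_of_words(tokens):
    # NLTK classifiers take a dict mapping feature names to values.
    return {token: True for token in tokens}

train_set = []
for tokens in geo_token_lists:          # preprocessed Geographic texts (assumed variable)
    train_set.append((bag_of_words(tokens), "Geographic"))
for tokens in non_geo_token_lists:      # preprocessed Non-Geographic texts (assumed variable)
    train_set.append((bag_of_words(tokens), "Non-Geographic"))
```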
The classifier is built and trained using the NLTK NaiveBayesClassifier.
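The training step itself, continuing the sketch above:

```python
from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)  # optional: inspect the most telling words
```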
To classify a new text, simply put it into the input_text.txt file, then run the main.py script; the console prints the predicted class for that text.
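The classification step might look like the following inside main.py; the input file name comes from this README, while everything else is an illustrative assumption reusing the helpers sketched earlier.

```python
with open("input_text.txt", encoding="utf-8") as f:
    tokens = preprocess(f.read())

predicted = classifier.classify(bag_of_words(tokens))
print(f"Predicted class: {predicted}")
```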