Davide Sut
ID: VR505441
Repository link: https://github.com/RedSut/NLP_Assignment1
Five different Wikipedia pages were chosen for each of the two categories: Geographic and Non-Geographic.
The page texts were extracted using the Wikipedia API and can be stored in a .txt file.
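A minimal sketch of how the extraction could look, assuming the third-party wikipedia Python package; the page titles below are hypothetical placeholders, not the pages actually used:

    import wikipedia

    # Hypothetical page titles, one list per category (the real pages are chosen in the repository).
    geographic_titles = ["Italy", "Danube"]
    non_geographic_titles = ["Algebra", "Jazz"]

    for title in geographic_titles + non_geographic_titles:
        page = wikipedia.page(title)                    # fetch the article through the Wikipedia API
        with open(title + ".txt", "w", encoding="utf-8") as f:
            f.write(page.content)                       # store the plain text for later preprocessing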
The texts are preprocessed with the following steps, in order (a sketch follows the list):
- A regular expression to remove all non-alphabetic characters;
- A function to transform each word to lowercase;
- Tokenization using nltk word_tokenize;
- Stopword removal using the nltk English stopwords corpus;
- Stemming using Porter's algorithm.
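A minimal sketch of this preprocessing pipeline, assuming nltk is installed and its punkt and stopwords resources have been downloaded; the function name is illustrative, not taken from the repository:

    import re
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        text = re.sub(r"[^a-zA-Z\s]", " ", text)              # drop non-alphabetic characters
        text = text.lower()                                    # lowercase the whole text
        tokens = word_tokenize(text)                           # nltk tokenization
        tokens = [t for t in tokens if t not in stop_words]    # remove English stopwords
        return [stemmer.stem(t) for t in tokens]               # Porter stemming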
After the preprocessing step, a bag-of-words is created for each text and can be stored in a .txt file.
Finally, the labels are applied to each BoW and a training set is created.
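One way the bag-of-words and the labelled training set could be assembled, sketched with collections.Counter; the variable names geographic_documents and non_geographic_documents are placeholders for the preprocessed token lists:

    from collections import Counter

    def bag_of_words(tokens):
        return Counter(tokens)                       # word -> frequency for one document

    dataset = []
    for tokens in geographic_documents:              # preprocessed Geographic pages
        dataset.append((bag_of_words(tokens), "geographic"))
    for tokens in non_geographic_documents:          # preprocessed Non-Geographic pages
        dataset.append((bag_of_words(tokens), "non-geographic"))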
The classifier is built and trained using the sklearn library, as sketched after this list:
- The first step is to transform the data into a TF-IDF representation;
- The second step is to select the best k features based on their chi-squared scores;
- The third step is to use the Multinomial Naive Bayes algorithm.
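A possible sklearn Pipeline covering these three steps, continuing from the hypothetical dataset above; the value of k and the use of DictVectorizer to turn the bag-of-words dictionaries into a count matrix are assumptions, not details stated in the report:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    clf = Pipeline([
        ("counts", DictVectorizer()),              # bag-of-words dicts -> sparse count matrix
        ("tfidf", TfidfTransformer()),             # counts -> TF-IDF representation
        ("select", SelectKBest(chi2, k=1000)),     # keep the k best features by chi-squared score
        ("nb", MultinomialNB()),                   # Multinomial Naive Bayes classifier
    ])

    X_train = [bow for bow, label in dataset]
    y_train = [label for bow, label in dataset]
    clf.fit(X_train, y_train)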
Lastly, the accuracy, precision and recall metrics are calculated on the test set.
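The evaluation could then look as follows, assuming held-out test data in the same format (X_test and y_test are placeholders) and treating "geographic" as the positive class:

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_pred = clf.predict(X_test)
    print("Accuracy: ", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred, pos_label="geographic"))
    print("Recall:   ", recall_score(y_test, y_pred, pos_label="geographic"))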