Natural Language Processing: Predicting Genres of Poems
Data Source: https://www.kaggle.com/ultrajack/modern-renaissance-poetry
The dataset consists of the following data:
- Total no. of data points: 573
- Number of Attributes/Columns in data: 5
- Number of Authors: 67
- Number of different poems: 506
- Number of different poem names: 509
- Types (genres): 3
Attribute Information:
- author
- content: Poems
- poem name
- age: Modern or Renaissance
- type: Genre (Mythology & Folklore, Nature, or Love)
Loading the data: The dataset is available as a .csv file.
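A minimal sketch of loading the file with Pandas. The helper name `load_poems` is hypothetical, and the column names are assumed to match the attribute list above. An in-memory sample with a made-up row stands in for the real .csv so the snippet runs on its own; in practice you would pass the path to the downloaded file instead.

```python
import io

import pandas as pd

# Column names assumed from the attribute list (not verified against the file).
EXPECTED_COLUMNS = ["author", "content", "poem name", "age", "type"]


def load_poems(source):
    """Read the poem data from a path or file-like object and check its columns."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df


# Made-up one-row sample standing in for the real dataset file.
sample_csv = io.StringIO(
    "author,content,poem name,age,type\n"
    "Hypothetical Author,Some lines of verse,Untitled,Modern,Nature\n"
)
df = load_poems(sample_csv)
print(df.shape)
print(df.columns.tolist())
```

Checking the columns up front makes a renamed or missing attribute fail early rather than partway through preprocessing.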
Approach, in steps:
- Remove duplicate and null rows.
- Convert each poem to lower case.
- Remove punctuation such as , . ‘ : etc.
- Remove unimportant words (stop words) such as "A", "An", "The", "Aboard", "About", "Above", "Absent", "Across", "After", etc.
- Vectorize the text using TfidfVectorizer.
- Split the data into training and test sets in an 80/20 ratio.
- First classifier: XGBoost
- Second classifier: SVM (LinearSVC)
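The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the project's actual code: the toy poems are made up so the snippet runs without the dataset, the stop-word list is only the handful of words named above, and `clean_poem` is a hypothetical helper. Only the LinearSVC classifier is shown, to keep the example small; an XGBoost classifier would be fitted on the same `X_train` and `y_train`.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Short illustrative stop-word list (assumption: the real list is longer).
STOP_WORDS = {"a", "an", "the", "aboard", "about", "above", "absent", "across", "after"}


def clean_poem(text):
    """Lowercase the text, replace punctuation with spaces, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


# Made-up toy rows; the last two exercise the duplicate and null removal.
rows = [
    ("The dragon guards a hoard of ancient gold.", "Mythology & Folklore"),
    ("Old gods walk hidden among the giants.", "Mythology & Folklore"),
    ("A witch sings spells beside the haunted well.", "Mythology & Folklore"),
    ("The hero slays the serpent of legend.", "Mythology & Folklore"),
    ("Green leaves tremble in the morning rain.", "Nature"),
    ("The river runs through meadows and trees.", "Nature"),
    ("Birds sing above the quiet forest.", "Nature"),
    ("Snow falls softly on the mountain pines.", "Nature"),
    ("My heart is thine, sweet love, forever.", "Love"),
    ("Her kiss was tender as a lover's vow.", "Love"),
    ("I love thee with a burning heart.", "Love"),
    ("Two lovers meet and share a kiss.", "Love"),
    ("Two lovers meet and share a kiss.", "Love"),  # duplicate row
    (None, "Nature"),  # null content
]
df = pd.DataFrame(rows, columns=["content", "type"])

# Step 1: remove null and duplicate rows.
df = df.dropna(subset=["content", "type"]).drop_duplicates(subset=["content"])

# Steps 2-4: lower-case, strip punctuation, remove stop words.
df["clean"] = df["content"].apply(clean_poem)

# Step 5: TF-IDF features; genres encoded as integer labels.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean"])
encoder = LabelEncoder()
y = encoder.fit_transform(df["type"])

# Step 6: 80/20 split, stratified so every genre appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 7 (SVM only): fit LinearSVC and score it on the held-out poems.
model = LinearSVC()
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"LinearSVC accuracy on the toy split: {accuracy:.2f}")
```

The sketch follows the listed order and vectorizes before splitting. Fitting the vectorizer on the training portion only is a common alternative, since it keeps test-set vocabulary statistics out of the features.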
Packages used:
- Numpy
- Pandas
- Matplotlib
- Seaborn
- Regular expressions (re)
- xgboost
From Scikit-Learn:
- TfidfVectorizer
- train_test_split
- LabelEncoder
- confusion_matrix
- accuracy_score
- LinearSVC
- average_precision_score