This repo contains a Python script called `get_linguistic_features.py`, an information extraction script which performs part-of-speech (PoS) tagging and named-entity recognition (NER). It extracts the relative frequency of nouns, verbs, adjectives, and adverbs per 10,000 words, as well as the total number of unique persons (PER), locations (LOC), and organisations (ORG) in each text. The extracted information is then saved for analysis or other use.
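As a worked example of the relative-frequency measure: an essay containing 125 nouns across 820 words would get RF_NOUN = 125 / 820 × 10,000 ≈ 1524.4 (the figures here are made up for illustration).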
For this project, the Uppsala Student English Corpus (USE) was used. Documentation can be found here. The following information can be found in the documentation: "The corpus consists of 1,489 essays written by 440 Swedish university students of English at three different levels, the majority in their first term of full-time studies. The total number of words is 1,221,265, which means an average essay length of 820 words. A typical first-term essay is somewhat shorter, averaging 777 words." The essays are stored across 14 sub-folders in the `USEcorpus` folder; the sub-folders correspond to the term, style, and subject of the essays.
The information extraction is done using the `en_core_web_md` spaCy model for Python. It is a medium-sized pretrained model suited for a variety of English NLP tasks.
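For reference, loading the model and inspecting its PoS tags and entities looks roughly like this (a minimal sketch, not code from this repo):

```python
import spacy

# Requires the model to be installed, e.g. `python -m spacy download en_core_web_md`
nlp = spacy.load("en_core_web_md")

doc = nlp("Anna studies English literature at Uppsala University in Sweden.")

# Coarse-grained part-of-speech tag per token
print([(token.text, token.pos_) for token in doc])

# Named entities with their labels (the English models use PERSON, GPE, ORG, ...)
print([(ent.text, ent.label_) for ent in doc.ents])
```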
The `get_linguistic_features.py` script follows these steps:
- Import dependencies
- Load the spaCy model
- Initialize information extraction per sub-folder
- Append the information to a `pandas` dataframe and save it as a `.csv` file in the `out` folder (see the sketch below)
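A condensed sketch of what that pipeline might look like for a single sub-folder is shown below; the helper name, the output filename, the entity-label mapping, and the file encoding are assumptions for illustration, and the real script may differ:

```python
import os

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")

def extract_features(text):
    """Relative PoS frequencies per 10,000 words plus unique entity counts (hypothetical helper)."""
    doc = nlp(text)
    words = [t for t in doc if not t.is_punct and not t.is_space]
    counts = {"NOUN": 0, "VERB": 0, "ADJ": 0, "ADV": 0}
    for token in words:
        if token.pos_ in counts:
            counts[token.pos_] += 1
    row = {f"RF_{pos}": round(n / max(len(words), 1) * 10000, 2) for pos, n in counts.items()}
    # Count unique entity strings; spaCy's English models label persons as PERSON and many locations as GPE
    for label, column in [("PERSON", "U_PER"), ("GPE", "U_LOC"), ("ORG", "U_ORG")]:
        row[column] = len({ent.text for ent in doc.ents if ent.label_ == label})
    return row

subfolder = os.path.join("in", "USEcorpus", "a1")
rows = []
for filename in sorted(os.listdir(subfolder)):
    # The USE files are plain text; the encoding may need adjusting for your copy of the corpus
    with open(os.path.join(subfolder, filename), encoding="latin-1") as f:
        rows.append({"Text_name": filename, **extract_features(f.read())})

pd.DataFrame(rows).to_csv(os.path.join("out", "a1_data.csv"), index=False)
```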
The code is tested on Python 3.11.2. Furthermore, if your OS is not UNIX-based, a bash-compatible terminal is required for running the shell scripts (such as Git Bash, which ships with Git for Windows).
The repo was set up to work with Windows (the WIN_ files), macOS, and Linux (the MACL_ files).
git clone https://github.com/alekswael/PoS_NER_tagger
cd PoS_NER_tagger
NOTE: Depending on your OS, run either `WIN_setup.sh` or `MACL_setup.sh`.
The setup script does the following:
- Creates a virtual environment for the project
- Activates the virtual environment
- Installs the correct versions of the packages required
- Deactivates the virtual environment
bash WIN_setup.sh
NOTE: Depending on your OS, run either `WIN_run.sh` or `MACL_run.sh`.
Run the script in a bash terminal.
The script does the following:
- Activates the virtual environment
- Runs `get_linguistic_features.py`, located in the `src` folder
- Deactivates the virtual environment
bash WIN_run.sh
Some script arguments can be set through the `argparse` module. However, this requires running the Python script separately OR altering the run*.sh file to include the arguments. The Python script is located in the `src` folder. Make sure to activate the environment before running the Python script.
get_linguistic_features.py [-h] [--folder FOLDER]
options:
-h, --help show this help message and exit
--folder FOLDER Relative path to the corpus folder (default: USEcorpus)
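Inside the script, the argument handling presumably looks something like the following sketch (the function and variable names are assumptions mirroring the help text above):

```python
import argparse

def parse_arguments():
    # Hypothetical parser matching the usage shown above
    parser = argparse.ArgumentParser(
        description="Extract PoS frequencies and unique named entities from the corpus."
    )
    parser.add_argument(
        "--folder",
        type=str,
        default="USEcorpus",
        help="Relative path to the corpus folder (default: USEcorpus)",
    )
    return parser.parse_args()
```

With the environment activated, the script can then be pointed at a different corpus folder from the repo root, e.g. `python src/get_linguistic_features.py --folder my_other_corpus`.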
This repository has the following structure:
│ .gitignore
│ MACL_run.sh
│ MACL_setup.sh
│ README.md
│ requirements.txt
│ WIN_run.sh
│ WIN_setup.sh
│
├───.github
│ .keep
│
├───in
│ └───USEcorpus
│ ├───a1
│ │ 0100.a1.txt
│ │ ...
│ ├───a2
│ │ 0100.a2.txt
│ │ ...
│ ...
│ └───c1
│ 0140.c1.txt
│ ...
│
├───out
│ .gitkeep
│
└───src
.gitkeep
get_linguistic_features.py
Example output for one sub-folder:

Folder: a1

| Text_name   | RF_NOUN | RF_VERB | RF_ADJ | RF_ADV | U_PER | U_LOC | U_ORG |
|-------------|---------|---------|--------|--------|-------|-------|-------|
| 0100.a1.txt | 1530.9  | 1221.91 | 800.56 | 533.71 | 0     | 0     | 0     |
| 0101.a1.txt | 1165.41 | 1240.6  | 588.97 | 839.6  | 0     | 0     | 0     |
| 0102.a1.txt | 1493.98 | 1204.82 | 686.75 | 481.93 | 0     | 0     | 0     |
| 0103.a1.txt | 1096.35 | 1362.13 | 598.01 | 575.86 | 0     | 0     | 3     |
| 0104.a1.txt | 1320.99 | 1197.53 | 567.9  | 679.01 | 0     | 1     | 5     |
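Since the results are saved as `.csv` files in the `out` folder, they can be loaded back for further analysis, for example with pandas (the filename here is hypothetical):

```python
import pandas as pd

# Load one of the per-folder result files and summarise the PoS frequencies
df = pd.read_csv("out/a1_data.csv")
print(df[["RF_NOUN", "RF_VERB", "RF_ADJ", "RF_ADV"]].mean())
```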