yoonichoi/NoiseInjection-TextClassification

Exploring Noise Injection for Text Classification

Building upon our comprehensive reimplementation of AEDA: An Easier Data Augmentation Technique for Text Classification, we extended our inquiry into advanced data augmentation techniques for text classification in natural language processing (NLP). We explored the injection of alphabet and numerical noise, in addition to AEDA's foundational technique of punctuation mark insertion.
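
As a rough sketch of the idea (not the project's actual implementation; the function and parameter names insert_noise, add_ratio, and charset below are illustrative assumptions, see the code/ folder for the real augmentation scripts), the snippet below randomly inserts tokens from a chosen character set between the words of a sentence, covering AEDA's punctuation insertion as well as the alphabet and numerical noise variants:

import random
import string

# Hypothetical sketch of AEDA-style random insertion, extended from punctuation
# marks to alphabet characters and digits. Names and defaults are illustrative,
# not taken from the code/ folder.
PUNCTUATIONS = [".", ";", "?", ":", "!", ","]

def insert_noise(sentence, add_ratio=0.3, charset=PUNCTUATIONS, seed=None):
    """Randomly insert tokens from `charset` between the words of `sentence`."""
    rng = random.Random(seed)
    words = sentence.split()
    # The number of insertions is random and proportional to the sentence length.
    n_insert = rng.randint(1, max(1, int(add_ratio * len(words))))
    for _ in range(n_insert):
        pos = rng.randint(0, len(words))        # any gap, including either end
        words.insert(pos, rng.choice(charset))
    return " ".join(words)

if __name__ == "__main__":
    text = "the movie was surprisingly good"
    print(insert_noise(text, seed=0))                                        # punctuation noise (AEDA)
    print(insert_noise(text, charset=list(string.ascii_lowercase), seed=1))  # alphabet noise
    print(insert_noise(text, charset=list(string.digits), seed=2))           # numerical noise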

To view information on the AEDA reimplementation, click here
Link to our Poster and Report


Repository Structure

├── aeda
├── code
├── data
├── experiments
│   ├── addratio_experiment
│   ├── bert
│   ├── increments_experiment
│   └── numaug_experiment
└── reproduce_fig2
  • aeda and data are from the original AEDA repo
  • code contains the augmentation code, which you can apply to your own data
  • experiments contains the code we used to run the different experiments for our project; refer to each folder's README for the hyperparameter settings used
  • reproduce_fig2 is for our AEDA reimplementation task

Results from our experiments

You can find individual plots at higher resolution in the outputs/[runname]/plots folder for each experiment.

  • Add Ratio Experiment
  • Increments Experiment
  • Number of Augmentations Experiment

To run experiments

  1. Set up requirements
pip install -r requirements.txt
  2. Download glove.840B.300d to the word2vec/ folder
wget https://nlp.stanford.edu/data/glove.840B.300d.zip && unzip glove.840B.300d.zip
mkdir word2vec 
mv glove.840B.300d.txt word2vec/ && rm glove.840B.300d.zip
  3. cd to the experiment folder you want to run.
cd experiments/[experiment_folder]
  4. Process data for training; this produces appropriate augmented data for the experiment, on top of the original training data. Refer to data_process.py to check which augmentations will be created.
python data_process.py
  5. Run the experiments.
python train_eval.py --seed 0 --runname myrun

train_eval.py takes three arguments: seed, runname, and analyze. If you don't specify runname, experiment results are automatically saved under a folder named with the current time.
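
For example, with runname omitted, the results of the run below are saved in a timestamped folder under outputs/:
python train_eval.py --seed 0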

  6. (Optional) Run the command below to create a figure from the experiment results, specifying the runname under outputs/ that you want to plot.
python plot_individual.py myrun

About

Fall 2023 ANLP Final Project: Exploring Noise Injection for Text Classification
