
A closer look at the performance of neural language models on reflexive anaphor licensing

This repository contains the materials for the following paper:

Jennifer Hu, Sherry Yong Chen, and Roger Levy (2020). A closer look at the performance of neural language models on reflexive anaphor licensing. In Proceedings of the Society for Computation in Linguistics (SCiL 2020), Volume 3, pages 382-392.

If you use any of our code, data, or analyses, please cite the paper using the bib entry below:

@InProceedings{Hu:et-al:2020,
  Author = {Hu, Jennifer and Chen, Sherry Yong and Levy, Roger},
  Title = {A closer look at the performance of neural language models on reflexive anaphor licensing},
  Booktitle = {Proceedings of the Society for Computation in Linguistics (SCiL 2020)},
  Volume = {3},
  Pages = {382-392},
  Year = {2020},
  URL = {https://scholarworks.umass.edu/scil/vol3/iss1/37/}
}

Overview

Our materials are organized into three primary folders:

  • analysis (code for reproducing the results and figures in the paper)
  • data (accuracy and surprisal results from each model)
  • stimuli (test suites and script for extracting sentences)

Please note that we do not provide code for running each model. For more details, see the Dependencies section.

Stimuli

For each experiment, a .csv file containing the stimuli can be found at stimuli/<EXPERIMENT>/<PRONOUN>.csv. The file is structured as follows:

| column | values | related experiments | description |
| --- | --- | --- | --- |
| item | 1, 2, 3, ... | 1-4 | index of item |
| clause_type | embed, simple | 1-4 | whether there's an embedded clause (sentential complement & relative clause constructions) or single clause (prepositional phrases) |
| locality | local, nonlocal | 1-3 | whether intended antecedent is in same clause as reflexive (local) or matrix clause (nonlocal) |
| c-command | c-command, no-c-command | 1a, 2, 4 | whether intended antecedent c-commands the reflexive |
| grammatical | 0, 1 | 1-4 | whether item is grammatical |
| counterbalance | a, b, c, d | 1-4 | counterbalancing index to ensure that every lexical item appears equally often |
| mismatch_position | none, X | 1-4 | which NP is mismatched (X in ["head_noun", "distractor_noun", "local_sub", "nonlocal_subj", "matrix_subj_noun", "rc_subj_noun"]; none if grammatical) |
| mismatch_feature | none, number | 1-4 | which feature is mismatched (none if grammatical) |
| head_noun, distractor_noun | lexical item | 4 | head NP (licensing), distractor NP inside prepositional phrase (non-licensing) |
| local_sub, nonlocal_subj | lexical item | 1b, 3 | local clause subject NP (licensing), matrix clause subject NP (non-licensing) |
| matrix_subj_noun, rc_subj_noun | lexical item | 1a, 2 | matrix clause subject NP (licensing), relative clause subject NP (non-licensing) |
| verb | lexical item | 1-4 | verb for item |
| pronoun | herself, himself, themselves | 1-4 | target reflexive pronoun |
| sentence, sentence_no_eos | sentence | 1-4 | final sentence (with or without <eos> token) |
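
These files are plain CSVs, so they can be inspected directly with pandas. The sketch below is not part of the repository; the specific path (Experiment 4's prepositional-phrase suite with the herself pronoun) is an assumed instance of the stimuli/<EXPERIMENT>/<PRONOUN>.csv pattern.

import pandas as pd

# Load one stimuli file following the stimuli/<EXPERIMENT>/<PRONOUN>.csv pattern.
# The exact path below is an assumed example; substitute any experiment/pronoun
# combination that exists in the repository.
stimuli = pd.read_csv("stimuli/exp4-pp/herself.csv")

print(stimuli.columns.tolist())                                      # columns described in the table above
print(stimuli.groupby(["grammatical", "mismatch_position"]).size())  # design cells
print(stimuli[["sentence", "sentence_no_eos"]].head())               # example sentences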

Extracting sentences

To extract the sentences from this file, use the script stimuli/extract_sentences.py. You can toggle flags like --uncased and --eos depending on the requirements of your model. Please note that the final period at the end of each sentence is separated by whitespace. Otherwise, no tokenization assumptions are made.
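
If you prefer to work with the .csv files directly, the core of the extraction can be reproduced in a few lines of pandas. This is an illustration rather than the repository's script, and it assumes that --uncased lowercases the sentences and that --eos keeps the <eos> token:

import pandas as pd

# Illustration only: not the repository's extract_sentences.py.
uncased = True    # analogue of the --uncased flag (assumed to lowercase sentences)
keep_eos = True   # analogue of the --eos flag (assumed to keep the <eos> token)

df = pd.read_csv("stimuli/exp4-pp/herself.csv")           # any stimuli file
col = "sentence" if keep_eos else "sentence_no_eos"
sentences = df[col].str.lower() if uncased else df[col]
sentences.to_csv("sentences.txt", index=False, header=False)  # hypothetical output file, one sentence per line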

Selecting lexical items

The stimuli/lexicon folder contains code and counts of nouns, pronouns, and verbs in the GRNN Wikipedia training corpus. See the paper for more details on how we constructed our materials.

Vocabulary issues

In our novel materials (used in the exp2-rc-all, exp3-comp, and exp4-pp experiments), the lexical items are designed to be in-vocabulary for models trained on the Penn Treebank. This is not the case for the materials used in Experiment 1, our replication of Marvin & Linzen (2018).

Data

The per-token surprisal values for each model can be found in the data folder, following this naming convention:

data/surprisal/<MODEL>/<EXPERIMENT>/<PRONOUN>_<MODEL>.txt

The BERT data is in a slightly different .csv format, but otherwise follows the same naming convention.
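
For example, the surprisal files for a single model can be collected by filling in the naming convention. The glob below assumes only the directory layout shown above; the model name grnn is a hypothetical directory name (use the names that actually appear under data/surprisal/), and the internal format of the .txt files is not described here.

from glob import glob

model = "grnn"  # hypothetical model directory name

# Matches data/surprisal/<MODEL>/<EXPERIMENT>/<PRONOUN>_<MODEL>.txt
for path in sorted(glob(f"data/surprisal/{model}/*/*_{model}.txt")):
    print(path)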

The accuracy results can be found at

data/accuracy/<EXPERIMENT>.csv
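
These are ordinary .csv files and can be loaded directly; the experiment name below is one of the stimuli folder names mentioned above, and the column layout is whatever the analysis code writes (it is not documented in this README).

import pandas as pd

# Follows the data/accuracy/<EXPERIMENT>.csv naming convention.
acc = pd.read_csv("data/accuracy/exp4-pp.csv")
print(acc.head())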

Dependencies

Our analysis code requires a basic scientific installation of Python (numpy, pandas, matplotlib, seaborn, etc.).
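
For example, the packages can be installed with pip (versions are not pinned in this README, so recent releases of each should work):

pip install numpy pandas matplotlib seaborn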

If you would like to run a model on our materials, whether to replicate our results or to assess a new model, you will need to set up the computing environment yourself. Please see the repositories linked below for the source code of the models we evaluated (with the exception of the n-gram model, which uses SRILM). We can make the training script for our n-gram model available upon request.

Models evaluated in our analysis

  1. GRNN (Gulordava et al. 2018)
  2. JRNN (Jozefowicz et al. 2016)
  3. RNNG (Dyer et al. 2016; we use an unpublished version to get incremental RNNG parses, courtesy of Peng Qian)
  4. Transformer-XL (Dai et al. 2019; we use the Hugging Face implementation)
  5. Tiny LSTM
  6. n-gram

Reproducing our results

Figures

To generate the plots for a given experiment and list of models, run the following:

cd analysis
mkdir -p figures
python generate_plot.py -o figures -model <MODELS> -exp <EXPERIMENT> -vs

This will save a plot to analysis/figures/<EXPERIMENT>-<MODELS>.png. Note that <MODELS> can be a list of model names (e.g. -model rnng bert jrnn), 'big' for the large-vocabulary models, or 'all' for all models. The large-vocabulary models are BERT, Transformer-XL, JRNN, GRNN, and 5-gram.

The -vs flag plots the negative log probability differential; omit it to plot the raw negative log probabilities.
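
For example, assuming the experiment identifiers match the stimuli folder names (e.g. exp2-rc-all), a concrete invocation for three of the models named above would be:

python generate_plot.py -o figures -model rnng bert jrnn -exp exp2-rc-all -vs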

To plot the results for all our experiments, run the following (replacing figures with your desired output folder, which will be created if it does not exist):

cd analysis
./get_figures figures

Accuracy

Similarly, to compute the accuracy for a given experiment and list of models, run:

cd analysis
mkdir -p accuracy
python compute_accuracy.py -o accuracy -model <MODELS> -exp <EXPERIMENT>

This will save a file to analysis/accuracy/<EXPERIMENT>-<MODELS>.csv.
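
For example (again assuming the experiment identifier matches the stimuli folder name):

python compute_accuracy.py -o accuracy -model rnng bert jrnn -exp exp3-comp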

To compute the accuracy for all our experiments, run the following:

cd analysis
./get_accuracy accuracy
