This repository contains several components developed and maintained throughout research into the source discovery project, which was conducted at USC from January 2022 to May 2022 and at Bloomberg from May 2022 to December 2022. Overall, the project seeks to develop systems to detect, attribute, and ultimately predict quotes from information sources used in journalistic news writing.
An information source is any source that is referenced by a journalist throughout the course of their reporting, and can include:
- Quotes from named individuals
- Reports/Documents (e.g. legislative text, court cases, academic documents)
- Votes/Polls
See below for an example of the variety of ways sources may be used in news writing.
The aim of this work is threefold:
- Detection: we wish to detect, in each sentence, whether a source is being referenced.
- Attribution: we wish to attribute each quote to an informational source.
- Prediction: we wish to predict, in ablated documents, whether the document needs another source or not.
We approach this by collecting annotations on existing news articles, denoting whether each sentence is attributed to a source, which source that is, and characteristics of the source.
There are several distinct components in this repository:

- `models_neural`: Transformer-based models to detect and attribute quotes, developed at Bloomberg for the purpose of this project.
- `models_other`: Baseline models developed for this project, as well as external academic models tested for this project, like a custom-designed topic model (see https://arxiv.org/pdf/2104.09656.pdf for more details).
- `scripts`: Rule-based baselines developed and tested to detect and attribute sources.
- `app`: The Flask/JavaScript application developed to annotate documents.
In the following sections, we'll describe how to train the neural models. The training process is the same for both detection and attribution. We have several scripts that run different specific models. If you are running a script on DSP, the flags should already be set to the appropriate data and model files for these experiments. In general, the data and model files that I used live in the following S3 bucket: `s3://aspangher`. The project-specific files, like the training data and test data, live in the following directory: `s3://aspangher/source-exploration/`, and the general files, like Huggingface base models and Spacy models, live in subdirectories; the paths in the scripts are valid.

One more general point: unless otherwise specified, the trained model will be uploaded to the directory `s3://aspangher/source-exploration/./`, with the model name taken from the `--notes` flag. See the end of this section for more information.
- `quote_detection/scripts/train.sh`: Runs training with the default setup: our annotated data and a RoBERTa sentence-classification model. To run training, you must have the following flags set (see the example invocation after this list):

```bash
--pretrained_files_s3 <HUGGINGFACE_PRETRAINED_MODEL> \
--train_data_file_s3 <INPUT_TRAINING_FILE> \
```

- `quote_detection/scripts/train_polnear_data.sh`: Contains the same setup as the previous script, but for training with the polnear dataset.
- `quote_detection/scripts/train_local.sh`: A convenient script for testing the training process locally.
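For illustration, here is a hypothetical invocation of the default training script. This is only a sketch: it assumes `train.sh` forwards its arguments to the underlying `train.py`, and the model name and file paths are placeholders, not the exact files used in our experiments.

```bash
# Illustrative only: substitute the Huggingface model and training file you actually use.
sh quote_detection/scripts/train.sh \
    --pretrained_files_s3 roberta-base \
    --train_data_file_s3 s3://aspangher/source-exploration/<INPUT_TRAINING_FILE> \
    --notes my-detection-run  # the trained model is uploaded under this name
```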
- `quote_attribution/scripts/run_classification_script.sh`: Runs the classifier-based quote attributor.
- `quote_attribution/scripts/run_span_detection_script.sh`: Runs the span-based quote attributor.

Both of these scripts expect the following flags to be set:

```bash
--pretrained_model_path <HUGGINGFACE_MODEL> \
--train_data_file <TRAINING_DATA_FILE> \
--spacy_model_file <SPACY_MODEL_FILE> \
```
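As with detection, a hypothetical invocation of the classifier-based attributor; this again assumes the script forwards its arguments, and all three values are placeholders:

```bash
# Illustrative invocation of the classifier-based quote attributor.
sh quote_attribution/scripts/run_classification_script.sh \
    --pretrained_model_path roberta-base \
    --train_data_file s3://aspangher/source-exploration/<TRAINING_DATA_FILE> \
    --spacy_model_file s3://aspangher/spacy/<SPACY_MODEL_FILE>
```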
The default location for all project-independent resources (i.e. Spacy model files, Huggingface model files) is `s3://aspangher`. Here you'll find the directories `s3://aspangher/transformer-pretrained-models/` (Huggingface models) and `s3://aspangher/spacy/` (Spacy models).

The default location for all project-specific resources is `s3://aspangher/source-exploration`. Here you'll find training data and previously trained models.
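If you need to browse or fetch these resources by hand, the standard AWS CLI works, assuming you have credentials for the bucket; the object names below are placeholders:

```bash
# List the available pretrained Huggingface models.
aws s3 ls s3://aspangher/transformer-pretrained-models/

# Download a Spacy model directory locally (placeholder name).
aws s3 cp s3://aspangher/spacy/<SPACY_MODEL> ./<SPACY_MODEL> --recursive
```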
All the data management is found in `models_neural/src/utils_data_access.py`. These methods are called by the various `train.py` methods. To change the default paths, either pass in specific full paths or alter this file.
`quote_detection`:

- `quote_detection/scripts/test.sh`: Runs inference. To work, it must have the following flags set:

```bash
--pretrained_files_s3 <PRETRAINED_MODEL>
--train_data_file_s3 <INPUT_DATA_FILE>
--discriminator_path <TRAINED_MODEL_PATH>
--processed_data_fname <OUTPUT_DATA_FILE>
```
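A hypothetical inference run might look like the following; as before, this assumes the script forwards its arguments, and every value shown is a placeholder:

```bash
# Illustrative inference invocation: a pretrained base model, input data,
# the trained discriminator checkpoint, and an output file for predictions.
sh quote_detection/scripts/test.sh \
    --pretrained_files_s3 roberta-base \
    --train_data_file_s3 s3://aspangher/source-exploration/<INPUT_DATA_FILE> \
    --discriminator_path s3://aspangher/source-exploration/<TRAINED_MODEL_PATH> \
    --processed_data_fname <OUTPUT_DATA_FILE>
```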
Please obtain the `data.zip` file from Alex. Unzip it and place it in the `app/data` directory, then run `python main.py` from the `app/` directory.
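Concretely, assuming `data.zip` sits in the repository root, setup looks something like this:

```bash
# Unpack the annotation data into the app's data directory (data.zip comes from Alex).
unzip data.zip -d app/data

# Launch the Flask annotation app.
cd app
python main.py
```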
Direct your browser to http://localhost:5002/render_table?task=affil-role
to begin annotating.
Reach out to Alex for further instructions, assignments, etc.