This repository contains the dataset and experimental scripts for the paper "The Role of Context in Detecting Previously Fact-Checked Claims", accepted at Findings of NAACL 2022. The paper focuses on detecting whether an input claim has been previously fact-checked. We focus on claims made in political debates, and we study the impact of modeling the context of the claim: both on the source side, i.e., in the debate, and on the target side, i.e., in the fact-checking explanation document.
Code and dataset for related earlier work can be found in "That is a Known Lie: Detecting Previously Fact-Checked Claims", available at https://www.aclweb.org/anthology/2020.acl-main.332.pdf.
More details about the dataset can be found in the data directory.
To train the reranker, we need several things:
- Elasticsearch scores between the input claims (iclaims) and the verified claims (vclaims).
- SBERT embeddings of the vclaims and their articles.
- SBERT embeddings of the iclaims.
After obtaining these, we create the feature vectors used to train the rankSVM reranker.
Computing the Elasticsearch scores requires a running Elasticsearch server, so start it before running the experiment script:
elasticsearch -d # The -d flag runs the Elasticsearch server in the background as a daemon process
To install Elasticsearch, check this link.
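If you want to sanity-check that the server is up before launching the experiments, here is a minimal probe with the official Python client; the host and port are the Elasticsearch defaults and are an assumption about your setup:

from elasticsearch import Elasticsearch

# Connect to the local Elasticsearch server (default host and port).
es = Elasticsearch("http://localhost:9200")
print(es.ping())  # prints True if the server is reachable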
Once the Elasticsearch server is running, run the experiments as follows:
data_dir="../data/" # The directory that contains the dataset
elasticsearch_score_dir="$data_dir/elasticsearch.scores.100" # The directory where the scores will be saved
# $coref_ext is the extension of the coreference-resolved transcripts (optional; see --coref-ext below).
# Pass --load only once, to index the data into the Elasticsearch server.
python get_elasticsearch_scores.py \
-d $data_dir \
-o $elasticsearch_score_dir \
-n 100 \
--index politifact \
--coref-ext $coref_ext \
--load \
--lower 0 --upper 70 --positives
Usage of the script:
usage: get_elasticsearch_scores.py [-h] --data-root DATA_ROOT --out-dir
OUT_DIR [--nscores NSCORES] [--index INDEX]
[--load] [--positives] [--lower LOWER]
[--upper UPPER]
[--concat-before CONCAT_BEFORE]
[--concat-after CONCAT_AFTER]
[--coref-ext COREF_EXT]
Get elasticsearch scores
optional arguments:
-h, --help show this help message and exit
--data-root DATA_ROOT, -d DATA_ROOT
Path to the dataset directory.
--out-dir OUT_DIR, -o OUT_DIR
Path to the output directory
--nscores NSCORES, -n NSCORES
Number of retrieved vclaim scores returned
per iclaim.
--index INDEX Name of the Elasticsearch index in which the data is stored
--load Reload the data into elasticsearch if this flag is set.
--positives If this flag is set, only the sentences with a vclaim
will be scored from the transcript
--lower LOWER To run the code in batches; the script processes
transcripts[lower:upper]
--upper UPPER To run the code in batches; the script processes
transcripts[lower:upper]
--concat-before CONCAT_BEFORE
Number of sentences concatenated before the input
sentence in a transcript
--concat-after CONCAT_AFTER
Number of sentences concatenated after the input
sentence in a transcript
--coref-ext COREF_EXT
If using coreference-resolved transcripts, the
extension of the resolved transcript files
({data}/transcripts{COREF_EXT})
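For intuition, the core of this step is a standard full-text query: for each iclaim, retrieve the top-n vclaims together with their BM25 scores. Here is a minimal sketch with the Python Elasticsearch client, not the actual script; the index name politifact matches the example above, but the field name vclaim is an assumption:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def top_vclaims(iclaim, field="vclaim", n=100):
    """Return the top-n (vclaim id, BM25 score) pairs for one input claim."""
    body = {"size": n, "query": {"match": {field: iclaim}}}
    response = es.search(index="politifact", body=body)
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]

# Example: score one debate sentence against all indexed vclaims.
scores = top_vclaims("The unemployment rate has doubled over the last year.")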
The second step is to get the Sentence-BERT (SBERT) embeddings of the following:
- the vclaim (vclaim)
- the vclaim article title (title)
- the vclaim article text (text)
- the iclaims (transcript)
To run the experiment:
data_dir="../data/politifact"
bert_embedding_dir="$data_dir/SBERT.large.embeddings"
sbert_config="config/bert-specs.json"
python get_bert_embeddings.py \
-d $data_dir \
-o $bert_embedding_dir \
-c $sbert_config \
--coref-ext $coref_ext \
-i transcript vclaim title text
Usage of the script:
usage: get_bert_embeddings.py [-h] --data-root DATA_ROOT --out-dir OUT_DIR
--config CONFIG
[--input {vclaim,title,text,transcript} [{vclaim,title,text,transcript} ...]]
[--coref-ext COREF_EXT]
[--concat-before CONCAT_BEFORE]
[--concat-after CONCAT_AFTER]
Get elasticsearch scores
optional arguments:
-h, --help show this help message and exit
--data-root DATA_ROOT, -d DATA_ROOT
Path to the dataset directory.
--out-dir OUT_DIR, -o OUT_DIR
Path to the output directory
--config CONFIG, -c CONFIG
Path to the config file
--input {vclaim,title,text,transcript} [{vclaim,title,text,transcript} ...], -i {vclaim,title,text,transcript} [{vclaim,title,text,transcript} ...]
Which data entries to compute SBERT embeddings for.
--coref-ext COREF_EXT
If using coreference-resolved transcripts, the
extension of the resolved transcript files
({data}/transcripts{EXT}; provide EXT)
--concat-before CONCAT_BEFORE
Number of sentences concatenated before the input
sentence in a transcript
--concat-after CONCAT_AFTER
Number of sentences concatenated after the input
sentence in a transcript
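For intuition, computing these embeddings boils down to encoding each text field with an SBERT model and saving the resulting matrix. Here is a minimal sketch using the sentence-transformers library; the model name below is an assumption, as the actual model is specified in config/bert-specs.json:

import numpy as np
from sentence_transformers import SentenceTransformer

# The model name is an assumption; the real one comes from config/bert-specs.json.
model = SentenceTransformer("all-mpnet-base-v2")

vclaims = ["First verified claim ...", "Second verified claim ..."]
embeddings = model.encode(vclaims, convert_to_numpy=True)  # shape: (num_texts, dim)
np.save("SBERT.large.embeddings/vclaim.npy", embeddings)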
To create the input for the rankSVM, we combine all the scores and reciprocal ranks obtained from Elasticsearch when querying on {vclaim, title, text, all}, together with the cosine similarity scores, and their reciprocal ranks, between the iclaim and {vclaim, title, top-4 sentences from text}.
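Here is a minimal sketch of how such a feature vector could be assembled for one (iclaim, vclaim) pair; the feature order and the helper names are assumptions, not the script's exact layout:

import numpy as np

def reciprocal_rank(rank):
    """rank is 1-based; 0 means the vclaim was not retrieved at all."""
    return 1.0 / rank if rank > 0 else 0.0

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_features(es_scores, es_ranks, iclaim_emb, vclaim_emb, title_emb, sent_embs):
    feats = []
    # Elasticsearch score and reciprocal rank for each query mode.
    for mode in ("vclaim", "title", "text", "all"):
        feats += [es_scores[mode], reciprocal_rank(es_ranks[mode])]
    # Cosine similarity between the iclaim and each SBERT-encoded target;
    # the reciprocal ranks of these similarities are added analogously.
    feats.append(cosine(iclaim_emb, vclaim_emb))
    feats.append(cosine(iclaim_emb, title_emb))
    feats += [cosine(iclaim_emb, s) for s in sent_embs[:4]]  # top-4 article sentences
    return np.array(feats)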
To run the script, do the following:
data_dir="../data/politifact"
elasticsearch_score_dir="$data_dir/elasticsearch.scores.100"
bert_embedding_dir="$data_dir/SBERT.large.embeddings"
rerank_embedding_dir="$data_dir/rerank.embeddings"
python get_rerank_embeddings.py \
-d $data_dir \
-o $rerank_embedding_dir \
-b $bert_embedding_dir \
-e $elasticsearch_score_dir \
-N 100 \
-n 4 \
-m text \
--lower $1 --upper $2 --positives
# -m selects the Elasticsearch query mode whose top documents are reranked.
Usage of the script:
usage: get_rerank_embeddings.py [-h] --data-root DATA_ROOT --out-dir OUT_DIR
--SBERT-dir SBERT_DIR --elasticsearch-dir
ELASTICSEARCH_DIR [--list-length LIST_LENGTH]
[--num-sentences NUM_SENTENCES]
[--measure {all,vclaim,title,text}]
[--lower LOWER] [--upper UPPER]
[--num-workers NUM_WORKERS]
[--randomness RANDOMNESS] [--manual]
[--concat-before CONCAT_BEFORE]
[--concat-after CONCAT_AFTER]
[--coref-ext COREF_EXT] [--positives]
Get elasticsearch scores
optional arguments:
-h, --help show this help message and exit
--data-root DATA_ROOT, -d DATA_ROOT
Path to the dataset directory.
--out-dir OUT_DIR, -o OUT_DIR
Path to the output directory
--SBERT-dir SBERT_DIR, -b SBERT_DIR
Path to the SBERT embedding directory
--elasticsearch-dir ELASTICSEARCH_DIR, -e ELASTICSEARCH_DIR
Path to the Elasticsearch scores directory
--list-length LIST_LENGTH, -N LIST_LENGTH
Number of retrieved verified-claim scores returned
per sentence.
--num-sentences NUM_SENTENCES, -n NUM_SENTENCES
Number of top sentences from the article text used
for the cosine similarity features.
--measure {all,vclaim,title,text}, -m {all,vclaim,title,text}
Choose the Elasticsearch query mode whose ranked list is reranked
--lower LOWER To run the code in batches; the script processes
transcripts[lower:upper]
--upper UPPER To run the code in batches; the script processes
transcripts[lower:upper]
--num-workers NUM_WORKERS
Number of parallelized processes
--randomness RANDOMNESS, -r RANDOMNESS
Probability of adding the true label to the list
--manual If this flag is set, the labels from the manual
annotations will be considered
--concat-before CONCAT_BEFORE
Number of sentences concatenated before the input
sentence in a transcript
--concat-after CONCAT_AFTER
Number of sentences concatenated after the input
sentence in a transcript
--coref-ext COREF_EXT
If using coreference-resolved transcripts, the
extension of the resolved transcript files
({data}/transcripts{COREF_EXT}; provide COREF_EXT)
--positives If this flag is set, only the sentences with a verified
claim will be scored from the transcript
We use the rankSVM tool from LIBSVM Tools, specifically the model in this zip file. To train the rankSVM, we first convert the numpy vectors produced by get_rerank_embeddings.py to the LIBSVM ranking format and then train the model.
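The tool expects one line per (iclaim, vclaim) candidate pair: a relevance label, a qid that groups the candidates of the same iclaim, and then indexed feature values. A minimal sketch of the conversion; the function and argument names are hypothetical:

import numpy as np

def write_libsvm_qid(path, labels, qids, features):
    """Write (label, qid, feature vector) triples in the LIBSVM ranking format."""
    with open(path, "w") as f:
        for label, qid, feats in zip(labels, qids, features):
            feat_str = " ".join(f"{i + 1}:{v:.6f}" for i, v in enumerate(feats))
            f.write(f"{label} qid:{qid} {feat_str}\n")

# Example: two candidate vclaims for iclaim 1; label 1 marks the true match.
write_libsvm_qid("train.qid",
                 labels=[1, 0],
                 qids=[1, 1],
                 features=np.array([[12.3, 0.91], [4.2, 0.18]]))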
To convert the format, run:
data_dir="../data/politifact"
elasticsearch_score_dir="$data_dir/elasticsearch.scores.100"
bert_embedding_dir="$data_dir/SBERT.large.embeddings"
rerank_embedding_dir="$data_dir/rerank.embeddings"
ranksvm_embeddings_dir=$data_dir/ranksvm.embeddings
python libsvm-datahandler.py \
-d $data_dir \
-o $ranksvm_embeddings_dir \
-r $rerank_embedding_dir
To run the rankSVM, download and build (make) the tool from this zip file. Then run:
data_dir="../data/politifact"
ranksvm_embeddings_dir=$data_dir/ranksvm.embeddings
ranksvm_dir="./ranksvm"
time $ranksvm_dir/svm-scale -l -1 -u 1 -s $ranksvm_embeddings_dir/ranksvm.range $ranksvm_embeddings_dir/train.qid > $ranksvm_embeddings_dir/train.qid.scale
time $ranksvm_dir/svm-train $ranksvm_embeddings_dir/train.qid.scale $ranksvm_embeddings_dir/ranksvm.model
time $ranksvm_dir/svm-scale -r $ranksvm_embeddings_dir/ranksvm.range $ranksvm_embeddings_dir/test.qid > $ranksvm_embeddings_dir/test.qid.scale
time $ranksvm_dir/svm-predict $ranksvm_embeddings_dir/test.qid.scale $ranksvm_embeddings_dir/ranksvm.model $ranksvm_embeddings_dir/test.qid.predict
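svm-predict writes one score per line to test.qid.predict, in the same order as the lines of test.qid. To evaluate the reranker you can group the scores by qid and rank the candidates; a minimal MRR sketch, assuming the files are in the current directory and each query has exactly one positive labeled 1:

from collections import defaultdict

# Group (predicted score, gold label) pairs by query id.
queries = defaultdict(list)
with open("test.qid") as gold, open("test.qid.predict") as pred:
    for line, score in zip(gold, pred):
        label, qid_field = line.split()[:2]  # qid_field looks like "qid:1"
        queries[qid_field].append((float(score), int(label)))

# Mean reciprocal rank of the positive candidate per query.
reciprocal_ranks = []
for candidates in queries.values():
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    rank = 1 + next(i for i, (_, label) in enumerate(ranked) if label == 1)
    reciprocal_ranks.append(1.0 / rank)
print(f"MRR: {sum(reciprocal_ranks) / len(reciprocal_ranks):.4f}")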
Usage of the script:
usage: libsvm-datahandler.py [-h] --data-root DATA_ROOT --out-dir OUT_DIR
--reranker-dir RERANKER_DIR [--before BEFORE]
[--after AFTER] [--concat-before CONCAT_BEFORE]
[--concat-after CONCAT_AFTER]
[--coref-ext COREF_EXT]
Get elasticsearch scores
optional arguments:
-h, --help show this help message and exit
--data-root DATA_ROOT, -d DATA_ROOT
Path to the dataset directory.
--out-dir OUT_DIR, -o OUT_DIR
Path to the output directory
--reranker-dir RERANKER_DIR, -r RERANKER_DIR
Path to the reranker embeddings directory
--before BEFORE Number of sentences taken as context from BEFORE
the sentence
--after AFTER Number of sentences taken as context from AFTER the
sentence
--concat-before CONCAT_BEFORE
Number of sentences concatenated before the input
sentence in a transcript
--concat-after CONCAT_AFTER
Number of sentences concatenated after the input
sentence in a transcript
--coref-ext COREF_EXT
If using coreference-resolved transcripts, the
extension of the resolved transcript files
({data}/transcripts{EXT}; provide EXT)
Please cite the following paper:
Shaden Shaar, Firoj Alam, Giovanni Da San Martino, and Preslav Nakov, "The Role of Context in Detecting Previously Fact-Checked Claims", Findings of NAACL 2022.
@inproceedings{Claim:retrieval:context:2022,
author = {Shaar, Shaden and
Alam, Firoj and
Da San Martino, Giovanni and
Nakov, Preslav},
title = {The Role of Context in Detecting Previously Fact-Checked Claims},
booktitle = {Findings of the Association for Computational Linguistics: NAACL-HLT 2022},
series = {NAACL-HLT~'22},
address = {Seattle, Washington, USA},
year = {2022},
}
- Shaden Shaar, Qatar Computing Research Institute, HBKU, Qatar
- Firoj Alam, Qatar Computing Research Institute, HBKU, Qatar
- Giovanni Da San Martino, University of Padova, Italy
- Preslav Nakov, Qatar Computing Research Institute, HBKU, Qatar
This dataset is published under the CC BY-NC-SA 4.0 license, which means everyone can use it for non-commercial research purposes: https://creativecommons.org/licenses/by-nc-sa/4.0/.