Larger Ghanaian Dataset
An even larger corpus of news articles has been collected and processed in a way that may be useful for TPI or PWLWP tasks in that both causal assertions and beliefs have been extracted. The result is a dataset that can be further analyzed, and this documentation, particularly descriptions of the columns, is intended to aid in that endeavor. The process and approximately half the files are borrowed from the large dataset, so this documentation will only highlight the differences between the two.
Several pieces of software need to work together to get articles from their source, through various analyses, and to the resulting dataset. Almost all are the same as the large dataset.
- Scrape articles - Same
- Write causes - Same
- Read causes - Same
- Add beliefs and locations - Same
- Interpret dates - Same
- Find nearest locations - Although there are still locations included for each sentence and for each context window around a sentence, there was not previously a direct indication of where in the window the locations are found, and the context may not be appropriately sized anyway, so now the nearest location is identified for each sentence. In this way nearly every sentence is associated with a location; whether it is suitable to use is left to the consumer of the data. There are four extra columns involved, so check below for details. A sketch of the idea follows this list.
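The exact procedure, and the four extra columns it produces, is defined by the pipeline code, but a minimal sketch of the idea, assuming that "nearest" means the smallest sentence-index distance within an article and that ties are broken toward the earlier sentence, might look like this (the function name and data layout are illustrative only):

```python
# Illustrative sketch: for each sentence, find the closest sentence (by
# index distance) that mentions a location and borrow its locations.
# The real pipeline may differ in tie-breaking and in the columns it writes.

def nearest_locations(sentence_locations):
    """sentence_locations: list where element i holds the locations found
    in sentence i (possibly empty).  Returns, for each sentence, a
    (locations, distance) pair for the nearest non-empty entry, or
    (None, None) if no sentence in the article mentions a location."""
    indexed = [i for i, locs in enumerate(sentence_locations) if locs]
    result = []
    for i in range(len(sentence_locations)):
        if not indexed:
            result.append((None, None))
            continue
        # Closest sentence with a location; ties go to the earlier sentence.
        j = min(indexed, key=lambda k: (abs(k - i), k))
        result.append((sentence_locations[j], abs(j - i)))
    return result

# Example: only sentences 1 and 4 mention locations.
print(nearest_locations([[], ["Abronye (7.69381, -1.9091)"], [], [],
                         ["Efuanta (5.28527, -2.00557)"]]))
```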
News articles were collected from seven sources, which is one less than before. One was removed due to terms of service requirements. Another, 3News, has added a CAPTCHA mechanism since the last dataset was put together, so there are no new articles from there. Here are the sources:
- 3News - only articles from the large dataset
- Adom Online
- The Chronicle
- CITI FM
- e.tv ghana
- GhanaWeb - now with articles back to 2017
- Happy FM
Only very simple search terms were used. It is not known to what extent any kind of operators are supported or even what happens if spaces are used in the search. It seems clear from the results that more recent articles are favored. GhanaWeb does offer some settings related to date of publication, and they were used to specify years from 2017 through 2023. The previous dataset included only 2023 for the older search terms, but the other years have now been backfilled. Again, gold often matched in the context of sporting events, but there wasn't an obvious way to prevent that. These search terms were used, with the new ones in bold:
- crop
- galamsey
- gold
- harvest
- livestock
- mining
- price
The counts below describe the dataset size in various dimensions. The first few tables focus on files, which correspond to articles (modulo duplication). The later tables are based on sentences.
Description | Count |
---|---|
Pages downloaded | 67726 |
Files scraped* | 67700 |
Files processed by Eidos+ | 67682 |
*Some pages were not articles, but error pages stating that the article couldn't be found. Some pages matched multiple search terms; the counts here and on the next line include the duplicate pages. Duplicates are first accounted for at the sentence level.
+Some files could not be read because of a bug in the processing of holidays.
The processed files are distributed across search terms as such:
Search Term | Count |
---|---|
crop | 9495 |
galamsey | 9256 |
gold | 14492 |
harvest | 5489 |
livestock | 1400 |
mining | 13252 |
price | 14298 |
total | 67682 |
The processed files are distributed across sources like this:
Source | Count |
---|---|
3News | 3891 |
Adom Online | 14757 |
The Chronicle | 2119 |
CITI FM | 4938 |
e.tv ghana | 1725 |
GhanaWeb | 38102 |
Happy FM | 2150 |
total | 67682 |
Sentences from the files are classified as to whether they are part of a causal assertion or express a belief. They break down like this:
Description | Count |
---|---|
Sentences | 1366007 |
Causal sentences | 95740 |
Belief sentences | 122618 |
Sentences both causal and belief | 11525 |
Sentences neither causal nor belief | 1159174 |
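This breakdown can in principle be recomputed from the dataset's causal and belief columns. Below is a minimal sketch using pandas; the file name is a placeholder, and it assumes a sentence is identified by the pair (url, sentenceIndex), since a sentence with several causal relations appears on several rows.

```python
import csv
import pandas as pd

# Placeholder file name; Booleans are written "True"/"False" in the tsv.
df = pd.read_csv("ghana_dataset.tsv", sep="\t", dtype=str,
                 keep_default_na=False, quoting=csv.QUOTE_NONE)

# One row per causal relation, so deduplicate to get one row per sentence.
sentences = df.drop_duplicates(subset=["url", "sentenceIndex"])
causal = sentences["causal"] == "True"
belief = sentences["belief"] == "True"

print("Sentences:", len(sentences))
print("Causal sentences:", causal.sum())
print("Belief sentences:", belief.sum())
print("Both causal and belief:", (causal & belief).sum())
print("Neither causal nor belief:", (~causal & ~belief).sum())
```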
Finally, sentences are spread out across article publication date like this:
Year | Count |
---|---|
2001 | 35 |
2013 | 189 |
2014 | 8530 |
2015 | 9799 |
2016 | 15743 |
2017 | 216963 |
2018 | 159015 |
2019 | 156083 |
2020 | 185409 |
2021 | 184535 |
2022 | 236369 |
2023 | 193337 |
total | 1366007 |
The dataset contains quite a few columns. Several are intended to address the TPI use case in which causes and effects can be increasing or decreasing.
- url - The URL from which the article was downloaded.
- terms - The term (or terms separated by a space) which led to the page. Articles are deduplicated per source on matching URLs.
- date - The dateline from the article, if found, verbatim. A canonicalized version is now available in a later column.
- sentenceIndex - The index of each sentence per article as tokenized by Eidos.
- sentence - The text of the sentence. Line feeds and tabs have been replaced with spaces to make the tsv file easier to read.
- causal - A Boolean to indicate whether the sentence includes a causal relation.
- causalIndex - A sentence can contain multiple causal relations, so each is numbered and listed separately, with each of the preceding columns duplicated.
- negationCount - A Boolean to indicate whether the relation is negated. That would mean, for example, that it didn't apply or happen.
- causeIncCount - The number of phrases indicating that the cause increased or that there is more of it. They include words such as accelerate, boost, promote, and strengthen. They are called increase_triggers.
- causeDecCount - The number of phrases indicating that the cause decreased or that there is less of it. They include words such as abate, curtail, decrease, and prohibit. They are called decrease_triggers.
- causePosCount - The number of phrases indicating that the cause is positive. Words to that end include aide, ease, relieve, better, and good. They are called positive_affect_triggers.
- causeNegCount - The number of phrases indicating that the cause is negative. Words to that end include challenge, threaten, and worse. They are called negative_affect_triggers.
- effectIncCount - As causeIncCount, but for the effect.
- effectDecCount - As causeDecCount, but for the effect.
- effectPosCount - As causePosCount, but for the effect.
- effectNegCount - As causeNegCount, but for the effect.
- causeText - The text of the cause.
- effectText - The text of the effect.
- belief - A Boolean indicating whether the sentence (with possible help from the previous sentence if something like "they" or "this" needs to be resolved) contains a belief.
- sent_locs - The geophysical locations mentioned in the sentence, if any. This column and the next are comma-separated lists of a region name followed, in parentheses, by the latitude and longitude associated with the region: for example, Abronye (7.69381, -1.9091), Efuanta (5.28527, -2.00557).
- context_locs - Locations mentioned within the previous or next three sentences around the belief.
- canonicalDate - The article's date from the date column is converted here to one of three formats (so far): YYYY-MM-DDTHH:MM:SS or YYYY-MM-DDTHH:MM if a time was specified at all for the article, or YYYY-MM-DD if there was only a date.
Boolean values are written "True" and "False".
If causal is false, then columns for causalIndex, negationCount, causeIncCount, causeDecCount, causePosCount, causeNegCount, effectIncCount, effectDecCount, effectPosCount, and effectNegCount are empty. If there are no locations for sent_locs or context_locs, those columns are also empty.
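To illustrate how these conventions might be handled when reading the file, here is a hedged sketch: the file name is a placeholder, the regular expression assumes location names do not themselves contain commas or parentheses, and the date formats are the three listed above.

```python
import csv
import re
from datetime import datetime

# Location entries look like: Abronye (7.69381, -1.9091), Efuanta (5.28527, -2.00557)
# (assumes names contain no commas or parentheses).
LOC_PATTERN = re.compile(r"([^,(]+?)\s*\(([-\d.]+),\s*([-\d.]+)\)")
# The three canonicalDate formats described above.
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M", "%Y-%m-%d")

def parse_bool(value):
    # Boolean columns are written "True" and "False"; empty means not applicable.
    return None if value == "" else value == "True"

def parse_locs(value):
    return [(name.strip(), float(lat), float(lon))
            for name, lat, lon in LOC_PATTERN.findall(value)]

def parse_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    return None  # no usable date

with open("ghana_dataset.tsv", newline="", encoding="utf-8") as f:  # placeholder name
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if parse_bool(row["causal"]):
            print(row["causeText"], "->", row["effectText"],
                  parse_locs(row["sent_locs"]), parse_date(row["canonicalDate"]))
```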
Some files are borrowed from other projects:
- belief model - The model should be downloaded automatically when needed, but it can also be done in advance.
- locations file - This is small enough to include with the source code.