Larger Ghanaian Dataset

Introduction

An even larger corpus of news articles has been collected and processed in a way that may be useful for TPI or PWLWP tasks in that both causal assertions and beliefs have been extracted. The result is a dataset in a folder at Box.com that can be further analyzed, and this documentation, particularly descriptions of the columns, is intended to aid in that endeavor. The process and approximately half the files are borrowed from the large dataset, so this documentation will only highlight the differences between the two.

Pipeline

Several pieces of software need to work together to get articles from their source, through various analyses, and to the resulting dataset. Almost all are the same as for the large dataset.

Scrape articles - Same
Write causes - Same
Read causes - Same
Add beliefs and locations - Same
Interpret dates - Same
Find nearest locations - Although locations are still included for each sentence and for each context window around a sentence, there was previously no direct indication of where in the window the locations are found, and the context may not be appropriately sized anyway. Now the nearest location is identified for each sentence, so nearly every sentence is associated with a location. Whether it is one suitable to use is left to the consumer of the data. There are four extra columns involved, so check below for details.
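
The nearest-location step can be pictured with a minimal sketch. This is not the project's actual code: the function name `nearest_locations`, the list-of-lists input, and the choice of the first location when a sentence mentions several are all assumptions made for illustration. The only behavior taken from this documentation is that the search starts with the sentence itself (distance 0) and moves toward the start or the end of the article.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], Optional[int], Optional[str], Optional[int]]

def nearest_locations(sent_locs: List[List[str]]) -> List[Row]:
    """Return (prevLocation, prevDistance, nextLocation, nextDistance) per sentence.

    sent_locs[i] holds the locations found in sentence i (hypothetical input format).
    """
    n = len(sent_locs)
    rows: List[Row] = []
    for i in range(n):
        prev_loc, prev_dist = None, None
        # Search from the current sentence toward the start of the article.
        for j in range(i, -1, -1):
            if sent_locs[j]:
                # Assumption: take the first location listed for the sentence.
                prev_loc, prev_dist = sent_locs[j][0], i - j
                break
        next_loc, next_dist = None, None
        # Search from the current sentence toward the end of the article.
        for j in range(i, n):
            if sent_locs[j]:
                next_loc, next_dist = sent_locs[j][0], j - i
                break
        rows.append((prev_loc, prev_dist, next_loc, next_dist))
    return rows

# A five-sentence article with locations in sentences 1 and 4.
article = [[], ["Accra"], [], [], ["Kumasi"]]
rows = nearest_locations(article)
```

In this example the first sentence has no previous location, so its previous values stay empty, while the third sentence gets "Accra" at distance 1 and "Kumasi" at distance 2.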

Sources

News articles were collected from seven sources, which is one fewer than before. One was removed due to terms of service requirements. Another, 3News, has added a CAPTCHA mechanism since the last dataset was put together, so there are no new articles from there. Here are the sources:

3News - only articles from the large dataset
Adom Online
The Chronicle
CITI FM
e.tv ghana
GhanaWeb - now with articles back to 2017
Happy FM

Search Terms

Only very simple search terms were used. It is not known to what extent any kind of operators are supported or even what happens if spaces are used in the search. It seems clear from the results that more recent articles are favored. GhanaWeb does offer some settings related to date of publication and they were used to specify years from 2017 through 2023. The previous dataset included only 2023 for the older search terms, but the other years have now been backfilled. Again, gold often matched in the context of sporting events, but there wasn't an obvious way to prevent that. These search terms were used with the new ones in bold:

crop
galamsey
gold
harvest
livestock
mining
price

Counts

The counts below describe the dataset size in various dimensions. The first few tables focus on files, which correspond to articles (modulo duplication). The later tables are based on sentences.

Description	Count
Pages downloaded	67726
Files scraped*	67700
Files processed by Eidos+	67682

*Some pages were not articles, but error pages stating that the article couldn't be found. Some pages matched multiple search terms. The counts here and on the next line include the duplicate pages. Duplicates are first accounted for at the sentence level.

+Some files could not be read because of a bug in the processing of holidays.

The processed files are distributed across search terms as follows:

Search Term	Count
crop	9495
galamsey	9256
gold	14492
harvest	5489
livestock	1400
mining	13252
price	14298
total	67682

The processed files are distributed across sources like this:

Source	Count
3News	3891
Adom Online	14757
The Chronicle	2119
CITI FM	4938
e.tv ghana	1725
GhanaWeb	38102
Happy FM	2150
total	67682

Sentences from the files are classified as to whether they are part of a causal assertion or express a belief. They break down like this:

Description	Count
Causal sentences	95740
Belief sentences	122618
Sentences both causal and belief	11525
Sentences neither causal nor belief	1159174
All sentences	1366007
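
The rows of this table only add up to the total if the causal and belief counts each include the sentences that are both. The short check below works through that arithmetic. The variable names are illustrative and the numbers are copied from the table above.

```python
# Counts copied from the table above.
causal = 95_740
belief = 122_618
both = 11_525
neither = 1_159_174

# Sentences that are causal, belief, or both, counted once each
# (inclusion-exclusion: the overlap is subtracted so it is not counted twice).
either = causal + belief - both

# Adding the sentences that are neither should give the "All sentences" row.
total = either + neither
print(either, total)
```

Under that reading, 206833 sentences are causal, belief, or both, and the total matches the 1366007 reported for all sentences.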

Finally, sentences are distributed across article publication years like this (the struck-through 2001 row is not included in the total):

Year	Count
~~2001~~	35
2013	189
2014	8530
2015	9799
2016	15743
2017	216963
2018	159015
2019	156118
2020	185409
2021	184535
2022	236369
2023	193337
total	1366007

Columns

The dataset contains quite a few columns. Except for the last several, all are the same as the columns of the large dataset.

url - Same
terms - Same
date - Same
sentenceIndex - Same
sentence - Same
causal - Same
causalIndex - Same
negationCount - Same
causeIncCount - Same
causeDecCount - Same
causePosCount - Same
causeNegCount - Same
effectIncCount - Same
effectDecCount - Same
effectPosCount - Same
effectNegCount - Same
causeText - Same
effectText - Same
belief - Same
sent_locs - Same
context_locs - Same
canonicalDate - Same
prevLocation - The closest location found starting with the sentence in question and searching towards the start of the article, if any.
prevDistance - The distance in sentences between the current sentence and the sentence with the prevLocation. This will be an integer 0 or greater or blank if there is no prevLocation.
nextLocation - The closest location found starting with the sentence in question and searching towards the end of the article, if any.
nextDistance - The distance in sentences between the current sentence and the sentence with the nextLocation. This will be an integer 0 or greater or blank if there is no nextLocation.

When dealing with previous and next locations, you might decide on a maximum acceptable distance in either direction and keep only values that are within that distance.
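
One way to apply such a cutoff is sketched below. It assumes each row is available as a mapping from column name to string value, as `csv.DictReader` would produce, with blanks for missing values. The function name `pick_location` and the rule of preferring the previous location on a tie are assumptions for illustration, not part of the dataset.

```python
from typing import Mapping, Optional

def pick_location(row: Mapping[str, str], max_distance: int) -> Optional[str]:
    """Return the nearer of prevLocation and nextLocation within max_distance, if any."""
    candidates = []
    for loc_col, dist_col in (("prevLocation", "prevDistance"),
                              ("nextLocation", "nextDistance")):
        location = row.get(loc_col, "").strip()
        distance = row.get(dist_col, "").strip()
        # Blank values mean no location was found in that direction.
        if location and distance and int(distance) <= max_distance:
            candidates.append((int(distance), location))
    if not candidates:
        return None
    # min() returns the first smallest item, so a tie goes to the previous
    # location, which was appended first (an arbitrary choice for this sketch).
    return min(candidates, key=lambda c: c[0])[1]

row = {"prevLocation": "Accra", "prevDistance": "3",
       "nextLocation": "Kumasi", "nextDistance": "1"}
chosen = pick_location(row, max_distance=2)
```

Here only the next location falls within two sentences, so `chosen` is "Kumasi". With `max_distance=0` the function returns `None` because neither location is in the sentence itself.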

Materials

No additional files were borrowed for this project. The list is the same as for the large dataset:

belief model - Same
locations file - Same
