Larger Ghanaian Dataset

Keith Alcock edited this page Mar 20, 2024 · 13 revisions

Introduction

An even larger corpus of news articles has been collected and processed in a way that may be useful for TPI or PWLWP tasks in that both causal assertions and beliefs have been extracted. The result is a dataset in a folder at Box.com that can be further analyzed, and this documentation, particularly descriptions of the columns, is intended to aid in that endeavor. The process and approximately half the files are borrowed from the large dataset, so this documentation will only highlight the differences between the two.

Pipeline

Several pieces of software need to work together to get articles from their source, through various analyses, and into the resulting dataset. Almost all are the same as for the large dataset.

  • Scrape articles - Same
  • Write causes - Same
  • Read causes - Same
  • Add beliefs and locations - Same
  • Interpret dates - Same
  • Find nearest locations - Although locations are still included for each sentence and for each context window around a sentence, there was previously no direct indication of where in the window the locations are found, and the context may not be appropriately sized anyway, so now the nearest location is identified for each sentence. In this way nearly every sentence is associated with a location. Whether it is one suitable to use is left to the consumer of the data. There are four extra columns involved, so check below for details.
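The nearest-location step can be pictured with a small sketch. This is not the project's actual code; it is a minimal illustration that assumes each article is available as a list of per-sentence location lists. The function name `nearest_locations` is hypothetical, and the choice of the first listed location when a sentence mentions several is an assumption.

```python
from typing import List, Optional, Tuple

# (location, distance in sentences); both are None when nothing was found.
Nearest = Tuple[Optional[str], Optional[int]]


def nearest_locations(sent_locs: List[List[str]]) -> List[Tuple[Nearest, Nearest]]:
    """For each sentence, return ((prevLocation, prevDistance), (nextLocation, nextDistance)).

    Assumption: when a sentence mentions several locations, the first one
    listed is used. The real pipeline may choose differently.
    """
    n = len(sent_locs)
    prev: List[Nearest] = [(None, None)] * n
    nxt: List[Nearest] = [(None, None)] * n

    # Forward pass: track the latest sentence at or before i that has a location.
    last = None
    for i in range(n):
        if sent_locs[i]:
            last = i
        if last is not None:
            prev[i] = (sent_locs[last][0], i - last)

    # Backward pass: track the earliest sentence at or after i that has a location.
    upcoming = None
    for i in range(n - 1, -1, -1):
        if sent_locs[i]:
            upcoming = i
        if upcoming is not None:
            nxt[i] = (sent_locs[upcoming][0], upcoming - i)

    return list(zip(prev, nxt))


# Invented example: five sentences, two of which mention locations.
article = [[], ["Accra"], [], ["Kumasi", "Tamale"], []]
result = nearest_locations(article)
```

A sentence that itself contains a location gets a distance of 0 in both directions, which matches the "integer 0 or greater" description of the distance columns.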

Sources

News articles were collected from seven sources, which is one fewer than before. One was removed due to terms of service requirements. Another, 3News, has added a CAPTCHA mechanism since the last dataset was put together, so there are no new articles from there. Here are the sources:

Search Terms

Only very simple search terms were used. It is not known to what extent any kind of operator is supported or even what happens if spaces are used in the search. It seems clear from the results that more recent articles are favored. GhanaWeb does offer some settings related to date of publication, and they were used to specify years from 2017 through 2023. The previous dataset included only 2023 for the older search terms, but the other years have now been backfilled. Again, gold often matched in the context of sporting events, but there wasn't an obvious way to prevent that. These search terms were used with the new ones in bold:

  • crop
  • galamsey
  • gold
  • harvest
  • livestock
  • mining
  • price

Counts

The counts below describe the dataset size in various dimensions. The first few tables focus on files, which correspond to articles (modulo duplication). The later tables are based on sentences.

Description                 Count
Pages downloaded            67726
Files scraped*              67700
Files processed by Eidos+   67682

*Some pages were not articles, but error pages stating that the article couldn't be found. Some pages matched multiple search terms. The counts on this line and the next include the duplicate pages. Duplicates are first accounted for at the sentence level.

+Some files could not be read because of a bug in the processing of holidays.

The processed files are distributed across search terms as follows:

Search Term   Count
crop          9495
galamsey      9256
gold          14492
harvest       5489
livestock     1400
mining        13252
price         14298
total         67682

The processed files are distributed across sources like this:

Source          Count
3News           3891
Adom Online     14757
The Chronicle   2119
CITI FM         4938
e.tv ghana      1725
GhanaWeb        38102
Happy FM        2150
total           67682

Sentences from the files are classified as to whether they are part of a causal assertion or express a belief. They break down like this:

Description                           Count
Causal sentences                      95740
Belief sentences                      122618
Sentences both causal and belief      11525
Sentences neither causal nor belief   1159174
All sentences                         1366007
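The rows of this table overlap: the arithmetic only works out if the causal and belief counts each include the sentences that are both. The short check below uses the numbers from the table; the variable names are just for illustration.

```python
# Counts copied from the sentence classification table.
causal = 95740
belief = 122618
both = 11525
neither = 1159174
all_sentences = 1366007

# Sentences that fall in exactly one category.
causal_only = causal - both
belief_only = belief - both

# The four disjoint groups add up to the total number of sentences.
disjoint_sum = causal_only + belief_only + both + neither
print(disjoint_sum == all_sentences)
```

This matters when computing proportions: adding the causal and belief rows directly would count the 11525 overlapping sentences twice.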

Finally, sentences are spread out across article publication date like this:

Year    Count
2001    35
2013    189
2014    8530
2015    9799
2016    15743
2017    216963
2018    159015
2019    156118
2020    185409
2021    184535
2022    236369
2023    193337
total   1366007

Columns

The dataset contains quite a few columns. Except for the last four, all are the same as the columns of the large dataset.

  • url - Same
  • terms - Same
  • date - Same
  • sentenceIndex - Same
  • sentence - Same
  • causal - Same
  • causalIndex - Same
  • negationCount - Same
  • causeIncCount - Same
  • causeDecCount - Same
  • causePosCount - Same
  • causeNegCount - Same
  • effectIncCount - Same
  • effectDecCount - Same
  • effectPosCount - Same
  • effectNegCount - Same
  • causeText - Same
  • effectText - Same
  • belief - Same
  • sent_locs - Same
  • context_locs - Same
  • canonicalDate - Same
  • prevLocation - The closest location found starting with the sentence in question and searching towards the start of the article, if any.
  • prevDistance - The distance in sentences between the current sentence and the sentence with the prevLocation. This will be an integer 0 or greater, or blank if there is no prevLocation.
  • nextLocation - The closest location found starting with the sentence in question and searching towards the end of the article, if any.
  • nextDistance - The distance in sentences between the current sentence and the sentence with the nextLocation. This will be an integer 0 or greater, or blank if there is no nextLocation.

When dealing with previous and next locations, you might decide on a maximum acceptable distance in either direction and keep only values that are within that distance.
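One way to apply such a cutoff is sketched below. It uses only the four column names documented above. The tab delimiter, the sample rows, the helper name `pick_location`, and the rule of preferring the previous location on a tie are all assumptions made for illustration; the actual files and location format may differ.

```python
import csv
import io
from typing import Dict, Optional


def pick_location(row: Dict[str, str], max_distance: int) -> Optional[str]:
    """Return the nearest location within max_distance sentences, or None.

    Blank location or distance cells are treated as "no location found".
    On a tie, the previous location wins (an arbitrary choice).
    """
    candidates = []
    for loc_col, dist_col in (("prevLocation", "prevDistance"),
                              ("nextLocation", "nextDistance")):
        loc = (row.get(loc_col) or "").strip()
        dist = (row.get(dist_col) or "").strip()
        if loc and dist and int(dist) <= max_distance:
            candidates.append((int(dist), loc))
    if not candidates:
        return None
    # min() returns the first minimal item, so prev is preferred on ties.
    return min(candidates, key=lambda c: c[0])[1]


# Invented sample rows; real files hold many more columns.
SAMPLE = (
    "sentenceIndex\tprevLocation\tprevDistance\tnextLocation\tnextDistance\n"
    "0\tKumasi\t0\tKumasi\t0\n"
    "1\tKumasi\t1\tAccra\t4\n"
    "3\tKumasi\t3\tAccra\t2\n"
    "6\tAccra\t1\t\t\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
chosen = [pick_location(row, max_distance=2) for row in rows]
```

A smaller `max_distance` leaves more sentences without a location but makes the remaining associations more likely to be relevant, so the cutoff is a trade-off for the consumer of the data to tune.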

Materials

No additional files were borrowed for this project. The list is the same as for the large dataset: