Larger Ghanaian Dataset
An even larger corpus of news articles has been collected and processed in a way that may be useful for TPI or PWLWP tasks in that both causal assertions and beliefs have been extracted. The result is a dataset in a folder at Box.com that can be further analyzed, and this documentation, particularly descriptions of the columns, is intended to aid in that endeavor. The process and approximately half the files are borrowed from the large dataset, so this documentation will only highlight the differences between the two.
Several pieces of software need to work together to get articles from their source, through various analyses, and to the resulting dataset. Almost all are the same as the large dataset.
- Scrape articles - Same
- Write causes - Same
- Read causes - Same
- Add beliefs and locations - Same
- Interpret dates - Same
- Find nearest locations - Although locations are still included for each sentence and for each context window around a sentence, there was previously no direct indication of where in the window the locations are found, and the context may not be appropriately sized anyway. The nearest location is therefore now identified for each sentence, so that nearly every sentence is associated with a location. Whether that location is suitable to use is left to the consumer of the data. Four extra columns are involved, so check below for details.
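The nearest-location step can be pictured with the following minimal sketch. It is not the project's actual code: the function name `nearest_locations`, the input layout (one list of location names per sentence), and the choice of the first listed location when a sentence mentions several are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (prevLocation, prevDistance, nextLocation, nextDistance); None stands for a blank cell.
Nearest = Tuple[Optional[str], Optional[int], Optional[str], Optional[int]]


def nearest_locations(sentence_locs: List[List[str]]) -> List[Nearest]:
    """For each sentence, find the closest location looking backward and forward.

    sentence_locs[i] holds the locations mentioned in sentence i (hypothetical layout).
    A sentence that mentions a location is its own nearest location at distance 0.
    """
    n = len(sentence_locs)
    prev: List[Tuple[Optional[str], Optional[int]]] = [(None, None)] * n
    nxt: List[Tuple[Optional[str], Optional[int]]] = [(None, None)] * n

    # Backward search: remember the most recent sentence that had a location.
    last = None
    for i in range(n):
        if sentence_locs[i]:
            last = i
        if last is not None:
            # Assumption: use the first location listed for that sentence.
            prev[i] = (sentence_locs[last][0], i - last)

    # Forward search: the same idea, scanning from the end of the article.
    last = None
    for i in range(n - 1, -1, -1):
        if sentence_locs[i]:
            last = i
        if last is not None:
            nxt[i] = (sentence_locs[last][0], last - i)

    return [(prev[i][0], prev[i][1], nxt[i][0], nxt[i][1]) for i in range(n)]


example = nearest_locations([[], ["Accra"], [], [], ["Kumasi", "Tamale"]])
```

In this example the first sentence has no previous location, so its previous fields stay blank, while its next location is "Accra" at a distance of 1.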
News articles were collected from seven sources, which is one fewer than before. One was removed due to terms of service requirements. Another, 3News, has added a CAPTCHA mechanism since the last dataset was put together, so there are no new articles from there. Here are the sources:
- 3News - only articles from the large dataset
- Adom Online
- The Chronicle
- CITI FM
- e.tv ghana
- GhanaWeb - now with articles back to 2017
- Happy FM
Only very simple search terms were used. It is not known to what extent operators of any kind are supported, or even what happens if spaces are used in the search. It seems clear from the results that more recent articles are favored. GhanaWeb does offer some settings related to date of publication, and they were used to specify years from 2017 through 2023. The previous dataset included only 2023 for the older search terms, but the other years have now been backfilled. Again, gold often matched in the context of sporting events, but there wasn't an obvious way to prevent that. These search terms were used with the new ones in bold:
- crop
- galamsey
- gold
- harvest
- livestock
- mining
- price
The counts below describe the dataset size in various dimensions. The first few tables focus on files, which correspond to articles (modulo duplication). The later tables are based on sentences.
Description | Count |
---|---|
Pages downloaded | 67726 |
Files scraped* | 67700 |
Files processed by Eidos+ | 67682 |
*Some pages were not articles, but error pages stating that the article couldn't be found. Some pages matched multiple search terms. The counts on this line and the next include the duplicate pages. Duplicates are first accounted for at the sentence level.

+Some files could not be read because of a bug in the processing of holidays.

The processed files are distributed across search terms as follows:
Search Term | Count |
---|---|
crop | 9495 |
galamsey | 9256 |
gold | 14492 |
harvest | 5489 |
livestock | 1400 |
mining | 13252 |
price | 14298 |
total | 67682 |
The processed files are distributed across sources like this:
Source | Count |
---|---|
3News | 3891 |
Adom Online | 14757 |
The Chronicle | 2119 |
CITI FM | 4938 |
e.tv ghana | 1725 |
GhanaWeb | 38102 |
Happy FM | 2150 |
total | 67682 |
Sentences from the files are classified as to whether they are part of a causal assertion or express a belief. They break down like this:
Description | Count |
---|---|
Causal sentences | 95740 |
Belief sentences | 122618 |
Sentences both causal and belief | 11525 |
Sentences neither causal nor belief | 1159174 |
All sentences | 1366007 |
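The counts above add up only if the causal and belief rows each include the sentences that are both. This is an inference from the arithmetic rather than a statement from the dataset authors. A short check, with variable names chosen for illustration:

```python
# Counts copied from the table above.
causal = 95740
belief = 122618
both = 11525
neither = 1159174
all_sentences = 1366007

# Sentences that fall into exactly one category.
causal_only = causal - both
belief_only = belief - both

# The four disjoint groups should account for every sentence.
assert causal_only + belief_only + both + neither == all_sentences
```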
Finally, sentences are spread out across article publication date like this:
Year | Count |
---|---|
2013 | 189 |
2014 | 8530 |
2015 | 9799 |
2016 | 15743 |
2017 | 216963 |
2018 | 159015 |
2019 | 156118 |
2020 | 185409 |
2021 | 184535 |
2022 | 236369 |
2023 | 193337 |
total | 1366007 |
The dataset contains quite a few columns. Except for the last several, all are the same as the columns of the large dataset.
- url - Same
- terms - Same
- date - Same
- sentenceIndex - Same
- sentence - Same
- causal - Same
- causalIndex - Same
- negationCount - Same
- causeIncCount - Same
- causeDecCount - Same
- causePosCount - Same
- causeNegCount - Same
- effectIncCount - Same
- effectDecCount - Same
- effectPosCount - Same
- effectNegCount - Same
- causeText - Same
- effectText - Same
- belief - Same
- sent_locs - Same
- context_locs - Same
- canonicalDate - Same
- prevLocation - The closest location found starting with the sentence in question and searching towards the start of the article, if any.
- prevDistance - The distance in sentences between the current sentence and the sentence containing the prevLocation. This is an integer 0 or greater, or blank if there is no prevLocation.
- nextLocation - The closest location found starting with the sentence in question and searching towards the end of the article, if any.
- nextDistance - The distance in sentences between the current sentence and the sentence containing the nextLocation. This is an integer 0 or greater, or blank if there is no nextLocation.
When dealing with previous and next locations, you might decide on a maximum acceptable distance in either direction and keep only values that are within that distance.
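One way to apply such a cutoff is sketched below. It assumes each row is available as a dictionary keyed by column name (for example from `csv.DictReader`) with blank cells as empty strings. The helper name `pick_location` and the rule that ties go to the previous location are illustrative choices, not part of the dataset.

```python
from typing import Dict, Optional


def pick_location(row: Dict[str, str], max_distance: int) -> Optional[str]:
    """Return the nearer of prevLocation/nextLocation within max_distance, else None."""
    candidates = []
    for loc_key, dist_key in (
        ("prevLocation", "prevDistance"),
        ("nextLocation", "nextDistance"),
    ):
        dist_text = (row.get(dist_key) or "").strip()
        if not dist_text:
            continue  # blank distance means there is no location in that direction
        distance = int(dist_text)
        if distance <= max_distance:
            candidates.append((distance, row[loc_key]))
    if not candidates:
        return None
    # min() returns the first smallest item, so a tie favors the previous location
    # (an arbitrary choice made for this sketch).
    return min(candidates, key=lambda c: c[0])[1]


row = {
    "prevLocation": "Accra", "prevDistance": "3",
    "nextLocation": "Kumasi", "nextDistance": "1",
}
chosen = pick_location(row, max_distance=2)
```

With a maximum distance of 2, the previous location at distance 3 is discarded and the next location is kept. A row whose distances all exceed the cutoff yields no location at all.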
No additional files were borrowed for this project. The list is the same as for the large dataset:
- belief model - Same
- locations file - Same