Larger Ghanaian Dataset
An even larger corpus of news articles has been collected and processed in a way that may be useful for TPI or PWLWP tasks in that both causal assertions and beliefs have been extracted. The result is a dataset that can be further analyzed, and this documentation, particularly descriptions of the columns, is intended to aid in that endeavor. The process and approximately half the files are borrowed from the large dataset, so this documentation will only highlight the differences between the two.
Several pieces of software need to work together to get articles from their source, through various analyses, and to the resulting dataset. Almost all are the same as the large dataset.
- Scrape articles - Same
- Write causes - Same
- Read causes - Same
- Add beliefs and locations - Same
- Interpret dates - Same
- Find nearest locations - Although there are still locations included for each sentence and for each context window around a sentence, there was not previously a direct indication of where in the window the locations are found, and the context may not be appropriately sized anyway, so now the nearest location is identified for each sentence. In this way nearly every sentence is associated with a location; whether it is suitable to use is left to the consumer of the data. There are four extra columns involved, so check below for details. A sketch of the idea follows this list.
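The exact procedure, and the four extra columns it produces, is defined by the pipeline code, but a minimal sketch of the idea, assuming that "nearest" means the smallest sentence-index distance within an article and that ties are broken toward the earlier sentence, might look like this (the function name and data layout are illustrative only):

```python
# Illustrative sketch: for each sentence, find the closest sentence (by
# index distance) that mentions a location and borrow its locations.
# The real pipeline may differ in tie-breaking and in the columns it writes.

def nearest_locations(sentence_locations):
    """sentence_locations: list where element i holds the locations found
    in sentence i (possibly empty).  Returns, for each sentence, a
    (locations, distance) pair for the nearest non-empty entry, or
    (None, None) if no sentence in the article mentions a location."""
    indexed = [i for i, locs in enumerate(sentence_locations) if locs]
    result = []
    for i in range(len(sentence_locations)):
        if not indexed:
            result.append((None, None))
            continue
        # Closest sentence with a location; ties go to the earlier sentence.
        j = min(indexed, key=lambda k: (abs(k - i), k))
        result.append((sentence_locations[j], abs(j - i)))
    return result

# Example: only sentences 1 and 4 mention locations.
print(nearest_locations([[], ["Abronye (7.69381, -1.9091)"], [], [],
                         ["Efuanta (5.28527, -2.00557)"]]))
```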
News articles were collected from seven sources, which is one less than before. One was removed due to terms of service requirements. Another, 3News, has added a CAPTCHA mechanism since the last dataset was put together, so there are no new articles from there. Here are the sources:
- 3News - only articles from the large dataset
- Adom Online
- The Chronicle
- CITI FM
- e.tv ghana
- GhanaWeb - now with articles back to 2017
- Happy FM
Only very simple search terms were used. It is not known to what extent any kind of operators are supported or even what happens if spaces are used in the search. It seems clear from the results that more recent articles are favored. GhanaWeb does offer some settings related to date of publication, and they were used to specify years from 2017 through 2023. The previous dataset included only 2023 for the older search terms, but the other years have now been backfilled. Again, gold often matched in the context of sporting events, but there wasn't an obvious way to prevent that. These search terms were used, with the new ones in bold:
- crop
- galamsey
- gold
- harvest
- livestock
- mining
- price
The counts below describe the dataset size in various dimensions. The first few tables focus on files, which correspond to articles (modulo duplication). The later tables are based on sentences.
Description | Count |
---|---|
Pages downloaded | 67726 |
Files scraped* | 67700 |
Files processed by Eidos+ | 67682 |
*Some pages were not articles, but error pages stating that the article couldn't be found. Some pages matched multiple search terms; the counts here and on the next line include the duplicate pages. Duplicates are first accounted for at the sentence level.
+Some files could not be read because of a bug in the processing of holidays.
The processed files are distributed across search terms as such:
Search Term | Count |
---|---|
crop | 9495 |
galamsey | 9256 |
gold | 14492 |
harvest | 5489 |
livestock | 1400 |
mining | 13252 |
price | 14298 |
total | 67682 |
The processed files are distributed across sources like this:
Source | Count |
---|---|
3News | 3891 |
Adom Online | 14757 |
The Chronicle | 2119 |
CITI FM | 4938 |
e.tv ghana | 1725 |
GhanaWeb | 38102 |
Happy FM | 2150 |
total | 67682 |
Sentences from the files are classified as to whether they are part of a causal assertion or express a belief. They break down like this:
Description | Count |
---|---|
Sentences | 1366007 |
Causal sentences | 95740 |
Belief sentences | 122618 |
Sentences both causal and belief | 11525 |
Sentences neither causal nor belief | 1159174 |
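This breakdown can in principle be recomputed from the dataset's causal and belief columns. Below is a minimal sketch using pandas; the file name is a placeholder, and it assumes a sentence is identified by the pair (url, sentenceIndex), since a sentence with several causal relations appears on several rows.

```python
import csv
import pandas as pd

# Placeholder file name; Booleans are written "True"/"False" in the tsv.
df = pd.read_csv("ghana_dataset.tsv", sep="\t", dtype=str,
                 keep_default_na=False, quoting=csv.QUOTE_NONE)

# One row per causal relation, so deduplicate to get one row per sentence.
sentences = df.drop_duplicates(subset=["url", "sentenceIndex"])
causal = sentences["causal"] == "True"
belief = sentences["belief"] == "True"

print("Sentences:", len(sentences))
print("Causal sentences:", causal.sum())
print("Belief sentences:", belief.sum())
print("Both causal and belief:", (causal & belief).sum())
print("Neither causal nor belief:", (~causal & ~belief).sum())
```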
Finally, sentences are spread out across article publication date like this:
Year | Count |
---|---|
2001 | 35 |
2013 | 189 |
2014 | 8530 |
2015 | 9799 |
2016 | 15743 |
2017 | 216963 |
2018 | 159015 |
2019 | 156083 |
2020 | 185409 |
2021 | 184535 |
2022 | 236369 |
2023 | 193337 |
total | 1366007 |
The dataset contains quite a few columns. Several are intended to address the TPI use case in which causes and effects can be increasing or decreasing.
- url - The URL from which the article was downloaded.
- terms - The term (or terms separated by a space) which led to the page. Articles are deduplicated per source on matching URLs.
- date - The dateline from the article, if found, verbatim. A canonicalized version is now available in a later column.
- sentenceIndex - The index of each sentence per article as tokenized by Eidos.
- sentence - The text of the sentence. Line feeds and tabs have been replaced with spaces to make the tsv file easier to read.
- causal - A Boolean to indicate whether the sentence includes a causal relation.
- causalIndex - A sentence can contain multiple causal relations, so each is numbered and listed separately, with each of the preceding columns duplicated.
- negationCount - A Boolean to indicate whether the relation is negated. That would mean, for example, that it didn't apply or happen.
- causeIncCount - The number of phrases indicating that the cause increased or that there is more of it. They include words such as accelerate, boost, promote, and strengthen. They are called increase_triggers.
- causeDecCount - The number of phrases indicating that the cause decreased or that there is less of it. They include words such as abate, curtail, decrease, and prohibit. They are called decrease_triggers.
- causePosCount - The number of phrases indicating that the cause is positive. Words to that end include aide, ease, relieve, better, and good. They are called positive_affect_triggers.
- causeNegCount - The number of phrases indicating that the cause is negative. Words to that end include challenge, threaten, and worse. They are called negative_affect_triggers.
- effectIncCount - As causeIncCount, but for the effect.
- effectDecCount - As causeDecCount, but for the effect.
- effectPosCount - As causePosCount, but for the effect.
- effectNegCount - As causeNegCount, but for the effect.
- causeText - The text of the cause.
- effectText - The text of the effect.
- belief - A Boolean indicating whether the sentence (with possible help from the previous sentence if something like "they" or "this" needs to be resolved) contains a belief.
- sent_locs - The geophysical locations mentioned in the sentence, if any. This column and the next are comma-separated lists of a region name followed, in parentheses, by the latitude and longitude associated with the region: for example, Abronye (7.69381, -1.9091), Efuanta (5.28527, -2.00557).
- context_locs - Locations mentioned within the previous or next three sentences around the belief.
- canonicalDate - The article's date from the date column is converted here to one of three formats (so far): YYYY-MM-DDTHH:MM:SS or YYYY-MM-DDTHH:MM if a time was specified at all for the article, or YYYY-MM-DD if there was only a date.
Boolean values are written "True" and "False".
If causal is false, then columns for causalIndex, negationCount, causeIncCount, causeDecCount, causePosCount, causeNegCount, effectIncCount, effectDecCount, effectPosCount, and effectNegCount are empty. If there are no locations for sent_locs or context_locs, those columns are also empty.
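To illustrate how these conventions might be handled when reading the file, here is a hedged sketch: the file name is a placeholder, the regular expression assumes location names do not themselves contain commas or parentheses, and the date formats are the three listed above.

```python
import csv
import re
from datetime import datetime

# Location entries look like: Abronye (7.69381, -1.9091), Efuanta (5.28527, -2.00557)
# (assumes names contain no commas or parentheses).
LOC_PATTERN = re.compile(r"([^,(]+?)\s*\(([-\d.]+),\s*([-\d.]+)\)")
# The three canonicalDate formats described above.
DATE_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%dT%H:%M", "%Y-%m-%d")

def parse_bool(value):
    # Boolean columns are written "True" and "False"; empty means not applicable.
    return None if value == "" else value == "True"

def parse_locs(value):
    return [(name.strip(), float(lat), float(lon))
            for name, lat, lon in LOC_PATTERN.findall(value)]

def parse_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    return None  # no usable date

with open("ghana_dataset.tsv", newline="", encoding="utf-8") as f:  # placeholder name
    reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if parse_bool(row["causal"]):
            print(row["causeText"], "->", row["effectText"],
                  parse_locs(row["sent_locs"]), parse_date(row["canonicalDate"]))
```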
Some files are borrowed from other projects:
- belief model - The model should be downloaded automatically when needed, but it can also be done in advance.
- locations file - This is small enough to include with the source code.