Dataset Pipeline
On the wiki pages documenting datasets (Large Ghanaian Dataset, Larger Ghanaian Dataset, Ugandan Dataset), a pipeline for turning a website or local document collection into a dataset is vaguely described. Here that vague description is supplemented with links to actual code for a more concrete explanation, perhaps enough that others can run the pipeline.
Disclaimer: there are plenty of other programs that can be used to download and/or scrape websites and pages, which is how the pipeline starts. The process described here is probably not the world's best way to go about collecting the initial texts. Please explore your options first.
One way to organize the pipeline is to divide it into three parts. It begins with tracking down texts, usually on the internet, downloading them, and extricating them from other accompanying information (e.g., HTML tags, advertising, PDF formatting), which is called scraping here. After that, there is processing, particularly natural language processing (NLP). The resulting "datasets" in the form of TSV (tab-separated values) files were at one time the final output format, but recently we have been adding (uploading) them to an Elasticsearch database running on a server so that datasets can be combined and queried from afar.
The download process has several variations. The most complicated, which is based on site-specific searches, is described first because the others are based on it. There are three stages, and each stage has both a download part and a scrape part. The stages are
- submitting the initial search query and downloading and scraping the preliminary result
- generating queries for the multiple pages of search results and collecting article links from them
- downloading the articles from their links and extracting text along with metadata
More concretely, these are the programs that are called for this variation of this portion of the pipeline:
- Search
  - SearchDownloaderApp - This program processes the file `searchcorpus.txt` (or equivalent) and downloads the responses to the searches specified there to a `searches` directory, which is usually located strategically so that the region and search term are encoded in the path. From the URLs in `searchcorpus.txt` (and similarly for `indexcorpus.txt` and `articlecorpus.txt`) and definitions of domains in the code, downloaders can be created. They are responsible for the specifics of downloading from their sites. The specification of the search in `searchcorpus.txt` can be as simple as `https://www.adomonline.com/?s=galamsey` with a `galamsey` hint, but more complicated scenarios are possible.
  - SearchScraperApp - After the initial response to the search query has been retrieved, it must be extracted from the web page, which is what this program does. It uses the original `searchcorpus.txt` along with the downloads in the `searches` directory to generate a list of "index" pages in `indexcorpus.txt` for the next stage. In scraping, it is primarily trying to find out how many pages of results there are and then using knowledge about the site to generate the URLs to those "index" pages and record them. Pages in all stages (search, index, and article) are processed using scrapers based on scala-scraper, which uses the jsoup HTML parser (a rough sketch follows this list).
- Index
  - IndexDownloaderApp - Next, pages (indexes) of search results need to be downloaded. This program takes the list of pages from `indexcorpus.txt` and writes them to an `indexes` directory, which is usually a sibling of the `searches` directory. Lines of the index corpus are very straightforward and might be from `https://www.adomonline.com/page/1/?s=galamsey` to `https://www.adomonline.com/page/44/?s=galamsey`.
  - IndexScraperApp - Web pages in `indexes` need to be processed in order to find links to the individual articles. The program takes information from file `indexcorpus.txt` and directory `indexes` and produces `articlecorpus.txt` with links to articles. There are typically 10 to 20 articles listed in each index.
- Article
  - ArticleDownloaderApp - For the last download, the program gets the list of articles from `articlecorpus.txt` and writes downloaded articles to an `articles` directory, usually a sibling of the other two download directories. Lines in the corpus file are usually very similar to what someone would type into the address bar of a browser, like `https://www.adomonline.com/galamsey-cocobod-loses-¢4-8-billion-in-western-region/`.
  - ArticleScraperApp - The final scraper is the most complex and is responsible for extracting text as well as metadata from article pages. It can usually only be programmed after observing around 10 articles and figuring out how an article's text is recorded. Afterwards, the processing of around 100 articles is usually observed carefully for signs of deviation from the expected format. The article scrapers also extract metadata like the title, byline, and dateline of the article. Text is almost always embedded in the HTML and not necessarily easy to distinguish from advertisements, links and summaries of related articles, boilerplate page elements, etc. Metadata is often contained in scripts of type `application/ld+json`, which necessitates an additional `json` parser. The primary output of this stage is a `.json` file for each article describing the text and metadata. It is often written right beside the HTML file for an article.
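To make the scraping side more concrete, here is a minimal sketch of how an index or article page can be dissected with scala-scraper (and therefore jsoup). The file names and CSS selectors are hypothetical stand-ins; each real scraper encodes knowledge about its particular site, as described above.

```scala
import java.io.File

import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object ScraperSketchApp extends App {
  val browser = JsoupBrowser()

  // Index scraping: collect candidate article links from a downloaded index page.
  // The selector "article a[href]" is a placeholder for a site-specific selector.
  val indexDoc = browser.parseFile(new File("indexes/page-1.html"))
  val articleLinks = (indexDoc >> elementList("article a[href]"))
    .map(_.attr("href"))
    .distinct
  articleLinks.foreach(println)

  // Article scraping: pull out the embedded metadata of type application/ld+json.
  // A real scraper would parse these strings with a JSON library to recover the
  // title, byline, and dateline, and would also extract the article text itself.
  val articleDoc = browser.parseFile(new File("articles/some-article.html"))
  val ldJsonBlobs = (articleDoc >> elementList("script[type=application/ld+json]"))
    .map(_.innerHtml)
  ldJsonBlobs.foreach(println)
}
```

The same pattern, with different selectors per site and per stage, underlies the search, index, and article scrapers.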
Several variations of the above procedure have been concocted to handle other scenarios. It may help to look through collected corpus.txt files to find good examples.
- Google searches - The specifications in `searchcorpus.txt` aren't limited to links to news sites. They can instead point to Google's custom search API. Users need to supply their own API key, and use may incur charges. Since generic search results might identify documents from any site and then require custom scrapers for each of the sites, it works best to specify a specific, standardized file type, like PDF. If PDFs can be downloaded, then the custom HTML scrapers are not necessary and an extractor from pdf2txt is used instead. The specification in `searchcorpus.txt` might be `https://customsearch.googleapis.com/customsearch/v1?cx=${SEARCH_ENGINE_ID}&fileType=pdf&hq=farming&q=uganda&key=${API_KEY}`. This variation also includes the scraper to convert index pages of partial results into `articlecorpus.txt`. Index pages usually contain 10 results and the cap on free searches is 100 per day, so one can collect up to 1000 documents per day for free with this method (a sketch follows this list).
- Local files - If you bring your own documents (BYOD), skip the search and index stages and, for the article stage, use `file:/<filename>` in `articlecorpus.txt` for your document list. If `ArticleScraperApp` is configured with the correct information (base directory name, search term, etc.), then articles will be taken from local storage. They are expected to be PDFs for now, but other extractors can be arranged based on domain or file name. An entry in `articlecorpus.txt` for this scenario might be `file:/1-s2.0-S0305750X19303407-main.pdf`.
- Sitemaps - Many websites provide lists of all their pages in a sitemap file to help with search engine indexing. If the location of (a file specifying) a sitemap is used in `searchcorpus.txt`, the sitemap will be used to build the lists of indexes and articles. This can result in the identification of large numbers of articles, including large numbers of irrelevant ones. The method should be combined with an ability to search article content after the fact so that relevant information can be identified. The URL in the corpus file normally ends with `robots.txt` in this case.
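As an illustration of the Google variation, the sketch below expands a single custom search query into per-page index URLs by varying the API's `start` parameter (results come back 10 at a time). The environment variable names and output file mirror the example above; treat the details as assumptions rather than the pipeline's actual code.

```scala
import java.io.PrintWriter

object GoogleIndexSketchApp extends App {
  // Assumed to be supplied by the user, as in the searchcorpus.txt example above.
  val searchEngineId = sys.env("SEARCH_ENGINE_ID")
  val apiKey         = sys.env("API_KEY")

  val baseUrl =
    "https://customsearch.googleapis.com/customsearch/v1" +
      s"?cx=$searchEngineId&fileType=pdf&hq=farming&q=uganda&key=$apiKey"

  // The API returns at most 10 results per request and start is 1-based,
  // so ten requests cover the first 100 results for this query.
  val indexUrls = (0 until 10).map(page => s"$baseUrl&start=${page * 10 + 1}")

  val printWriter = new PrintWriter("indexcorpus.txt")
  try indexUrls.foreach(printWriter.println)
  finally printWriter.close()
}
```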
After a corpus of articles has been collected, quite a bit of processing is still required to perform causal analysis, check for beliefs and sentiment, and identify locations. Multiple programs handle these tasks, in part because they are written in different languages and because runtimes are long and system requirements are fairly high. These programs usually run on a server rather than a laptop.
- Step1OutputEidos - Causal analysis is performed by Eidos. This program takes a directory of articles, such as `articles` from above, and for each `.json` file creates a `.jsonld` file with the standard Eidos output, including individual sentences and tokens, causal information, and more. Some of this, like named entities and lemmas, will soon be added to the Elasticsearch database to reduce the amount of computation needed at runtime for interactive applications. By this time, each article generally has 4 files associated with it: the downloaded HTML (or sometimes PDF) file, a slightly structured text file, a structured `.json` file with metadata, and the `.jsonld` file from Eidos.
- Step2InputEidos - This much quicker program reads the `.json` and `.jsonld` files for the articles and combines them into the initial TSV file for a corpus of articles. The output filename has often been in the form of `region-term.tsv`.
- tpi_main.py - The next few stages are implemented in Python, mostly because they access Huggingface sentence transformers which Habitus team members have customized. There is but a single application, tpi_main.py, but it includes several processing stages.
  - TpiResolutionStage - This pipeline stage handles coreference resolution for beliefs.
  - TpiBeliefStage - The belief model gets run here.
  - TpiSentimentStage - After that, the sentiment model is applied.
  - TpiLocationStage - Finally, locations mentioned in sentences are identified. This is made possible in part by geoname collections for regions of interest.
- Step3InterpretDates - The Python program outputs a new TSV often called `region-term-a.tsv`. This program takes the file and converts it to `region-term-b.tsv` by interpreting the `dateline` column and canonicalizing the date. Dates found written in a miscellany of text formats are converted into a standard format suitable for database queries (a sketch follows this list).
- Step4FindNearestLocation - Finally, some calculations are performed on locations so that it is easier to ask questions about where locations are mentioned in the neighborhood of a sentence, in case they still (or already) apply to a sentence that doesn't have a location itself. This feature has seldom been used, but the resulting file is often called `region-term-c.tsv` and is considered the basic output of the entire pipeline.
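The date handling in Step3InterpretDates can be pictured as trying a list of known dateline formats until one parses and then re-serializing the date as ISO-8601, which databases can sort and filter. The sketch below illustrates the idea with java.time; the formats shown are examples, not the actual site-dependent list the pipeline handles.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

object DateSketch {
  // Illustrative dateline formats; the real program recognizes whatever
  // the scraped sites actually use.
  val formats: Seq[DateTimeFormatter] = Seq(
    DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH), // "March 5, 2021"
    DateTimeFormatter.ofPattern("d MMMM yyyy",  Locale.ENGLISH), // "5 March 2021"
    DateTimeFormatter.ofPattern("dd/MM/yyyy",   Locale.ENGLISH)  // "05/03/2021"
  )

  // Returns the canonical ISO-8601 form (yyyy-MM-dd) if any format matches.
  def canonicalize(dateline: String): Option[String] = formats.view
    .flatMap { format => Try(LocalDate.parse(dateline.trim, format)).toOption }
    .headOption
    .map(_.format(DateTimeFormatter.ISO_LOCAL_DATE))
}
```

For example, `DateSketch.canonicalize("March 5, 2021")` yields `Some("2021-03-05")`, the kind of value that is suitable for database queries.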
For various and sundry reasons, including the intended deprecation of the TSV datasets, an alternative processing path exists which combines the above processing stages into as few steps as possible and adds an upload to Elasticsearch.
- Step1OutputEidos - same as above
- Step2InputEidos1App - is similar to the other Step2 processing stage above, but it outputs a minimum number of columns and does not include any that the Python stage ignores anyway.
- tpi_main.py - same as above
- vector_main.py - adds a sentence transformer vector for each sentence
- Step2InputEidos2App - combines the `Step3InterpretDates` and `Step4FindNearestLocation` stages from above into a single stage. In addition, it retrieves any columns not yet handled by the abbreviated `Step2InputEidos1App` and it writes everything it has collected or calculated not to a TSV file but to the Elasticsearch database (a sketch follows below).
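The final upload amounts to indexing one JSON document per record into Elasticsearch. The sketch below does this through the plain REST API with the JDK's HTTP client; the host, index name, and document fields are placeholders, and the actual app presumably uses an Elasticsearch client library and the real column set.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ElasticsearchUploadSketchApp extends App {
  val client = HttpClient.newHttpClient()

  // Hypothetical record: a sentence with a few of the computed fields.
  // The real documents carry many more columns (beliefs, sentiment, vectors, ...).
  val document =
    """{
      |  "url": "https://www.adomonline.com/galamsey-cocobod-loses-¢4-8-billion-in-western-region/",
      |  "sentence": "An example sentence from an article.",
      |  "date": "2021-03-05",
      |  "location": "Western Region"
      |}""".stripMargin

  // POST to /<index>/_doc lets Elasticsearch assign a document id.
  // The "habitus" index name and localhost:9200 are assumptions for this sketch.
  val request = HttpRequest.newBuilder()
    .uri(URI.create("http://localhost:9200/habitus/_doc"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(document))
    .build()

  val response = client.send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()}: ${response.body()}")
}
```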