Dataset Pipeline

Introduction

On the wiki pages documenting datasets (Large Ghanaian Dataset, Larger Ghanaian Dataset, Ugandan Dataset), a pipeline for turning a website or local document collection into a dataset is only vaguely described. Here that description is supplemented with links to the actual code for a more concrete explanation, perhaps concrete enough that others can run the pipeline.

Disclaimer: there are plenty of other programs that can be used to download and/or scrape websites and pages, which is how the pipeline starts. The process described here is probably not the world's best way to go about collecting the initial texts. Please explore your options first.

Pipeline

One way to organize the pipeline is to divide it into three parts. It begins with tracking down texts, usually on the internet, downloading them, and extricating them from other accompanying information (e.g., HTML tags, advertising, PDF formatting), which is called scraping here. After that, there is processing, particularly natural language processing (NLP). The resulting "datasets" in the form of TSV (tab-separated values) files were at one time the final output format, but recently we have been adding (uploading) them to an Elasticsearch database running on a server so that datasets can be combined and queried from afar.

Download

The download process has several variations. The most complicated one, based on site-specific searches, is described first because the others are derived from it. There are three stages, and each stage has both a download part and a scrape part. The stages are

  1. submitting the initial search query and downloading and scraping the preliminary result
  2. generating queries that retrieve the multiple pages of search results and collecting article links from them
  3. downloading the articles from their links and extracting text along with metadata

More concretely, these are the programs that are called for this variation of this portion of the pipeline:

  1. Search
    1. SearchDownloaderApp - This program processes the file searchcorpus.txt (or equivalent) and downloads the responses to the searches specified there into a searches directory, which is usually located strategically so that the region and search term are encoded in the path. From the URLs in searchcorpus.txt (and similarly for indexcorpus.txt and articlecorpus.txt) and definitions of domains in the code, downloaders can be created. They are responsible for the specifics of downloading from their sites. The specification of the search in searchcorpus.txt can be as simple as https://www.adomonline.com/?s=galamsey with a galamsey hint, but more complicated scenarios are possible.
    2. SearchScraperApp - After the initial response to the search query has been retrieved, it must be extracted from the web page, which is what this program does. It uses the original searchcorpus.txt along with downloads in the searches directory to generate a list of "index" pages in indexcorpus.txt for the next stage. In scraping, it is primarily trying to find out how many pages of results there are and then using knowledge about the site to generate the URLs to those "index" pages and record them. Pages in all stages (search, index, and article) are processed using scrapers based on scala-scraper which uses the jsoup HTML parser.
  2. Index
    1. IndexDownloaderApp - Next, pages (indexes) of search results need to be downloaded. This program takes the list of pages from indexcorpus.txt and writes them to an indexes directory, which is usually a sibling of the searches directory. Lines of the index corpus are very straightforward and might be from https://www.adomonline.com/page/1/?s=galamsey to https://www.adomonline.com/page/44/?s=galamsey.
    2. IndexScraperApp - Web pages in indexes need to be processed in order to find links to the individual articles. The program takes information from file indexcorpus.txt and directory indexes and produces articlecorpus.txt with links to articles. There are typically 10 to 20 articles listed in each index.
  3. Article
    1. ArticleDownloaderApp - For the last download, the program gets the list of articles from articlecorpus.txt and writes downloaded articles to an articles directory, usually a sibling of the other two download directories. Lines in the corpus file are usually very similar to what someone would type into the address bar of a browser, like https://www.adomonline.com/galamsey-cocobod-loses-¢4-8-billion-in-western-region/.
    2. ArticleScraperApp - The final scraper is the most complex and is responsible for extracting text as well as metadata from article pages. It can usually only be programmed after observing around 10 articles and figuring out how an article's text is recorded. Afterwards, the processing of around 100 articles is usually observed carefully for signs of deviation from the expected format. The article scrapers also extract metadata like the title, byline, and dateline of the article. Text is almost always embedded in the HTML and not necessarily easy to distinguish from advertisements, links and summaries of related articles, boilerplate page elements, etc. Metadata is often contained in scripts of type application/ld+json, which necessitates an additional JSON parser. The primary output of this stage is a .json file for each article describing the text and metadata. It is often written right beside the HTML file for an article. A sketch of this kind of scraping appears just after this list.
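
To make the scraping stage more concrete, here is a minimal sketch of how an article's text and ld+json metadata might be pulled out with scala-scraper and a JSON parser. The file name, the CSS selectors, the ld+json field, and the choice of ujson are illustrative assumptions rather than the project's actual code; the real scrapers are tailored to each site.

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

object ArticleScrapeSketch extends App {
  val browser = JsoupBrowser()
  // Parse a previously downloaded article page from the articles directory.
  val doc = browser.parseFile(new java.io.File("articles/example-article.html"))

  // Site-specific selectors: these class names are made up and would be chosen
  // only after inspecting roughly ten articles from the target site.
  val title = doc >?> text("h1.entry-title")
  val paragraphs = doc >> texts("div.entry-content p")

  // Metadata is often embedded in an application/ld+json script, which needs a
  // separate JSON parser (ujson is used here purely for illustration).
  val metadata = (doc >?> element("script[type=application/ld+json]"))
    .map(script => ujson.read(script.innerHtml))
  // ld+json payloads vary by site; this assumes a single object with datePublished.
  val dateline = metadata.flatMap(_.obj.get("datePublished")).map(_.str)

  println(title.getOrElse("(no title found)"))
  println(dateline.getOrElse("(no dateline found)"))
  println(paragraphs.mkString("\n"))
}
```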

Several variations of the above procedure have been concocted to handle other scenarios. It may help to look through collected corpus.txt files to find good examples.

  • Google searches - The specifications in searchcorpus.txt aren't limited to links to news sites. They can instead point to Google's custom search API. Users need to supply their own API key, and use may incur charges. Since generic search results might identify documents from any site and then require custom scrapers for each of the sites, it works best to specify a specific, standardized file type, like PDF. If PDFs can be downloaded, then the custom HTML scrapers are not necessary and an extractor from pdf2txt is used instead. The specification in searchcorpus.txt might be https://customsearch.googleapis.com/customsearch/v1?cx=${SEARCH_ENGINE_ID}&fileType=pdf&hq=farming&q=uganda&key=${API_KEY}. This variation also includes the scraper to convert index pages of partial results into articlecorpus.txt. Index pages usually contain 10 results and the cap on free searches is 100 per day so that one can collect up to 1000 documents per day for free with this method.
  • Local files - If you bring your own documents (BYOD), skip the search and index stages and for the article stage use file:/<filename> in articlecorpus.txt for your document list. If ArticleScraperApp is configured with the correct information (base directory name, search term, etc.), then articles will be taken from local storage. They are expected to be PDFs for now, but other extractors can be arranged based on domain or file name. An entry in articlecorpus.txt for this scenario might be file:/1-s2.0-S0305750X19303407-main.pdf.
  • Sitemaps - Many websites provide lists of all their pages in a sitemap file to help with search engine indexing. If the location of (a file specifying) a sitemap is used in searchcorpus.txt, the sitemap will be used to build the lists of indexes and articles. This can result in the identification of large numbers of articles, including large numbers of irrelevant ones. The method should be combined with an ability to search article content after the fact so that relevant information can be identified. The URL in the corpus file normally ends with robots.txt in this case, as illustrated in the sketch after this list.
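
For illustration, the sketch below shows the general sitemap technique: sitemap locations are read from robots.txt and page URLs are then read from the sitemap XML. The site is a made-up placeholder, and the scala-xml dependency is assumed; the project's own sitemap handling may differ in its details.

```scala
import scala.io.Source
import scala.xml.XML

object SitemapSketch extends App {
  // The corpus entry normally points at robots.txt, which lists sitemap locations.
  val robotsUrl = "https://www.example.com/robots.txt" // hypothetical site
  val robotsSource = Source.fromURL(robotsUrl)
  val sitemapUrls =
    try robotsSource.getLines()
      .filter(_.toLowerCase.startsWith("sitemap:"))
      .map(_.drop("sitemap:".length).trim)
      .toList
    finally robotsSource.close()

  // Each sitemap is XML whose <loc> elements hold page URLs; a sitemap index
  // instead lists further sitemaps, which would need one more round of fetching.
  val pageUrls = sitemapUrls.flatMap { sitemapUrl =>
    (XML.load(new java.net.URL(sitemapUrl)) \\ "loc").map(_.text.trim)
  }

  // These URLs become candidate entries for indexcorpus.txt or articlecorpus.txt.
  pageUrls.foreach(println)
}
```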

Process

After a corpus of articles has been collected, quite a bit of processing is still required to perform causal analysis, check for beliefs and sentiment, and identify locations. Multiple programs handle these tasks, in part because they are written in different languages and because runtimes are long and the system requirements are fairly high. These programs usually run on a server rather than a laptop.

  1. Step1OutputEidos - Causal analysis is performed by Eidos. This program takes a directory of articles, such as articles from above, and for each .json file creates a .jsonld file with the standard Eidos output including individual sentences and tokens, causal information, and more. Some of this, like named entities and lemmas, will soon be added to the Elasticsearch database to reduce the amount of computation needed at runtime for interactive applications. By this time, each article generally has four files associated with it: the downloaded HTML (or sometimes PDF) file, a slightly structured text file, a structured .json file with metadata, and the .jsonld file from Eidos.
  2. Step2InputEidos - This much quicker program reads the .json and .jsonld files for the articles and combines them into the initial TSV file for a corpus of articles. The output filename has often been in the form of region-term.tsv.
  3. tpi_main.py - The next few stages are implemented in Python, mostly because they access Huggingface sentence transformers which Habitus team members have customized. There is a single application, tpi_main.py, but it includes several processing stages.
    1. TpiResolutionStage - This pipeline stage handles coreference resolution for beliefs.
    2. TpiBeliefStage - The belief model gets run here.
    3. TpiSentimentStage - After that, the sentiment model is applied.
    4. TpiLocationStage - Finally, locations mentioned in sentences are identified. This is made possible in part by geoname collections for regions of interest.
  4. Step3InterpretDates - The Python program outputs a new TSV often called region-term-a.tsv. This program takes that file and converts it to region-term-b.tsv by interpreting the dateline column and canonicalizing the date. Dates found written in a miscellany of text formats are converted into a standard format suitable for database queries, as sketched after this list.
  5. Step4FindNearestLocation - Finally, some calculations are performed on locations so that it is easier to ask where locations are mentioned in the neighborhood of a sentence, in case they still (or already) apply to a sentence that doesn't mention a location itself. This feature has seldom been used, but the resulting file is often called region-term-c.tsv and is considered the basic output of the entire pipeline.
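
The date canonicalization performed by Step3InterpretDates can be pictured roughly as follows. The object name and the particular formats are only examples for the sketch; the real program maintains whatever list of formats the observed datelines have required.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

object InterpretDatesSketch {
  // A few of the text formats datelines might appear in; the list grows as
  // new sites (and new formats) are encountered.
  val formats: Seq[DateTimeFormatter] = Seq(
    DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH), // January 5, 2023
    DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.ENGLISH),  // 5 January 2023
    DateTimeFormatter.ISO_LOCAL_DATE                             // 2023-01-05
  )

  // Try each known format in turn and emit an ISO-8601 date suitable for
  // database queries, or None if the dateline cannot be interpreted.
  def canonicalize(dateline: String): Option[String] = formats
    .view
    .flatMap(format => Try(LocalDate.parse(dateline.trim, format)).toOption)
    .headOption
    .map(_.format(DateTimeFormatter.ISO_LOCAL_DATE))

  // Example: canonicalize("January 5, 2023") == Some("2023-01-05")
}
```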

Upload

For various and sundry reasons, including the intended deprecation of the TSV datasets, an alternative processing path exists which combines the above processing stages into as few steps as possible and adds an upload to Elasticsearch.

  1. Step1OutputEidos - same as above
  2. Step2InputEidos1App - This program is similar to the other Step2 processing stage above, but it outputs a minimal number of columns and does not include any that the Python stage ignores anyway.
  3. tpi_main.py - same as above
  4. vector_main.py - adds a sentence transformer vector for each sentence
  5. Step2InputEidos2App - combines the Step3InterpretDates and Step4FindNearestLocation stages from above into a single stage. In addition, it retrieves any columns not yet handled by the abbreviated Step2InputEidos1App, and it writes everything it has collected or calculated not to a TSV file but to the Elasticsearch database. A sketch of such an upload follows this list.
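
As an illustration of the final upload, the sketch below posts one document to Elasticsearch through its REST document API using only the JDK's HTTP client. The server address, index name, and document fields are hypothetical placeholders; the real upload uses the project's configuration and the full set of columns assembled by Step2InputEidos2App.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object ElasticsearchUploadSketch extends App {
  // Placeholder server, index, and document fields for the sketch.
  val server = "http://localhost:9200"
  val index = "habitus-sentences"
  val document =
    """{ "url": "https://www.example.com/article", "sentence": "An example sentence.", "sentiment": 0.0 }"""

  // Elasticsearch indexes a JSON document POSTed to /<index>/_doc.
  val client = HttpClient.newHttpClient()
  val request = HttpRequest.newBuilder()
    .uri(URI.create(s"$server/$index/_doc"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(document))
    .build()
  val response = client.send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()} ${response.body()}")
}
```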