
ECCO Preprocessing Harvester Guide

Table of Contents

  • PODAAC
  • NSIDC
  • OSISAF

PODAAC

  • Description
    • PODAAC files are harvested by obtaining a data file’s OPeNDAP URL from a PODAAC XML tree.
  • What it does
    • The PODAAC harvester constructs a URL using the dataset’s PODAAC ID and the start and end dates, all contained within the harvester_config.yaml file. The URL returns an XML document containing entries for the dataset within the specified range. Each entry contains metadata about a specific data file. The metadata used in the harvester code includes an OPeNDAP URL pointing to the actual NetCDF data file, start and end times specifying the time range covered by the data file, and an updated time signifying when the data file was last updated (a sketch of this URL construction and XML parsing follows this section’s list).
    • The harvester iterates through the XML document entries and performs three checks using the Solr database to decide if the entry’s data file should be downloaded:
      • It will download if a Solr entry for this data file does not already exist.
      • It will download if the existing Solr entry was not previously downloaded successfully.
      • It will download if the existing Solr entry was last successfully downloaded prior to an updated data file on PODAAC.
    • If the file needs to be downloaded, the relevant metadata is used to either create or modify a “harvested” type entry in the Solr database. As the XML entries are iterated through, the newly created or modified Solr entries are added to a list. This list of Solr entries is posted to the actual database once the XML iteration is complete. The list contains both the “harvested” type Solr entries and the “descendants” type Solr entries, which are modified at each step of the pipeline (a sketch of the download checks and the batched Solr post follows this section’s list).
    • After the individual granule entries are posted to Solr, the Solr dataset entry (or metadata concerning the dataset as a whole, and the harvesting process) is either created or updated with the current harvest attempt:
      • If the dataset entry does not exist on Solr (meaning this is the first harvest attempt for this dataset), the dataset metadata entry is created and posted to Solr along with the field (or dataset variable) metadata used for the Solr “field” type entries.
      • If the dataset entry already exists on Solr, the dataset entry is updated with new dates for the relevant dataset entry fields (ex: download_time_dt and last_checked_dt).
  • Notes
    • The PODAAC harvester accommodates datasets that come as a single aggregated data file. The aggregated file is continuously updated with contemporaneous data; however, the updated key in the XML file only contains the start date of the data. This means there is no way to tell if any portion of the data prior to the most recent addition has been updated, so the harvester downloads the aggregated file each time it is run.
    • The original URL used to get the XML document is modified to remove the date range, and the aggregated file is downloaded. The downloaded file is opened and the time values are extracted from the NetCDF. Time values that fall within the start and end date range specified in the harvester_config.yaml file are used to create individual data slices, which are saved to disk. Individual harvested (and descendant) type Solr entries are posted to the database for each manually created time slice granule (see the time-slicing sketch below).
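
The snippet below is a minimal sketch of the URL construction and XML iteration described above. The endpoint, query parameter names, the PODAAC ID, and the date range are illustrative placeholders standing in for the values read from harvester_config.yaml, and the way the OPeNDAP link is identified inside each entry is an assumption.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical values normally read from harvester_config.yaml
PODAAC_ID = "PODAAC-EXAMPLE-ID"
START = "2019-01-01T00:00:00Z"
END = "2019-12-31T23:59:59Z"

# Illustrative granule-search URL built from the config values; the real
# harvester constructs a similar URL and pages through the results.
url = (
    "https://podaac.jpl.nasa.gov/ws/search/granule"
    f"?datasetId={PODAAC_ID}&startTime={START}&endTime={END}&format=atom"
)

ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(requests.get(url).text)

for entry in root.findall("atom:entry", ns):
    # Time the data file was last updated on PODAAC
    updated = entry.find("atom:updated", ns).text
    # The OPeNDAP URL is assumed to be the <link> whose title mentions OPeNDAP
    opendap_url = next(
        (link.get("href") for link in entry.findall("atom:link", ns)
         if "OPeNDAP" in (link.get("title") or "")),
        None,
    )
    print(updated, opendap_url)
```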
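
Below is a minimal sketch of the three download checks and the batched Solr post, using Solr’s standard JSON query and update HTTP API. The collection name and the field names (type_s, filename_s, download_success_b, download_time_dt) are assumptions for illustration, not the pipeline’s actual schema.

```python
import requests

SOLR = "http://localhost:8983/solr/ecco_datasets"  # collection name is an assumption


def needs_download(filename, remote_modified_time):
    """Apply the three checks described above to a single granule.

    Field names and ISO-8601 string comparison are illustrative assumptions.
    """
    resp = requests.get(
        f"{SOLR}/select",
        params={"q": f'type_s:harvested AND filename_s:"{filename}"',
                "rows": 1, "wt": "json"},
    ).json()
    docs = resp["response"]["docs"]

    if not docs:                                    # 1) no Solr entry exists yet
        return True
    doc = docs[0]
    if not doc.get("download_success_b", False):    # 2) last download did not succeed
        return True
    # 3) the remote file was updated after the last successful download
    return doc.get("download_time_dt", "") < remote_modified_time


def post_entries(entries):
    """Post the accumulated harvested/descendants entries in one batch."""
    requests.post(f"{SOLR}/update?commit=true", json=entries)
```

The dataset-level Solr entry described above would be created or updated through the same /update endpoint once all of the per-granule entries have been posted.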
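
The aggregated-file handling in the notes above can be pictured with a short xarray sketch. The file path, the “time” coordinate name, and the output layout are assumptions; the real slicing uses the date range from harvester_config.yaml.

```python
from pathlib import Path

import numpy as np
import xarray as xr

# Hypothetical paths and date range; the real values come from harvester_config.yaml
aggregated_path = "aggregated_dataset.nc"
start, end = np.datetime64("2019-01-01"), np.datetime64("2019-12-31")

out_dir = Path("granules")
out_dir.mkdir(exist_ok=True)

ds = xr.open_dataset(aggregated_path)

# Keep only the time steps inside the configured range (assuming a "time"
# coordinate), then write one file per time step so each slice can get its
# own "harvested" and "descendants" Solr entries.
subset = ds.sel(time=slice(start, end))
for t in subset.time.values:
    time_slice = subset.sel(time=[t])
    date_str = np.datetime_as_string(t, unit="D")
    time_slice.to_netcdf(out_dir / f"{date_str}.nc")
```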

NSIDC

  • Description

    • NSIDC files are harvested by iterating through multiple directories on a dataset’s FTP server. The files within the relevant FTP directories are filtered for the desired date range as specified in the harvester_config.yaml file, and those within the range are downloaded before moving to the next directory.
  • What it does

    • The NSIDC harvester iterates through FTP directories by constructing an FTP path that takes the general form of host/dataset/hemisphere/time scale of data/year/. The harvester iterates through the hemispheres relevant to the dataset as well as the years within the range defined by the start and end dates found in the harvester_config.yaml file (a sketch of this directory walk follows this section’s list).
    • The date of the data within each directory is extracted from the filenames, and files outside of the start and end dates found in the harvester_config.yaml file are ignored. The harvester attempts to retrieve the time each file was last modified, but that information is not provided for every dataset; when it cannot be accessed, the harvester defaults to the current time.
    • The harvester iterates through the files within the FTP directory and performs three checks using the Solr database to decide if the entry’s data file should be downloaded:
      • It will download if a Solr entry for this data file does not already exist.
      • It will download if the existing Solr entry was not previously downloaded successfully.
      • It will download if the existing Solr entry was last successfully downloaded prior to an updated data file on the FTP.
    • If the file needs to be downloaded, the relevant metadata is used to either create or modify a “harvested” type entry in the Solr database. As the FTP files are iterated through, the newly created or modified Solr entries are added to a list. This list of Solr entries is posted to the actual database only once the entire FTP iteration is complete (meaning only once all relevant directories have been iterated through). The list contains both the “harvested” type Solr entries and the “descendants” type Solr entries, which are modified at each step of the pipeline.
    • After the individual granule entries are posted to Solr, the Solr dataset entry (or metadata concerning the dataset as a whole, and the harvesting process) is either created or updated with the current harvest attempt:
      • If the dataset entry does not exist on Solr (meaning this is the first harvest attempt for this dataset), the dataset metadata entry is created and posted to Solr along with the field (or dataset variable) metadata used for the Solr “field” type entries.
      • If the dataset entry already exists on Solr, the dataset entry is updated with new dates for the relevant dataset entry fields (ex: download_time_dt and last_checked_dt).
  • Notes

    • Unfortunately, not all datasets available from NSIDC are accessible using a single method. Two of the three datasets currently used in the pipeline are only accessible via FTP, while the third (RDEFT4) uses the more advanced Earthdata Search.
    • The FTP datasets lack any query functionality, so FTP directory paths must be constructed, navigated to, and manually filtered through for the desired data. This is inherently a slower process (both in navigation and download speeds) than PODAAC’s XML and OPeNDAP architecture, as it provides less flexibility, usability, and metadata.
    • The RDEFT4 dataset requires its own version of the NSIDC harvester, as it does not use an FTP repository. The NSIDC website provides an autogenerated Python script for downloading data files. The RDEFT4 harvester uses a modified version of this script to access the data while retaining all of the metadata and Solr functionality used by every other dataset. Functionally it is similar to the main NSIDC harvester, but it uses a CMR search instead of an FTP directory. Each data file has an associated XML file (accessible by adding ‘.xml’ to the data file URL) which contains metadata similar to that provided by PODAAC. RDEFT4 also requires additional login parameters to access the data (a sketch of such a CMR query follows below).
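
A minimal sketch of the FTP directory walk described earlier in this section, assuming an anonymous NSIDC FTP host, a hypothetical dataset directory layout, and a YYYYMMDD date embedded in each filename; all of these values stand in for what the real harvester reads from harvester_config.yaml.

```python
import re
from datetime import datetime
from ftplib import FTP

# Hypothetical configuration values read from harvester_config.yaml
HOST = "sidads.colorado.edu"              # NSIDC FTP host (assumption)
DATASET_DIR = "DATASETS/example_dataset"  # dataset path on the FTP (assumption)
HEMISPHERES = ["north", "south"]
TIME_SCALE = "daily"
START, END = datetime(2019, 1, 1), datetime(2019, 12, 31)

ftp = FTP(HOST)
ftp.login()  # anonymous login

for hemi in HEMISPHERES:
    for year in range(START.year, END.year + 1):
        # host/dataset/hemisphere/time scale of data/year/
        ftp.cwd(f"/{DATASET_DIR}/{hemi}/{TIME_SCALE}/{year}/")
        for filename in ftp.nlst():
            # Assume the file date is embedded in the name as YYYYMMDD
            match = re.search(r"(\d{8})", filename)
            if not match:
                continue
            file_date = datetime.strptime(match.group(1), "%Y%m%d")
            if START <= file_date <= END:
                with open(filename, "wb") as f:
                    ftp.retrbinary(f"RETR {filename}", f.write)
```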
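
The RDEFT4 path can be pictured with the sketch below, which queries NASA’s CMR for granules and fetches the companion ‘.xml’ metadata file. The query parameters, the ‘.nc’ link selection, and the credential handling are assumptions; the real harvester relies on NSIDC’s autogenerated script, which handles the Earthdata URS redirect flow that this sketch glosses over.

```python
import requests

CMR_URL = "https://cmr.earthdata.nasa.gov/search/granules.json"

# Query CMR for RDEFT4 granules in a date range (parameter values are illustrative)
params = {
    "short_name": "RDEFT4",
    "temporal": "2019-01-01T00:00:00Z,2019-12-31T23:59:59Z",
    "page_size": 100,
}
granules = requests.get(CMR_URL, params=params).json()["feed"]["entry"]

# Earthdata login is required to download the data; credentials are placeholders,
# and the real autogenerated script handles the URS redirect flow in more detail.
session = requests.Session()
session.auth = ("EARTHDATA_USERNAME", "EARTHDATA_PASSWORD")

for granule in granules:
    # Assume the data link is the one ending in ".nc"
    data_url = next(link["href"] for link in granule["links"]
                    if link["href"].endswith(".nc"))
    # Each data file has a companion metadata file at the same URL plus ".xml"
    xml_metadata = session.get(data_url + ".xml").text
```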

OSISAF

  • Description

    • OSISAF files are harvested by iterating through multiple directories on a dataset’s FTP server. The files within the relevant FTP directories are filtered for the desired date range as specified in the harvester_config.yaml file, as well as for the type of file, since more than one type can exist within a directory. The FTP server is hosted in Norway, so patience is required when navigating and downloading.
  • What it does

    • The OSISAF harvester iterates through FTP directories by constructing an FTP path that takes the general form of host/dataset/year/month/. The harvester creates a date range from the start and end dates found in the harvester_config.yaml file and iterates through the FTP directories relevant to the dataset.
    • Each directory contains both northern and southern hemisphere data (if applicable) and can contain multiple types of files for the same hemisphere and date combination. The harvester_config.yaml file contains a filename_filter field, which tells the harvester to only look at the files within a directory that contain that string (a sketch of this directory iteration and filtering follows this section’s list).
    • The date of the data within each directory is extracted from the filenames, and files outside of the start and end dates found in the harvester_config.yaml file are ignored. The harvester attempts to retrieve the time each file was last modified, but that information is not provided for every dataset; when it cannot be accessed, the harvester defaults to the current time.
    • The harvester iterates through the files within the FTP directory and performs three checks using the Solr database to decide if the entry’s data file should be downloaded:
      • It will download if a Solr entry for this data file does not already exist.
      • It will download if the existing Solr entry was not previously downloaded successfully.
      • It will download if the existing Solr entry was last successfully downloaded prior to an updated data file on the FTP.
    • If the file needs to be downloaded, the relevant metadata is used to either create or modify a “harvested” type entry in the Solr database. As the FTP files are iterated through, the newly created or modified Solr entries are added to a list. This list of Solr entries is posted to the actual database only once the entire FTP iteration is complete (meaning only once all relevant directories have been iterated through). The list contains both the “harvested” type Solr entries and the “descendants” type Solr entries, which are modified at each step of the pipeline.
    • After the individual granule entries are posted to Solr, the Solr dataset entry (or metadata concerning the dataset as a whole, and the harvesting process) is either created or updated with the current harvest attempt:
      • If the dataset entry does not exist on Solr (meaning this is the first harvest attempt for this dataset), the dataset metadata entry is created and posted to Solr along with the field (or dataset variable) metadata used for the Solr “field” type entries.
      • If the dataset entry already exists on Solr, the dataset entry is updated with new dates for the relevant dataset entry fields (ex: download_time_dt and last_checked_dt).
  • Notes

    • OSISAF is slower to navigate (and download files from) than NSIDC. This seems to be a result of the data being hosted in Norway.
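
A minimal sketch of the host/dataset/year/month/ iteration with the filename_filter applied; the host name, dataset path, and filter string are placeholder assumptions for the values read from harvester_config.yaml.

```python
from datetime import date
from ftplib import FTP

# Hypothetical values normally read from harvester_config.yaml
HOST = "osisaf.met.no"            # OSISAF FTP host (assumption)
DATASET_DIR = "archive/ice/conc"  # dataset path on the FTP (assumption)
FILENAME_FILTER = "polstere-100"  # illustrative filename_filter value
START, END = date(2019, 1, 1), date(2019, 12, 31)

ftp = FTP(HOST)
ftp.login()

# Walk host/dataset/year/month/ for every month in the configured range
year, month = START.year, START.month
while (year, month) <= (END.year, END.month):
    ftp.cwd(f"/{DATASET_DIR}/{year}/{month:02d}/")
    for filename in ftp.nlst():
        if FILENAME_FILTER not in filename:
            continue  # skip other file types in the same directory
        with open(filename, "wb") as f:
            ftp.retrbinary(f"RETR {filename}", f.write)
    month += 1
    if month > 12:
        year, month = year + 1, 1
```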
