The following is an explanation of the dataset processing flow for IATI.cloud.
We use the code4iati dataset metadata and publisher metadata dumps to access all of the available metadata.
- publisher: We index the publisher metadata immediately, as it is flat data.
- dataset: We download the code4iati dataset dump to access all of the available IATI datasets from the IATI Registry. If `update` is true, we check whether the hash has changed from the already indexed datasets. We then loop over the datasets within the dataset metadata dump and trigger the `subtask_process_dataset`. For each dataset, we clean the dataset metadata (where we extract the nested `resources` and `extras`). We then retrieve the filepath of the actual downloaded dataset based on the organisation name and dataset name. We check if the version is valid (in this case version 2). We get the type of the file from the metadata or the file content itself. We then check the dataset validation. Then we clear the existing data from this dataset if it is found in IATI.cloud and the `update` flag is True. Then we trigger the indexing of the actual dataset. Once this is completed, we store the success state of the latter in `iati_cloud_indexed` and we index the entire dataset metadata.
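The hash check and the metadata cleaning described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `id` and `hash` keys, the `key`/`value` layout of `extras`, the dotted output keys, and the assumption that unchanged datasets are skipped are all hypothetical.

```python
def select_datasets_to_process(dump, indexed_hashes, update):
    """Pick the datasets from the metadata dump that need (re)processing.

    dump: list of dataset metadata dicts (hypothetical 'id' and 'hash' keys).
    indexed_hashes: mapping of dataset id -> hash stored by the previous run.
    update: when False, every dataset is selected regardless of its hash.
    """
    selected = []
    for dataset in dump:
        # Assumption: with update=True, an unchanged hash means "skip".
        if update and indexed_hashes.get(dataset["id"]) == dataset["hash"]:
            continue
        selected.append(dataset)
    return selected


def clean_dataset_metadata(metadata):
    """Flatten the nested `resources` and `extras` into top-level keys."""
    cleaned = {
        key: value
        for key, value in metadata.items()
        if key not in ("resources", "extras")
    }
    # Assumption: extras is a list of {"key": ..., "value": ...} dicts.
    for extra in metadata.get("extras", []):
        cleaned[f"extras.{extra['key']}"] = extra["value"]
    # Assumption: only the first resource is relevant for a dataset.
    for resource in metadata.get("resources", [])[:1]:
        for key, value in resource.items():
            cleaned[f"resources.{key}"] = value
    return cleaned
```

In this sketch, a dataset that has never been indexed has no stored hash, so it is always selected.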
First, we parse the IATI XML dataset. We then convert it to a dict using the BadgerFish algorithm.
We apply our cleaning and add custom fields. We then dump the dataset dict into a JSON file. Lastly, we extract the subtypes (budget, result and transactions).
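To illustrate the XML-to-dict step, here is a minimal sketch of a BadgerFish-style conversion using only the Python standard library. It is not the converter IATI.cloud uses; the `badgerfish` function and the sample XML are illustrative, and namespace declarations (`@xmlns`) are not handled.

```python
import xml.etree.ElementTree as ET


def badgerfish(element):
    """Convert an ElementTree element into a BadgerFish-style dict.

    Attributes become '@name' keys, text content goes under '$', and
    repeated child elements with the same tag are collected in a list.
    """
    data = {f"@{name}": value for name, value in element.attrib.items()}
    text = (element.text or "").strip()
    if text:
        data["$"] = text
    for child in element:
        converted = badgerfish(child)
        if child.tag in data:
            existing = data[child.tag]
            if not isinstance(existing, list):
                # Second occurrence of this tag: switch to a list.
                data[child.tag] = existing = [existing]
            existing.append(converted)
        else:
            data[child.tag] = converted
    return data


sample = (
    '<title>'
    '<narrative xml:lang="en">Water project</narrative>'
    '<narrative xml:lang="fr">Projet eau</narrative>'
    '</title>'
)
root = ET.fromstring(sample)
result = {root.tag: badgerfish(root)}
```

Note that ElementTree expands the `xml:lang` attribute to `{http://www.w3.org/XML/1998/namespace}lang`, which is why the converted dict contains the long `@{http://www.w3.org/XML/1998/namespace}lang` key.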
We then recursively clean the dataset. `@` values are removed, `@{http://www.w3.org/XML/1998/namespace}lang` is replaced with `lang`, and key-value fields are extracted. Read more here.
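A recursive cleaning pass of this kind could look like the sketch below. It rests on one possible reading of the description: the leading `@` is stripped from attribute keys and the namespaced language key is renamed to `lang`. The extraction of key-value fields is left out because its exact behaviour is not specified here. The `clean` function is a hypothetical name.

```python
XML_LANG = "@{http://www.w3.org/XML/1998/namespace}lang"


def clean(node):
    """Recursively rename keys in a BadgerFish-style structure."""
    if isinstance(node, list):
        return [clean(item) for item in node]
    if not isinstance(node, dict):
        return node  # plain values are returned unchanged
    cleaned = {}
    for key, value in node.items():
        if key == XML_LANG:
            key = "lang"  # namespaced xml:lang becomes plain 'lang'
        elif key.startswith("@"):
            key = key[1:]  # assumption: only the '@' prefix is dropped
        cleaned[key] = clean(value)
    return cleaned
```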
We have several "custom fields" that we enrich the IATI data with.
- Codelist fields: These fields are 'name' representations of numeric/code values in the IATI Standard. For example, an activity can report `transaction-type.code: 3`. We then enrich the activity with `transaction-type.name: Disbursement`.
- Title narrative: We add a single-valued field with exclusively the first-reported title narrative.
- Common activity dates: We add single-valued common start and end dates, so a start and end date are immediately available without looking through the planned and actual fields.
- Combined policy marker: We add `policy-marker.combined`, which is the policy marker code and its connected significance together.
- Currency conversion: Explained in depth here.
- Dataset metadata: We add interesting dataset metadata fields to the activity.
- Hierarchy default value: "If hierarchy is not reported then 1 is assumed." We ensure this default is enforced.
- JSON dumps: A stringified JSON object of different IATI activity fields.
- Date quarters: For each iso-date reported, we also include a field stating which quarter the date falls in.
- Document link categories: Provides a combined list of all the category codes for each document-link.
- Currency aggregation: We add converted and aggregated values for budgets, disbursements and transactions/transaction subtypes.
- Related activity data to parent activity: This 'raises' related activity budget data from the H2 activities to the H1 activities.
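Three of the simpler custom fields listed above (codelist names, the hierarchy default, and date quarters) can be sketched as follows. This is an illustration under assumptions: the flat dotted keys, the `enrich` and `quarter_of` names, and the one-entry codelist excerpt are hypothetical, and the real enrichment covers many more fields.

```python
from datetime import date

# Excerpt only; the single entry is the example given in the list above.
TRANSACTION_TYPE_NAMES = {"3": "Disbursement"}


def quarter_of(iso_date):
    """Return the calendar quarter (1-4) of an ISO date string."""
    # Only the first 10 characters (YYYY-MM-DD) are parsed, so a
    # trailing timezone marker does not break the conversion.
    return (date.fromisoformat(iso_date[:10]).month - 1) // 3 + 1


def enrich(activity):
    """Return a copy of the activity with illustrative custom fields."""
    enriched = dict(activity)
    # Hierarchy default: 1 is assumed when it is not reported.
    enriched.setdefault("hierarchy", 1)
    # Codelist field: add the name that belongs to the reported code.
    code = enriched.get("transaction-type.code")
    name = TRANSACTION_TYPE_NAMES.get(str(code))
    if name is not None:
        enriched["transaction-type.name"] = name
    return enriched
```

For example, `quarter_of("2021-05-17")` yields `2`, and enriching `{"transaction-type.code": 3}` adds both `hierarchy: 1` and `transaction-type.name: Disbursement`.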
We extract the subtypes to single-valued fields. Read more here.
Each of these is indexed separately into its respective core.
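As a rough sketch of the subtype extraction, the function below pulls budgets, results and transactions out of an activity dict so that each one can be indexed as its own document. The `extract_subtypes` name, the choice to copy `iati-identifier` onto every extracted document, and the flat document shape are assumptions for illustration only.

```python
SUBTYPES = ("budget", "result", "transaction")


def extract_subtypes(activity):
    """Split an activity into one list of documents per subtype."""
    extracted = {}
    for subtype in SUBTYPES:
        items = activity.get(subtype, [])
        if isinstance(items, dict):
            # A single element is not wrapped in a list after conversion.
            items = [items]
        extracted[subtype] = [
            # Assumption: each document keeps a link to its parent activity.
            {"iati-identifier": activity.get("iati-identifier"), **item}
            for item in items
        ]
    return extracted
```

Each list in the returned dict corresponds to one core; a subtype the activity does not report yields an empty list.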
Lastly, if the previous steps were all successful, we index the IATI activity data.