This documents mirrors the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md) and is updated as the specification evolves.
This document explains the structure and content of an ESM Catalog. A catalog provides metadata about the catalog, telling us what we expect to find inside and how to open it. The catalog is described is a single json file, inspired by the STAC spec.
The ESM Catalog specification consists of three parts:
The catalog specification provides metadata about the catalog, telling us what we expect to find inside and how to open it. The descriptor is a single json file, inspired by the STAC spec.
{
"esmcat_version": "0.1.0",
"id": "sample",
"description": "This is a very basic sample ESM catalog.",
"catalog_file": "sample_catalog.csv",
"attributes": [
{
"column_name": "activity_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
},
{
"column_name": "source_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_source_id.json"
}
],
"assets": {
"column_name": "path",
"format": "zarr"
}
}
The collection points to a single catalog. A catalog is a CSV file. The meaning of the columns in the csv file is defined by the parent collection.
activity_id,source_id,path
CMIP,ACCESS-CM2,gs://pangeo-data/store1.zarr
CMIP,GISS-E2-1-G,gs://pangeo-data/store1.zarr
The data assets can be either netCDF or Zarr. They should be either URIs or full filesystem paths.
Element | Type | Description |
---|---|---|
esmcat_version | string | REQUIRED. The ESM Catalog version the catalog implements. |
id | string | REQUIRED. Identifier for the catalog. |
title | string | A short descriptive one-line title for the catalog. |
description | string | REQUIRED. Detailed multi-line description to fully explain the catalog. CommonMark 0.28 syntax MAY be used for rich text representation. |
catalog_file | string | REQUIRED. Path to a the CSV file with the catalog contents. |
catalog_dict | array | If specified, it is mutually exclusive with catalog_file . An array of dictionaries that represents the data that would otherwise be in the csv. |
attributes | [Attribute Object] | REQUIRED. A list of attribute columns in the data set. |
assets | Assets Object | REQUIRED. Description of how the assets (data files) are referenced in the CSV catalog file. |
aggregation_control | Aggregation Control Object | OPTIONAL. Description of how to support aggregation of multiple assets into a single xarray data set. |
An attribute object describes a column in the catalog CSV file. The column names can optionally be associated with a controlled vocabulary, such as the CMIP6 CVs, which explain how to interpret the attribute values.
Element | Type | Description |
---|---|---|
column_name | string | REQUIRED. The name of the attribute column. Must be in the header of the CSV file. |
vocabulary | string | Link to the controlled vocabulary for the attribute in the format of a URL. |
An assets object describes the columns in the CSV file relevant for opening the actual data files.
Element | Type | Description |
---|---|---|
column_name | string | REQUIRED. The name of the column containing the path to the asset. Must be in the header of the CSV file. |
format | string | The data format. Valid values are netcdf , zarr , opendap or reference (kerchunk reference files). If specified, it means that all data in the catalog is the same type. |
format_column_name | string | The column name which contains the data format, allowing for variable data types in one catalog. Mutually exclusive with format . |
An aggregation control object defines neccessary information to use when aggregating multiple assets into a single xarray data set.
Element | Type | Description |
---|---|---|
variable_column_name | string | REQUIRED. Name of the attribute column in csv file that contains the variable name. |
groupby_attrs | array | Column names (attributes) that define data sets that can be aggegrated. |
aggregations | [Aggregation Object] | OPTIONAL. List of aggregations to apply to query results |
An aggregation object describes types of operations done during the aggregation of multiple assets into a single xarray data set.
Element | Type | Description |
---|---|---|
type | string | REQUIRED. Type of aggregation operation to apply. Valid values include: join_new , join_existing , union |
attribute_name | string | Name of attribute (column) across which to aggregate. |
options | object | OPTIONAL. Aggregration settings that are passed as keywords arguments to xarray.concat() or xarray.merge() . For join_existing , it must contain the name of the existing dimension to use (for e.g.: something like {'dim': 'time'} ). |