Merge branch 'develop' of github.com:ecmwf/anemoi-datasets into develop

b8raoult committed Oct 9, 2024
2 parents 7c23b07 + 904e102 commit 0a5fa5e
Showing 18 changed files with 130 additions and 92 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -35,8 +35,9 @@ on:
jobs:
# Run CI including downstream packages on self-hosted runners
downstream-ci:

name: downstream-ci
if: ${{ !github.event.pull_request.head.repo.fork && github.event.action != 'labeled' || github.event.label.name == 'approved-for-ci' }}
if: ${{ !contains(github.repository, 'private') && (!github.event.pull_request.head.repo.fork && github.event.action != 'labeled' || github.event.label.name == 'approved-for-ci') }}
uses: ecmwf-actions/downstream-ci/.github/workflows/downstream-ci.yml@main
with:
anemoi-datasets: ecmwf/anemoi-datasets@${{ github.event.pull_request.head.sha || github.sha }}
@@ -46,7 +47,7 @@ jobs:
# Build downstream packages on HPC
downstream-ci-hpc:
name: downstream-ci-hpc
if: ${{ !github.event.pull_request.head.repo.fork && github.event.action != 'labeled' || github.event.label.name == 'approved-for-ci' }}
if: ${{ !contains(github.repository, 'private') && (!github.event.pull_request.head.repo.fork && github.event.action != 'labeled' || github.event.label.name == 'approved-for-ci') }}
uses: ecmwf-actions/downstream-ci/.github/workflows/downstream-ci-hpc.yml@main
with:
anemoi-datasets: ecmwf/anemoi-datasets@${{ github.event.pull_request.head.sha || github.sha }}
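Both jobs gain a `!contains(github.repository, 'private')` guard wrapped around the previous condition, so the private mirror never triggers downstream CI. The explicit parentheses matter: the `approved-for-ci` label escape hatch stays scoped inside the non-private branch. A rough sketch of the boolean logic in Python (illustrative helper names, not the Actions API):

```python
def should_run_ci(repo: str, is_fork: bool, action: str, label: str) -> bool:
    """Sketch of the updated workflow 'if:' expression."""
    not_private = "private" not in repo
    # Parentheses mirror the new expression: the label escape hatch
    # only applies when the repository is not the private mirror.
    return not_private and ((not is_fork and action != "labeled") or label == "approved-for-ci")

# Fork PRs are skipped unless labeled 'approved-for-ci':
assert not should_run_ci("ecmwf/anemoi-datasets", True, "opened", "")
assert should_run_ci("ecmwf/anemoi-datasets", True, "labeled", "approved-for-ci")
# The private mirror never runs this job:
assert not should_run_ci("ecmwf-lab/anemoi-datasets-private", False, "opened", "")
```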
18 changes: 12 additions & 6 deletions .github/workflows/push-to-private.yml
@@ -1,4 +1,4 @@
name: Push to another repository
name: Push to private repository

on:
push:
@@ -7,21 +7,27 @@ on:

jobs:
push_changes:
if: ${{ !contains(github.repository, 'private') }}
runs-on: ubuntu-latest

steps:
- name: Checkout source repository
uses: actions/checkout@v3
with:
fetch-depth: 0
fetch-tags: true

- name: Set up Git configuration
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
- name: Setup SSH key
uses: webfactory/[email protected]
with:
ssh-private-key: ${{ secrets.KEY_TO_PRIVATE }}

- name: Push changes to private repository
# env:
# MLX_TOKEN: ${{ secrets.MLX_TOKEN }}
run: |
git remote add private https://${{ secrets.MLX_TOKEN }}@github.com/ecmwf-lab/anemoi-datasets-private.git
git fetch private
git push private develop
git remote add private [email protected]:${{ github.repository }}-private.git
git push --set-upstream private develop
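The rewritten push step derives the private mirror's address from the source repository name instead of a hard-coded HTTPS URL with an embedded token, and authenticates via the SSH key loaded in the previous step. The naming convention can be sketched as (illustrative helper; the workflow inlines this with `${{ github.repository }}`):

```python
def private_remote(repository: str) -> str:
    """Mirror naming convention: '<owner>/<repo>' maps to '<owner>/<repo>-private' over SSH."""
    return f"git@github.com:{repository}-private.git"

assert private_remote("ecmwf/anemoi-datasets") == "git@github.com:ecmwf/anemoi-datasets-private.git"
```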
1 change: 1 addition & 0 deletions .github/workflows/python-publish.yml
@@ -9,6 +9,7 @@ on:

jobs:
quality:
if: ${{ !contains(github.repository, 'private') }}
uses: ecmwf-actions/reusable-workflows/.github/workflows/qa-precommit-run.yml@v2
with:
skip-hooks: "no-commit-to-branch"
8 changes: 4 additions & 4 deletions .pre-commit-config.yaml
@@ -5,12 +5,12 @@ repos:
- id: clear-notebooks-output
name: clear-notebooks-output
files: tools/.*\.ipynb$
stages: [commit]
stages: [pre-commit]
language: python
entry: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace
additional_dependencies: [jupyter]
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
rev: v5.0.0
hooks:
- id: check-yaml # Check YAML files for syntax errors only
args: [--unsafe, --allow-multiple-documents]
@@ -40,7 +40,7 @@ repos:
- --force-single-line-imports
- --profile black
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.4
rev: v0.6.9
hooks:
- id: ruff
# Next line is for documentation code snippets
@@ -66,7 +66,7 @@ repos:
- id: docconvert
args: ["numpy"]
- repo: https://github.com/tox-dev/pyproject-fmt
rev: "2.2.3"
rev: "2.2.4"
hooks:
- id: pyproject-fmt

8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -15,6 +15,13 @@ Keep it human-readable, your future self will thank you!
- Update documentation

### Changed

- Add `variables_metadata` entry in the dataset metadata

## [0.5.5](https://github.com/ecmwf/anemoi-datasets/compare/0.5.4...0.5.5) - 2024-10-04

@@ -55,6 +62,7 @@ Keep it human-readable, your future self will thank you!
- Bug fix when creating dataset from zarr
- Bug fix with area selection in cutout operation
- add paths-ignore to ci workflow
- call provenance less often

### Removed

4 changes: 4 additions & 0 deletions src/anemoi/datasets/commands/inspect.py
@@ -506,6 +506,10 @@ def name_to_index(self):
def variables(self):
return self.metadata["variables"]

@property
def variables_metadata(self):
return self.metadata.get("variables_metadata", {})


class Version0_12(Version0_6):
def details(self):
43 changes: 4 additions & 39 deletions src/anemoi/datasets/create/__init__.py
@@ -25,6 +25,7 @@
from anemoi.utils.dates import frequency_to_timedelta
from anemoi.utils.humanize import compress_dates
from anemoi.utils.humanize import seconds_to_human
from anemoi.utils.sanitise import sanitise
from earthkit.data.core.order import build_remapping

from anemoi.datasets import MissingDateError
@@ -52,7 +53,7 @@

LOG = logging.getLogger(__name__)

VERSION = "0.20"
VERSION = "0.30"


def json_tidy(o):
@@ -325,43 +326,6 @@ def build_input_(main_config, output_config):
return builder


def tidy_recipe(config: object):
"""Remove potentially private information in the config"""
config = deepcopy(config)
if isinstance(config, (tuple, list)):
return [tidy_recipe(_) for _ in config]
if isinstance(config, (dict, DotDict)):
for k, v in config.items():
if k.startswith("_"):
config[k] = "*** REMOVED FOR SECURITY ***"
else:
config[k] = tidy_recipe(v)
if isinstance(config, str):
if config.startswith("_"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("s3://"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("gs://"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("http"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("ftp"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("file"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("ssh"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("scp"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("rsync"):
return "*** REMOVED FOR SECURITY ***"
if config.startswith("/"):
return "*** REMOVED FOR SECURITY ***"
if "@" in config:
return "*** REMOVED FOR SECURITY ***"
return config


class Init(Actor, HasRegistryMixin, HasStatisticTempMixin, HasElementForDataMixin):
dataset_class = NewDataset
def __init__(self, path, config, check_name=False, overwrite=False, use_threads=False, statistics_temp_dir=None, progress=None, test=False, cache=None, **kwargs): # fmt: skip
@@ -448,7 +412,7 @@ def _run(self):
metadata.update(self.main_config.get("add_metadata", {}))

metadata["_create_yaml_config"] = self.main_config.get_serialisable_dict()
metadata["recipe"] = tidy_recipe(self.main_config.get_serialisable_dict())
metadata["recipe"] = sanitise(self.main_config.get_serialisable_dict())

metadata["description"] = self.main_config.description
metadata["licence"] = self.main_config["licence"]
@@ -467,6 +431,7 @@ def _run(self):
metadata["data_request"] = self.minimal_input.data_request
metadata["field_shape"] = self.minimal_input.field_shape
metadata["proj_string"] = self.minimal_input.proj_string
metadata["variables_metadata"] = self.minimal_input.variables_metadata

metadata["start_date"] = dates[0].isoformat()
metadata["end_date"] = dates[-1].isoformat()
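This file deletes the local `tidy_recipe` helper and delegates recipe redaction to `sanitise` from `anemoi-utils`. Assuming it behaves like the removed helper (the real implementation lives in `anemoi.utils.sanitise` and may differ), the redaction pass looks roughly like:

```python
from copy import deepcopy

SENSITIVE_PREFIXES = ("s3://", "gs://", "http", "ftp", "file", "ssh", "scp", "rsync", "/", "_")
REDACTED = "*** REMOVED FOR SECURITY ***"

def sanitise_sketch(config):
    """Recursively redact values that look like URLs, paths or credentials."""
    config = deepcopy(config)
    if isinstance(config, (tuple, list)):
        return [sanitise_sketch(c) for c in config]
    if isinstance(config, dict):
        # Keys starting with '_' are treated as private and fully redacted.
        return {k: REDACTED if k.startswith("_") else sanitise_sketch(v) for k, v in config.items()}
    if isinstance(config, str) and (config.startswith(SENSITIVE_PREFIXES) or "@" in config):
        return REDACTED
    return config

assert sanitise_sketch({"url": "s3://bucket/data", "param": "2t"}) == {"url": REDACTED, "param": "2t"}
```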
38 changes: 37 additions & 1 deletion src/anemoi/datasets/create/input/result.py
@@ -30,6 +30,31 @@
LOG = logging.getLogger(__name__)


def _fields_metatata(variables, cube):
assert isinstance(variables, tuple), variables

result = {}
for i, c in enumerate(cube.iterate_cubelets()):
assert c._coords_names[1] == variables[i], (c._coords_names[1], variables[i])
f = cube[c.coords]
md = f.metadata(namespace="mars")
if not md:
md = f.metadata(namespace="default")

if md.get("param") == "~":
md["param"] = f.metadata("param")
assert md["param"] not in ("~", "unknown"), (md, f.metadata("param"))

if md.get("param") == "unknown":
md["param"] = str(f.metadata("paramId", default="unknown"))
# assert md['param'] != 'unknown', (md, f.metadata('param'))

result[variables[i]] = md

assert i + 1 == len(variables), (i + 1, len(variables))
return result


def _data_request(data):
date = None
params_levels = defaultdict(set)
@@ -312,7 +337,10 @@ def _trace_datasource(self, *args, **kwargs):
def build_coords(self):
if self._coords_already_built:
return
from_data = self.get_cube().user_coords

cube = self.get_cube()

from_data = cube.user_coords
from_config = self.context.order_by

keys_from_config = list(from_config.keys())
@@ -359,11 +387,19 @@ def build_coords(self):
self._field_shape = first_field.shape
self._proj_string = first_field.proj_string if hasattr(first_field, "proj_string") else None

self._cube = cube

self._coords_already_built = True

@property
def variables(self):
self.build_coords()
return self._variables

@property
def variables_metadata(self):
return _fields_metatata(self.variables, self._cube)

@property
def ensembles(self):
self.build_coords()
5 changes: 4 additions & 1 deletion src/anemoi/datasets/create/persistent.py
@@ -56,8 +56,11 @@ def items(self):
yield pickle.load(f)

def add_provenance(self, **kwargs):
path = os.path.join(self.dirname, "provenance.json")
if os.path.exists(path):
return
out = dict(provenance=gather_provenance_info(), **kwargs)
with open(os.path.join(self.dirname, "provenance.json"), "w") as f:
with open(path, "w") as f:
json.dump(out, f)

def add(self, elt, *, key):
5 changes: 4 additions & 1 deletion src/anemoi/datasets/create/statistics/__init__.py
@@ -187,8 +187,11 @@ def __init__(self, dirname, overwrite=False):

def add_provenance(self, **kwargs):
self.create(exist_ok=True)
path = os.path.join(self.dirname, "provenance.json")
if os.path.exists(path):
return
out = dict(provenance=gather_provenance_info(), **kwargs)
with open(os.path.join(self.dirname, "provenance.json"), "w") as f:
with open(path, "w") as f:
json.dump(out, f)

def create(self, exist_ok):
6 changes: 5 additions & 1 deletion src/anemoi/datasets/create/zarr.py
@@ -168,7 +168,11 @@ def reset(self, lengths):
return self.create(lengths, overwrite=True)

def add_provenance(self, name):
z = self._open_write()

if name in z.attrs:
return

from anemoi.utils.provenance import gather_provenance_info

z = self._open_write()
z.attrs[name] = gather_provenance_info()
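All three stores (persistent buffer, statistics registry, zarr attributes) now make `add_provenance` idempotent, matching the changelog entry "call provenance less often": the comparatively expensive `gather_provenance_info()` call is skipped when a record already exists. A minimal sketch of the file-based variant, with a stubbed provenance gatherer:

```python
import json
import os
import tempfile

def gather_provenance_info():
    # Stand-in for anemoi.utils.provenance.gather_provenance_info,
    # which is comparatively expensive to call.
    return {"gathered": True}

def add_provenance(dirname, **kwargs):
    """Write provenance.json once; later calls are no-ops."""
    path = os.path.join(dirname, "provenance.json")
    if os.path.exists(path):  # the guard added by this commit
        return
    out = dict(provenance=gather_provenance_info(), **kwargs)
    with open(path, "w") as f:
        json.dump(out, f)

with tempfile.TemporaryDirectory() as d:
    add_provenance(d, config="demo")
    add_provenance(d, config="ignored")  # second call returns early
    with open(os.path.join(d, "provenance.json")) as f:
        assert json.load(f)["config"] == "demo"
```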
5 changes: 4 additions & 1 deletion src/anemoi/datasets/data/forwards.py
@@ -63,6 +63,10 @@ def name_to_index(self):
def variables(self):
return self.forward.variables

@property
def variables_metadata(self):
return self.forward.variables_metadata

@property
def statistics(self):
return self.forward.statistics
@@ -253,7 +257,6 @@ def missing(self):
offset = 0
result = set()
for d in self.datasets:
print("--->", d.missing, d)
result.update(offset + m for m in d.missing)
if self.axis == 0: # Advance if axis is time
offset += len(d)
13 changes: 13 additions & 0 deletions src/anemoi/datasets/data/join.py
@@ -111,6 +111,19 @@ def variables(self):

return result

@cached_property
def variables_metadata(self):
seen = set()
result = {}
for d in reversed(self.datasets):
for v in reversed(d.variables):
while v in seen:
v = f"({v})"
seen.add(v)
result[v] = d.variables_metadata[v]

return result

@cached_property
def name_to_index(self):
return {k: i for i, k in enumerate(self.variables)}
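`Join.variables_metadata` mirrors the existing de-duplication rule for `variables`: when several joined datasets provide the same variable name, the later dataset wins and earlier duplicates are wrapped in parentheses. A standalone sketch of that loop, using plain dicts in place of dataset objects:

```python
def join_variables_metadata(datasets):
    """datasets: list of dicts mapping variable name -> metadata dict.
    Later datasets take precedence; shadowed names become '(name)'."""
    seen = set()
    result = {}
    for d in reversed(datasets):
        for v in reversed(list(d)):
            name = v
            while name in seen:  # rename shadowed duplicates, e.g. '2t' -> '(2t)'
                name = f"({name})"
            seen.add(name)
            result[name] = d[v]
    return result

merged = join_variables_metadata([{"2t": {"src": "a"}}, {"2t": {"src": "b"}, "10u": {"src": "b"}}])
assert merged == {"2t": {"src": "b"}, "10u": {"src": "b"}, "(2t)": {"src": "a"}}
```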
11 changes: 11 additions & 0 deletions src/anemoi/datasets/data/select.py
@@ -76,6 +76,10 @@ def shape(self):
def variables(self):
return [self.dataset.variables[i] for i in self.indices]

@cached_property
def variables_metadata(self):
return {k: v for k, v in self.dataset.variables_metadata.items() if k in self.variables}

@cached_property
def name_to_index(self):
return {k: i for i, k in enumerate(self.variables)}
@@ -108,13 +112,20 @@ def __init__(self, dataset, rename):
super().__init__(dataset)
for n in rename:
assert n in dataset.variables, n

self._variables = [rename.get(v, v) for v in dataset.variables]
self._variables_metadata = {rename.get(k, k): v for k, v in dataset.variables_metadata.items()}

self.rename = rename

@property
def variables(self):
return self._variables

@property
def variables_metadata(self):
return self._variables_metadata

@cached_property
def name_to_index(self):
return {k: i for i, k in enumerate(self.variables)}
4 changes: 4 additions & 0 deletions src/anemoi/datasets/data/stores.py
@@ -302,6 +302,10 @@ def variables(self):
)
]

@property
def variables_metadata(self):
return self.z.attrs.get("variables_metadata", {})

def __repr__(self):
return self.path
