-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
global attributes #23
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start. There are a few things major changes and several minor ones I think would be good to address:
Major Changes
- At the moment, you are accessing a module-level dictionary,
data
. I would prefer that this comes from the user somehow, so you would need to integrate it into theCMORizer
object and pass the that to theRule
. We already haveassign_data_request_to_rule
(or something like that) so you can copy that behaviour. - I would, where possible, move the logic to extract global level attributes that you get from the
DataRequestVariable
directly to theDataRequestVariable
object. In the end, something like this would be nice:
>>> ds = xr.Dataset(...)
>>> rule = Rule(...)
>>> rule.data_request_variable
<DataRequestVariable object at ...>
>>> drv = rule.data_request_variable
>>> ds.attrs.update(drv.get_global_attributes())
Minor Changes
- Docstrings for many of the functions are missing
- You committed some files with your working notes. We don't want those in the repository at the end, or, they should be included in the user handbook.
- Some changes in the YAML files where you changed a key (e.g.
source_id
), still have the old value as a comment. This makes the diffs unclean, so I would rather you just directly change things. - The inclusion in the standard pipeline should happen before manual checkpointing. The manual checkpoint trigger should be directly before computation is called.
Co-authored-by: Paul Gierz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is on the right track, I have some improvement suggestions.
There were some comments from my last review that need to be addressed still, too.
def get_global_attributes(self): | ||
return self.ga.get_global_attributes(self.rule_attrs, self.table_header) | ||
|
||
def set_global_attributes(self, cv, rule_attrs): | ||
self.ga = GlobalAttributes(cv) | ||
self.rule_attrs = rule_attrs | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about for CMIP7DataRequestVariable
and the generic DataRequestVariable
stubs? At the moment, this is only implemented for CMIP6, so I would at least like to see some placeholders in CMIP7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick look at CMIP7 CVs (https://github.com/WCRP-CMIP/CMIP7_CVs/), it is unclear at the moment how the CVs are exposed to the user. At this stage, the structure of json files and how the data is organised looks different from CMIP6 CVs and so the code that reads and loads CIMP6 CVs does not work CMIP7.
Coming back to the your question, for CMIP7 I could add get_global_attributes
and set_global_attributes
methods in CMIP7DataRequestVariable
which invoked will raise NotImplemented
error.
src/pymorize/global_attributes.py
Outdated
d = {name: int(val) for name, val in d.groupdict().items()} | ||
return d | ||
|
||
def _source_id_related(self, rule: dict) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename the argument rule
to be something more specific here to avoid confusion. You're passing in the dictionary of extracted attributes from the rule, not the rule itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this attributed is renamed to attrs_map_on_rule
to avoid confusion.
def set_global_attributes(self, ds, rule): | ||
""" | ||
Set global attributes on a dataset based on the given rule. | ||
|
||
Parameters | ||
---------- | ||
ds : xr.Dataset | ||
The dataset to set the global attributes on. | ||
rule : DataRequestRule | ||
The data request rule to use to set the attributes. | ||
|
||
Returns | ||
------- | ||
ds : xr.Dataset | ||
The dataset with the global attributes set. | ||
""" | ||
d = self.get_global_attributes(rule) | ||
ds.attrs.update(d) | ||
return ds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right place for this. After all, steps are just regular functions...here you have a method of the GlobalAttributes
class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this function needs to go away especially after the integration of global attributes into the framework. I will remove it. To replace this functionality, I guess there must be another piece of code as a part of this module or another module that should be used in the pipeline to set the global attributes
src/pymorize/global_attributes.py
Outdated
d = {name: int(val) for name, val in d.groupdict().items()} | ||
return d | ||
|
||
def _source_id_related(self, rule: dict) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a doc-string and some examples here, this is a big enough function to deserve some testing.
src/pymorize/global_attributes.py
Outdated
assert model_component in model_components | ||
else: | ||
raise ValueError("Missing required attribute 'model_component'") | ||
grid = model_components[model_component]["description"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure about this one? grid
is the model component's description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bit unclear what should be the value for this from the documentation. This needs to be cross-checked with previously generated output from (maybe) seamore tool.
The `next(iter(thing))` is to get the first element from the thing. The simpler way to get the first element is simply index accessing `thing[0]` but over time my style of programming has moved from away from index access to iter based thing basically to avoid hardcoding 0 or 1 or 2. May be it is matter of taste but more importantly I like the way it is re-written sounds more meaningful to follow the code and I like it. Co-authored-by: Paul Gierz <[email protected]>
else: | ||
if len(inst_id) > 1: | ||
# No user-provided institution_id; infer it | ||
if len(cv_institution_ids) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can make it abundantly clear here and do something like:
if len(cv_institution_ids) > 1: | |
if isinstance(inst_id, list) and len(cv_institution_ids) > 1: |
Lower down, you could also do an instance check for list. That would avoid problems with you accidentally getting a string, or something like that.
...just an idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cv_institution_ids
is the institution_ids from CMIP6 CVs. To do a isinstance
check for list
in other terms mean checking the schema defined in json for this field. This can be done for all the fields but it steers the focus on a different direction. What make sense is to have module that does schema checking but should that be a part of pymorize. Data source schema validation. plz comment
addresses issue #16