Proposal for data validation syntax #104
@phackstock @gunnar-pik @Renato-Rodrigues @orichters @robertpietzcker - please let me know if this is a useful step towards automated validation of scenario submissions... |
Maybe for your inspiration, @pweigmann has worked on a similar approach with a config file that looks like this. I like the following features of our approach:
|
Thanks @orichters, yes, I've seen your format before and we want to develop in this direction too (and I hope that the yaml file is less heavy and more reliable for forward/backward compatibility).
This is already implemented where
I didn't consider it yet, but you can pass a "model" or "scenario" filter argument.
Very useful suggestions, to be implemented in the future. |
Thanks @danielhuppmann very useful! Yes, great to loop in @pweigmann, who started this for COMMITTED and will also be involved in the SCI project, and also @PhilippVerpoort, who will join SCI as well. |
Hello @danielhuppmann , always fascinating to see when different people come up with a similar solution to the same problem, it does invoke confidence that this type of tool can be useful! On the other hand, it also means a lot of parallel work in different languages, I suppose. You can follow the current development efforts of our validation tool here: https://github.com/pik-piam/piamValidation Don't hesitate to reach out in case you would like to exchange ideas or learn more about what we have done so far, I could see this being a great area for collaboration. |
Based on further discussions with @phackstock, I have modified the PR and the description (see at the top) to include a way to import a csv file but minimize duplication of columns/rows. I also switched from |
Looks very good to me, would be happy to implement it like this. If we wanted (which I'm not sure we do) we could try to make the syntax of the validation file more compact.

- Emissions|CO2|Energy and Industrial Processes:
    - region: World
      rtol: 5%
      file: data_emissions_global.csv
    - region: Asia (R5)
      year: 2020
      rtol: 10%
      value: 20520

This would save 3 lines compared to the current proposal. If it makes readability worse we should stick to the current format though. |
Thanks @phackstock - I'm hesitant to define any dimension implicitly: first, I think it's better for readability to always write "variable: ...", and second, we may run into a use case where the variable is not the primary sorting dimension, which will then make life difficult... |
@danielhuppmann fair point about the variable. Regarding your point on having a use case where the variable is not the main dimension I'm not sure if we'd want to put everything into the same file anyway. If we're trying to make one format that fits every possible use case I'm afraid we'd end up with something pretty unwieldy. What do you think about my second point of moving the constraints into a list rather than having to give them names?

- Historical fossil CO2 emissions data:
    variable: Emissions|CO2|Energy and Industrial Processes
    constraints:
      - region: World
        rtol: 5%
        file: data_emissions_global.csv
      - region: Asia (R5)
        year: 2020
        rtol: 10%
        value: 20520

instead of:

- Historical fossil CO2 emissions data:
    variable: Emissions|CO2|Energy and Industrial Processes
    World:
      region: World
      rtol: 5%
      file: data_emissions_global.csv
    Asia (R5):
      region: Asia (R5)
      year: 2020
      rtol: 10%
      value: 20520

to me, using |
Short note: I think it is important that we can use multiple threshold levels, especially as we go to the vetting of near-term projections - higher and lower, and also soft ones (yellow traffic light) and hard constraints (red traffic light). So would this be added as lim_lower_yellow, lim_upper_red or similar? |
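
To make the traffic-light idea concrete, here is a minimal sketch of how a value could be classified against soft and hard limits. It is not part of the PR: the keys `lim_lower_yellow` and `lim_upper_red` are taken from the comment above, `lim_lower_red` and `lim_upper_yellow` are added by analogy, and the function name `classify` and all numbers are hypothetical.

```python
def classify(value, limits):
    """Return 'red', 'yellow' or 'green' for a value against optional limits.

    The key names are assumptions following the naming suggested in the
    comment above; any limit that is missing is simply not checked.
    """
    lower_red = limits.get("lim_lower_red")
    upper_red = limits.get("lim_upper_red")
    lower_yellow = limits.get("lim_lower_yellow")
    upper_yellow = limits.get("lim_upper_yellow")

    # Hard constraints are checked first so that they take precedence.
    if lower_red is not None and value < lower_red:
        return "red"
    if upper_red is not None and value > upper_red:
        return "red"
    # Soft constraints only produce a warning level.
    if lower_yellow is not None and value < lower_yellow:
        return "yellow"
    if upper_yellow is not None and value > upper_yellow:
        return "yellow"
    return "green"


# Illustrative limits only, not actual vetting thresholds.
limits = {
    "lim_lower_red": 30000,
    "lim_lower_yellow": 32000,
    "lim_upper_yellow": 38000,
    "lim_upper_red": 40000,
}
levels = [classify(v, limits) for v in (35000, 39000, 29000)]
```

Checking the red limits before the yellow ones means a value outside both ranges is always reported at the more severe level.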
This PR proposes a syntax for data validation as part of the scenario-processing infrastructure.
This PR is intended as a minimum viable product for scenario data validation. This feature is not yet supported by the nomenclature package, but will be added as a new class DataValidator once we reach agreement about the syntax.
The proposed syntax tries to strike a balance between readability and flexibility, using a nested yaml-style syntax to define
Any datapoint in an IAMC-style timeseries format matching the given filters must satisfy the bounds, otherwise an error is raised. The structure directly matches the signature of the method IamDataFrame.validate() so that the implementation can build on the existing functionality. For simplicity, alternative kwargs (value, rtol) will be added to the validate() method for more direct configuration.
The syntax works as follows:
file attribute in the yaml dictionary to import validation attributes from a csv file (with # as comment)
This structure will yield four validation items:
The name could be used when reporting failed validation of a scenario.
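
As an illustration of the mechanics described here, the following sketch expands a named entry with a `file` attribute into individual validation items and checks data points against `value` and `rtol`. It is not the planned DataValidator implementation and does not call pyam: the helper names (`parse_rtol`, `read_csv_rows`, `expand`, `find_failures`), the csv columns and all numbers are assumptions made for this example, and the csv content is held in a string instead of a real file. With the two-row csv used here the expansion gives three items, which is not meant to reproduce the four items of the example in this description.

```python
import csv

# Hypothetical csv content standing in for data_emissions_global.csv;
# the column names and the numbers are illustrative assumptions.
CSV_FILES = {
    "data_emissions_global.csv": (
        "# reference values, lines starting with '#' are comments\n"
        "year,value\n"
        "2015,35000\n"
        "2020,34000\n"
    )
}

VARIABLE = "Emissions|CO2|Energy and Industrial Processes"


def parse_rtol(text):
    """Convert a percentage string such as '5%' into a fraction (0.05)."""
    return float(text.rstrip("%")) / 100


def read_csv_rows(text):
    """Return csv rows as dicts, skipping blank lines and '#' comments."""
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith("#")
    ]
    return [
        {"year": int(row["year"]), "value": float(row["value"])}
        for row in csv.DictReader(lines)
    ]


def expand(name, variable, constraints):
    """Turn one named entry into a flat list of validation items."""
    items = []
    for constraint in constraints:
        base = {k: v for k, v in constraint.items() if k not in ("file", "rtol")}
        base.update(name=name, variable=variable, rtol=parse_rtol(constraint["rtol"]))
        if "file" in constraint:
            # One item per csv row; the row supplies year and value.
            rows = read_csv_rows(CSV_FILES[constraint["file"]])
        else:
            rows = [{}]
        items.extend({**base, **row} for row in rows)
    return items


def find_failures(items, data):
    """Return (name, region, year) for each data point outside its tolerance."""
    failures = []
    for item in items:
        key = (item["variable"], item["region"], item["year"])
        if key not in data:
            continue  # no matching data point, nothing to validate
        if abs(data[key] - item["value"]) > item["rtol"] * abs(item["value"]):
            failures.append((item["name"], item["region"], item["year"]))
    return failures


items = expand(
    "Historical fossil CO2 emissions data",
    VARIABLE,
    [
        {"region": "World", "rtol": "5%", "file": "data_emissions_global.csv"},
        {"region": "Asia (R5)", "year": 2020, "rtol": "10%", "value": 20520},
    ],
)

# Made-up scenario data keyed by (variable, region, year).
scenario_data = {
    (VARIABLE, "World", 2015): 35500.0,
    (VARIABLE, "World", 2020): 37000.0,
    (VARIABLE, "Asia (R5)", 2020): 21000.0,
}
failures = find_failures(items, scenario_data)
```

Each failure carries the name of the entry it came from, so a report can point back to the corresponding block in the validation file.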
Going forward, we can also implement more features