Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CorrectionSet containing CorrectionSet #55

Open
IzaakWN opened this issue Mar 22, 2021 · 8 comments
Open

CorrectionSet containing CorrectionSet #55

IzaakWN opened this issue Mar 22, 2021 · 8 comments
Labels
enhancement New feature or request schema Issues related to the schema definition

Comments

@IzaakWN
Copy link
Contributor

IzaakWN commented Mar 22, 2021

@nsmith- @gouskos
In the XPOG README.md you see that they expect a structure something like this, right?

CorrectionSet:year -> CorrectionSet:sf

In this way you could load one Corrections object like cset[year][sf], where sf for the TauPOG is something like DeepTauVSjet, DeepTauVSmu, tau_energy_scale, or whatever.
The only problem is that currently, it does not seem like a correction set can contain a list of correction sets:

year1 = CorrectionSet(schema_version=2,corrections=[corr1,corr2])
year2 = CorrectionSet(schema_version=2,corrections=[corr3,corr4])
cset = CorrectionSet(schema_version=2,corrections=[year1,year2])

Also see

corrections: List[Correction]
and
if ( const auto& items = getOptional<rapidjson::Value::ConstArray>(json, "corrections") ) {
for (const auto& item : *items) {
auto corr = std::make_shared<Correction>(item);
corrections_[corr->name()] = corr;
}
}

The problem is that each type of correction can have totally different inputs, and you want to avoid to pass to the evaluator the full list of inputs that covers all cases, nor do you want to load the whole collection of SF, if you only need one or two. I see two possible solutions:

  1. Correction sets can contain correction sets, so you load a type of correction via cset[year][sf].
  2. Otherwise, we should not group different types of corrections into one big JSON file in the XPOG repo.
@gouskos
Copy link

gouskos commented Mar 22, 2021

@IzaakWN @nsmith-
I agree that this would be a nice functionality. Between the 2 options you suggest, I would probably be in favour of 1.

@nsmith-
Copy link
Collaborator

nsmith- commented Mar 22, 2021

Option 1 implies a schema change. Are the sets to be arbitrarily deep? I originally thought we might have the year be an input, assuming that for a given correction the inputs would be the same or very similar across run years?
Of course there is option 3, putting the year in the name. I don't think we want that though :)

@gouskos
Copy link

gouskos commented Mar 22, 2021

@IzaakWN @nsmith- concerning this: "want to load the whole collection of SF, if you only need one or two"
Does loading the whole collection makes a significant difference?

@gouskos
Copy link

gouskos commented Mar 22, 2021

Option 1 implies a schema change. Are the sets to be arbitrarily deep? I originally thought we might have the year be an input, assuming that for a given correction the inputs would be the same or very similar across run years?

If I understand correctly, the use cases that @IzaakWN would like to capture is for example, a correction in one year needs additional input variables, right?

Of course there is option 3, putting the year in the name. I don't think we want that though :)

Yes - I agree :)

@IzaakWN
Copy link
Contributor Author

IzaakWN commented Mar 24, 2021

@nsmith-

Option 1 implies a schema change.

Yes, some type of base class or dynamic casting?

Are the sets to be arbitrarily deep?

No, in our own use case it would be only two layers: year and then "SF type", see below.

@gouskos

Does loading the whole collection makes a significant difference?

No, at least for tau corrections, it is negligible.

What is more important is that the TauPOG has many different types of corrections (SFs for DeepTauVSjet, SFs for DeepTauVSe, trigger SFs, tau energy scales, ...), that depend on different sets of tau variables (pt, eta, decay mode, ...) [1]. If we want one single JSON file tau.json stores one correction object per year we need to group these completely different types into one correction object.

Option 3 is similar to option 2, in which case I would do it similar to BTV and JME, who split their JSONs into different jet types: One JSON file per correction type, which is a correction set that contains one correction object per year:

DeepTau2017v2p1VSjet.json
DeepTau2017v2p1VSe.json
DeepTau2017v2p1VSe.json
MVAoldDM2017v2.json
...
tau_energy_scale.json

Users load it as

cset1 = load("DeepTau2017v2p1VSjet.json")
cset2 = load("tau_energy_scale.json")
corr1 = cset1["2018_UL"]
corr2 = cset2["2018_UL"]

Without the year in the name and because we can now merge certain SF thanks to the versatile JSON schema, this is already a significant reduction in the number of files that we had before in our repos [2–4], but If we don't want this, I would propose option 1, where users load as follows:

cset = load("tau.json")
corr1 = cset["2018_UL"]["DeepTau2017v2p1VSjet"]
corr2 = cset["2018_UL"]["tau_energy_scale"]

Both have the same downside of what has been discussed before: Analyzers need to match the right string in either the filename or in the key. However, this seems unescapable to me, because they anyway need to know which tau ID or what type of correction to select. The question is if you want to show it in the file name, or "hide" it in the correction set so users have to browse for it. My personal preference would be option 2.

[1] Slide 11 in https://indico.cern.ch/event/1020470/#2-cms-universal-json-format-fo

[2] Slide 20 in https://indico.cern.ch/event/1020470/#2-cms-universal-json-format-fo

[3] https://github.com/cms-tau-pog/TauIDSFs/tree/master/data

[4] https://github.com/cms-tau-pog/TauTriggerSFs/tree/master/data

@nsmith- nsmith- added enhancement New feature or request schema Issues related to the schema definition labels Apr 8, 2021
@nsmith- nsmith- added this to the Schema v2 milestone Apr 8, 2021
@nsmith-
Copy link
Collaborator

nsmith- commented Apr 8, 2021

I'm making this a v2 item before we release the final version. My reluctance to just immediately put

class CorrectionSet(Model):
    schema_version: Literal[VERSION] = Field(description="The overall schema version")
    corrections: List[Union[Correction, CorrectionSet]]

is that if we ever do end up with some sort of database that can be queried, now the query key needs to be able to support some sort of nesting syntax. Perhaps we forbid / in a Correction name now and reserve that as a delimiter? That might be a nice shorthand then, e.g. cset["2018_UL/tau_energy_scale"] instead of cset["2018_UL"]["tau_energy_scale"]. But I'm not totally convinced we cannot just do this by having several folders in cvmfs/whatever filesystem. If the expectation is that users will only ever access one type of several, it makes more sense to keep them in separate files than glue them all together.

@IzaakWN
Copy link
Contributor Author

IzaakWN commented Apr 28, 2021

Just for the record: We discussed this issue in XPOG, and opted for "Option 4", which is using subdirectories in the XPOG GitLab repo like,

POG/TAU/2016preVFB_UL/tau.json
POG/TAU/2016postVFB_UL/tau.json
...
POG/TAU/2018_UL/tau.json

which is a fast and elegant alternative to nested CorrectionSet objects: https://indico.cern.ch/event/1019415/#6-common-json-format-tau-pog

However, I still like the idea of nested CorrectionSet as a feature in the future if we ever get around to it. :p

@nsmith- nsmith- removed this from the Schema v2 milestone Apr 28, 2021
@nsmith-
Copy link
Collaborator

nsmith- commented Apr 28, 2021

Ok, it may well be that we find nested sets useful after gaining experience. But at least for now let's move it to a later schema version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request schema Issues related to the schema definition
Development

No branches or pull requests

3 participants