CorrectionSet containing CorrectionSet #55

IzaakWN · 2021-03-22T09:42:22Z

@nsmith- @gouskos
In the XPOG README.md you see that they expect a structure something like this, right?

CorrectionSet:year -> CorrectionSet:sf

In this way you could load one Correction object like cset[year][sf], where sf for the TauPOG is something like DeepTauVSjet, DeepTauVSmu, tau_energy_scale, or whatever.
The only problem is that currently, it does not seem like a correction set can contain a list of correction sets:

year1 = CorrectionSet(schema_version=2,corrections=[corr1,corr2])
year2 = CorrectionSet(schema_version=2,corrections=[corr3,corr4])
cset = CorrectionSet(schema_version=2,corrections=[year1,year2])

Also see

correctionlib/src/correctionlib/schemav2.py

Line 217 in 9c8d10f

corrections: List[Correction]

and

correctionlib/src/correction.cc

Lines 450 to 455 in 9c8d10f

    
if ( const auto& items = getOptional<rapidjson::Value::ConstArray>(json, "corrections") ) {
  for (const auto& item : *items) {
    auto corr = std::make_shared<Correction>(item);
    corrections_[corr->name()] = corr;
  }
}

The problem is that each type of correction can have totally different inputs. You do not want to pass the evaluator a full list of inputs that covers all cases, nor do you want to load the whole collection of SFs if you only need one or two. I see two possible solutions:

1. Correction sets can contain correction sets, so you load a type of correction via cset[year][sf].
2. Otherwise, we should not group different types of corrections into one big JSON file in the XPOG repo.
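To make the limitation concrete, here is a minimal sketch in plain Python. It is not the correctionlib implementation: `Correction` and `FlatSet` are hypothetical stand-ins, and the input names are only illustrative. `FlatSet` mirrors the name-keyed loop in correction.cc, so it rejects another set as an item, while a plain dict of flat sets emulates the cset[year][sf] access that option 1 asks for.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Correction:
    """Hypothetical stand-in for a single named correction."""
    name: str
    inputs: List[str]


class FlatSet:
    """Hypothetical flat container: every item must be a Correction."""

    def __init__(self, corrections: List[Correction]) -> None:
        self._by_name: Dict[str, Correction] = {}
        for corr in corrections:
            if not isinstance(corr, Correction):
                raise TypeError(f"expected Correction, got {type(corr).__name__}")
            self._by_name[corr.name] = corr

    def __getitem__(self, name: str) -> Correction:
        return self._by_name[name]


# Illustrative corrections with different input lists.
vsjet = Correction("DeepTauVSjet", ["pt", "dm", "genmatch"])
tes = Correction("tau_energy_scale", ["pt", "eta", "dm"])

year1 = FlatSet([vsjet, tes])  # fine: a flat list of corrections

try:
    FlatSet([year1])  # a set inside a set is rejected
    nested_ok = True
except TypeError:
    nested_ok = False

# What option 1 would allow, emulated with a plain dict of flat sets.
cset = {"2018_UL": year1}
corr = cset["2018_UL"]["tau_energy_scale"]
print(nested_ok, corr.inputs)
```

Each correction keeps its own input list, so the evaluator for tau_energy_scale never sees the inputs that only DeepTauVSjet needs.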


gouskos · 2021-03-22T11:49:42Z

@IzaakWN @nsmith-
I agree that this would be a nice functionality. Between the 2 options you suggest, I would probably be in favour of 1.

nsmith- · 2021-03-22T14:45:43Z

Option 1 implies a schema change. Are the sets to be arbitrarily deep? I originally thought we might have the year be an input, assuming that for a given correction the inputs would be the same or very similar across run years?
Of course there is option 3, putting the year in the name. I don't think we want that though :)
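The "year as an input" idea could look roughly like the sketch below, written in plain Python rather than against correctionlib. The function name `tau_energy_scale`, the table `TES_BY_YEAR`, and all numbers are invented placeholders. The only point is that a single correction takes the year as an ordinary input, so no nesting is needed as long as the remaining inputs are the same for every year.

```python
from typing import Dict

# Placeholder values only, keyed by year and then by decay mode.
TES_BY_YEAR: Dict[str, Dict[int, float]] = {
    "2017_UL": {0: 1.00, 1: 1.01},
    "2018_UL": {0: 0.99, 1: 1.02},
}


def tau_energy_scale(year: str, dm: int) -> float:
    """Look up a scale factor with the year treated as an ordinary input."""
    try:
        return TES_BY_YEAR[year][dm]
    except KeyError:
        raise ValueError(f"no entry for year={year!r}, dm={dm}") from None


print(tau_energy_scale("2018_UL", 1))
```

This stops being convenient once one year needs an additional input, because every call would then have to carry that input.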

gouskos · 2021-03-22T17:40:04Z

@IzaakWN @nsmith- concerning this: "want to load the whole collection of SF, if you only need one or two"
Does loading the whole collection make a significant difference?

gouskos · 2021-03-22T17:43:44Z

> Option 1 implies a schema change. Are the sets to be arbitrarily deep? I originally thought we might have the year be an input, assuming that for a given correction the inputs would be the same or very similar across run years?

If I understand correctly, the use case that @IzaakWN would like to capture is, for example, a correction that needs additional input variables in one year, right?

> Of course there is option 3, putting the year in the name. I don't think we want that though :)

Yes - I agree :)

IzaakWN · 2021-03-24T11:35:01Z

@nsmith-

> Option 1 implies a schema change.

Yes, some type of base class or dynamic casting?

> Are the sets to be arbitrarily deep?

No, in our own use case it would be only two layers: year and then "SF type", see below.

@gouskos

> Does loading the whole collection make a significant difference?

No, at least for tau corrections, it is negligible.

What is more important is that the TauPOG has many different types of corrections (SFs for DeepTauVSjet, SFs for DeepTauVSe, trigger SFs, tau energy scales, ...) that depend on different sets of tau variables (pt, eta, decay mode, ...) [1]. If we want a single JSON file tau.json that stores one correction object per year, we need to group these completely different types into one correction object.

Option 3 is similar to option 2, in which case I would follow BTV and JME, who split their JSONs by jet type: one JSON file per correction type, each being a correction set that contains one correction object per year:

DeepTau2017v2p1VSjet.json
DeepTau2017v2p1VSe.json
DeepTau2017v2p1VSmu.json
MVAoldDM2017v2.json
...
tau_energy_scale.json

Users load it as

cset1 = load("DeepTau2017v2p1VSjet.json")
cset2 = load("tau_energy_scale.json")
corr1 = cset1["2018_UL"]
corr2 = cset2["2018_UL"]

Without the year in the name, and because we can now merge certain SFs thanks to the versatile JSON schema, this is already a significant reduction in the number of files compared to what we had before in our repos [2–4]. But if we don't want this, I would propose option 1, where users load as follows:

cset = load("tau.json")
corr1 = cset["2018_UL"]["DeepTau2017v2p1VSjet"]
corr2 = cset["2018_UL"]["tau_energy_scale"]

Both have the same downside that has been discussed before: analyzers need to match the right string, either in the filename or in the key. However, this seems unavoidable to me, because they need to know which tau ID or what type of correction to select anyway. The question is whether you want to show it in the file name, or "hide" it in the correction set so users have to browse for it. My personal preference would be option 2.

[1] Slide 11 in https://indico.cern.ch/event/1020470/#2-cms-universal-json-format-fo

[2] Slide 20 in https://indico.cern.ch/event/1020470/#2-cms-universal-json-format-fo

[3] https://github.com/cms-tau-pog/TauIDSFs/tree/master/data

[4] https://github.com/cms-tau-pog/TauTriggerSFs/tree/master/data

nsmith- · 2021-04-08T14:57:30Z

I'm making this a v2 item before we release the final version. My reluctance to just immediately put

class CorrectionSet(Model):
    schema_version: Literal[VERSION] = Field(description="The overall schema version")
    corrections: List[Union[Correction, CorrectionSet]]

is that if we ever do end up with some sort of database that can be queried, now the query key needs to be able to support some sort of nesting syntax. Perhaps we forbid / in a Correction name now and reserve that as a delimiter? That might be a nice shorthand then, e.g. cset["2018_UL/tau_energy_scale"] instead of cset["2018_UL"]["tau_energy_scale"]. But I'm not totally convinced we cannot just do this by having several folders in cvmfs/whatever filesystem. If the expectation is that users will only ever access one type of several, it makes more sense to keep them in separate files than glue them all together.
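As a rough illustration of the delimiter idea, the sketch below uses a hypothetical `NestedSet` class that is not part of correctionlib. It forbids / in names and lets a single path-like key stand in for chained lookups. Plain strings take the place of real correction objects.

```python
from typing import Dict, Union

Node = Union["NestedSet", str]  # str stands in for a real correction object


class NestedSet:
    """Hypothetical nested container whose names may not contain '/'."""

    def __init__(self, items: Dict[str, Node]) -> None:
        for name in items:
            if "/" in name:
                raise ValueError(f"'/' is reserved as a delimiter: {name!r}")
        self._items = dict(items)

    def __getitem__(self, key: str) -> Node:
        # Split off the first path component and recurse on the remainder.
        head, sep, rest = key.partition("/")
        node = self._items[head]
        if not sep:
            return node
        if not isinstance(node, NestedSet):
            raise KeyError(key)
        return node[rest]


cset = NestedSet({
    "2018_UL": NestedSet({"tau_energy_scale": "TES 2018_UL"}),
})

# Both spellings resolve to the same entry.
print(cset["2018_UL/tau_energy_scale"])
print(cset["2018_UL"]["tau_energy_scale"])
```

Because the name check runs at construction time, a flat key can never be confused with a path, which is what a future query interface would need.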

IzaakWN · 2021-04-28T14:35:03Z

Just for the record: We discussed this issue in XPOG, and opted for "Option 4", which is using subdirectories in the XPOG GitLab repo like,

POG/TAU/2016preVFP_UL/tau.json
POG/TAU/2016postVFP_UL/tau.json
...
POG/TAU/2018_UL/tau.json

which is a fast and elegant alternative to nested CorrectionSet objects: https://indico.cern.ch/event/1019415/#6-common-json-format-tau-pog

However, I still like the idea of nested CorrectionSet as a feature in the future if we ever get around to it. :p
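For reference, a small sketch of how the subdirectory layout replaces the year key. It uses only the Python standard library, and the helper names `tau_file` and `load_corrections` are made up for illustration. The stub file contains just correction names, so it is not a valid correctionlib file; with the real library, the JSON reading step would be replaced by its own loader.

```python
import json
import tempfile
from pathlib import Path


def tau_file(root: Path, era: str) -> Path:
    """Return the per-era file path in the assumed POG/TAU/<era>/tau.json layout."""
    return root / "POG" / "TAU" / era / "tau.json"


def load_corrections(root: Path, era: str) -> dict:
    """Read one era's file and index its corrections by name."""
    with tau_file(root, era).open() as f:
        payload = json.load(f)
    return {corr["name"]: corr for corr in payload["corrections"]}


# Build a throwaway tree with a stub file so the example runs anywhere.
tmp = tempfile.TemporaryDirectory()
root = Path(tmp.name)
stub = {
    "schema_version": 2,
    "corrections": [{"name": "tau_energy_scale"}, {"name": "DeepTau2017v2p1VSjet"}],
}
path = tau_file(root, "2018_UL")
path.parent.mkdir(parents=True)
path.write_text(json.dumps(stub))

corrs = load_corrections(root, "2018_UL")
print(sorted(corrs))
tmp.cleanup()
```

The era is selected by the directory, so the set inside each file stays flat and only one lookup by correction name remains.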

nsmith- · 2021-04-28T14:42:33Z

Ok, it may well be that we find nested sets useful after gaining experience. But at least for now let's move it to a later schema version.

nsmith- added the enhancement (New feature or request) and schema (Issues related to the schema definition) labels Apr 8, 2021

nsmith- added this to the Schema v2 milestone Apr 8, 2021

nsmith- removed this from the Schema v2 milestone Apr 28, 2021
