-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cab specification 2.0 #637
Comments
First of all, as a thought experiment, I suggest we try to formulate this as a generic YML validator that can replace PyKwalify, so that we can use it for Caracal too. Things like the |
Agreed, but let's just do this. It will simplify the integration of stimela and caracal
In principle yes. |
Second, a major feature I would like to see is include tags, to include YML files into YML files. There are some straightforward implementations discussed here, we could just reuse them or cut and paste: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another Use cases for includes:
|
I volunteer to get a start on this, since I'm the one who seems to be accumulating the requirements... |
Here is a first draft. The include tags don't really work for what we want, but we can just have an include section in the definition schema: 2.0.0
task: casa_gaincal
base: stimela/casa:1.6.9
binary: giancal
description: Determine temporal gains from calibrator observations
prefix: null
include_inputs:
- ../misc/casagain.yml # this file contains parameters that are common between gain calibration tasks
include_outputs:
- ../misc/casagain.yml
inputs:
gaintype:
info: Type of gain solution
type: str
default: G
choices:
- G
- T
- GSPLINE
- K
- KCROSS
- XY+QU
- XYf+QU
splinetime:
info: Spline timescale(sec); All spw's are first averaged.
type: float
default: 3600.0
npointaver:
info: The phase-unwrapping algorithm
type: int
default: 3
phasewrap:
info: Wrap the phase for jumps greater than this value (degrees)
type: float
default: 180.0
smodel:
info: Point source Stokes parameters for source model.
type:
- List[str]
- List[float]
save_result:
info: Name of pickle file to save the task result into
type: str
outputs:
pickled_gains:
name: "{inputs.save_result.value}"
location: "{outdir}"
required: false |
OK, so you're suggesting that any But then why not do the same via an inputs:
include_yml:
- ../misc/casagain.yml
gaintype:
info: Type of gain solution
type: str Or perhaps (I still want to be able to pull specific sections in): inputs:
include_yml:
path: ../misc/casagain.yml
section: section_name Also, I would suggest supporting "{}"-substitutions inside included files. I.e. if inputs:
include_yml:
path: ../misc/casagain.yml
section: section_name
vars:
foo: 1
bar: 2 |
Yes
This is more versatile, especially if we want to use the same specification for caracal.
Agreed. I should've made this explicit in my previous comment, but you can also substitute some stimela variables like |
Yeah, I only saw that after I posted. That's a very useful feature (see e.g. CubiCal's default --j-save-to settings). BTW, if you put |
Also better to specify positional parameters in the YAML inputs:
gaintype:
positional: false # <----
info: Type of gain solution
type: str |
@paoloserra maybe you want to have some input here as we are thinking of also replacing pykwalify in caracal with this specification. |
@SpheMakh I've had a look but I'm not sure what exactly (if anything) we would need to change in the caracal schema files. |
Well the schema should be backwards-compatible for the most part, but it should allow for more options and flexibility. What sort of limitations were we running into with the caracal schemas? Personally, I think the Off the top of my head I can think of permitting a true |
So I'm not aware of significant limitations. We had to use |
Something that often takes too much time when making a new worker is writing the parameters, which are already defined in stimela. With this, we can just use the |
I'm probably missing something, but CARACal has its own parameter names and default values, which do not always coincide with those in Stimela. And anyway, very few parameters are actually required in CARACal, most workers can be added to a pipeline run adding very few parameters to the CARACal config file. This is shown by https://github.com/caracal-pipeline/caracal/blob/master/caracal/sample_configurations/minimalConfig.yml . So I think we need to be careful about transferring Stimela par names / default values to CARACal. |
I think for future packages (and perhaps Caracal R2.0), it would make a lot of sense to lean on include heavily. Some of the current implementations I find annoying, confusing and error-prone. For example, it would be a lot "cleaner" if wsclean settings were their own subsection in selfcal, something like: selfcal:
imager: wsclean
img_wsclean:
include_yml:
path: stimela/wsclean.yml
section: parameters
robust: -1 # overriding whatever was pulled from stimela This way Caracal could be free to have different and perhaps more sensible defaults (and the {}-subst mechanism would allow for various games with the remapping of parameter names), but at the same time there would be a consistent 1:1:1 mapping between wsclean parameter names, Stimela parameter names, and settings in the appropriate subsection, which makes life easier for everyone, no? (CubiCal and the coming QuartiCal is another package-of-100-parameters that could benefit from a consistent naming mechanism...) Another thing that flag:
include_yml:
path: caracal/meerkat-flagging.yml
section: aggresive_4k_flagging
inspect:
include_yml:
path: caracal/meerkat-inspect.yml
section: detailed_1gc_plots Not saying things should literally be implemented like this -- just throwing food for thought out there. |
I don't know about parameter name matching. Stimela tends to stick to the native parameter names. CARACal does not need to. In fact, we've changed most of the parameter names in order to have some internal CARACal consistency. So I'd say a Stimela-Caracal match is never going to happen. I fully agree that the CARACal Overall there's plenty to improve in CARACal, and it's good to think of this for a 2.0 release. But I also think CARACal should target users who know nothing about Stimela rather than Stimela developers. So parameter names, nesting etc etc should not follow Stimela's unless that's what makes more sense for a CARACal user. |
Well, it's not about following Stimela naming. Rather, for complicated packages such as WSClean, DDF and CubiCal, I'd like to (ideally) see a system where Stimela names are consistent with the native package parameter names, and there's a way to import this schema into Caracal "as is". Ideally, users who know WSClean should be able to piss out Caracal configs at will without having to look up what we decided to rename each parameter to. If you're going to give the user rope, make it familiar rope. This does not have to preclude a simple and self-consistent subset of parameters to buttress the Caracal newbie. |
I also lean towards keeping the names the same as the packages, but this will be optional. We can have an optional selfcal:
imager: wsclean
img_wsclean:
include_yml:
path: stimela/wsclean.yml
section: parameters
pixel:
info: Image size in pixels
overrides: size # <--------------
.
.
.
|
Thinking about this a bit more, it would be annoying to have to change standard imaging parameters names like |
But that divergence can be accomplished via override and {}-substitutions. E.g.: selfcal:
cellsize:
...
img_wsclean:
include_yml:
path: stimela/wsclean.yml
section: parameters
scale: # "scale" is a wsclean-specific paramater
default: {selfcal.cellsize} # ...which we set from a Caracal parameter by default (but the end user can still override)
|
Yes, that was one of the reasons. |
Actually, I don't think we need to reinvent the wheel here. Both includes and substitutions (of a sort) are already standard YaML features. See merge and anchors, https://yaml.org/type/merge.html. The somewhat annoying thing is that anchors have to be explicitly set with "&" at point of origin. And it knows nothing about including files -- the YaML code in the file has to be loaded in the stream before. So the example above would have to look something like selfcal:
cellsize: &selfcal_cellsize
...
img_wsclean:
<<: *stimela__cabs__wsclean__parameters
scale:
default: *selfcal_cellsize (NB, this is cartoon code, it can't actually work like that, but I don't want to cloud the picture with details...) There may be a way to automatically set anchor names if we use
In principle, this is a generic pattern to support a layer of config files Y that can reference sections from a layer of config files X via automatically-generated anchor names. You can keep on adding layers, the only limitation is that each layer can reference auto-generated anchors from the previous layers, but not from within itself. |
This sounds complicated, but I think I can make a simple proof-of-concept on am experimental Stimela branch. I think Stimela itself needs this: there's currently a lot of repetition in the cabs parameters (for example, the entire casa_xxx family has many common parameters) which could be eliminated by defining common stuff in config files under |
OK, I got tired of fighting Caracal and data and flags, so I spent a day or so hacking on a couple of proofs of concept. Checked it into the Approach 1: anchors and aliasesThis uses anchors and aliases, as discussed above, plus some extra code to automatically set anchors, and merge YML configs. See here, and run as I've made example Pros: the syntax is nice and clean, and is a standard YML feature, e.g.: inputs:
<<: *stimela-base-casa-mssel_inputs
docallib:
info: "Use callib or traditional cal apply parameters"
type: bool
default: false Cons: the functionality is limited, and it all feels a bit klunky. Also, Approach 2: use OmegaConfThis is a neat new package, and provides almost all the features we need (plus it's evolving quickly, so will probably have the rest soon). It's a lot more powerful than that: you can specify a schema programmatically using the standard Python The Pros: powerful and elegant schemas, explicitly designed for the situation where a config is spread across many files and we want to merge them all together, built-in interpolation ( Cons: some schema errors have been utterly incomprehensible (if you thought pykwalify was bad...) It is also an option to use omegaconf without the schema mechanism (essentially, it's then a fancy YML reader providing variable interpolation), and then on top of that use something like pykwalify to process and validate the config afterwards. |
After a discussion with the developer, I realized this is something relatively simple to bolt on externally, so I checked in a modified demo. This works and looks quite nice now: inputs:
_use:
- base.casa.params.mssel_inputs
docallib:
info: "Use callib or traditional cal apply parameters"
type: bool
default: false I think this sort of thing would be a very nice addition to Caracal R2. I'm finding my overall collection of config files to be very repetitive and error-prone. I'd really like to develop a library of standard config snippets, which could be pulled in piecewise, and then only the necessary parameters modified. Something like: crosscal:
_use: lib.meerkat.crosscal.4k
secondary:
order: KGIAKF
inspect:
_use: lib.meerkat.plots.cal.phaseballs, lib.meerkat.plots.cal.spectra, lib.meerkat.plots.cal.detailed
shadems:
plots:
my_extra_plots:
- some_extra_plots P.S. This needs omegaconf master to run, as it uses some features from the upcoming 2.1 release. |
@o-smirnov You mention pykwalify vs typing. I was wondering if overloading a parameter in a schema, by which I mean allowing it to be multiple types, e.g.:
all allowed, but not a float, a list of floats, a list of n+1 or n-1 strings etc. |
Good question, but it's turning into a Caracal discussion, so let's discuss in in caracal-pipeline/caracal#1303 rather. |
I am not particularly concerned where a validation takes place, so this is good for me, but in principle this could (should?) happen as early as possible, i.e. Stimela (where the question turned up for me). |
I go for approach 2. OmegaConf is exactly what we need |
Yes, this is the direction we should move. For example, the casa base file I use is below. Let me incorporate the mergeConf shindig to my branch vis:
type: Directory
info: Name of input visibility file
default: null
required: true
positional: false
prefix: null
field:
info: Select field using field id(s) or field name(s)
type: str
default: null
spw:
info: Select spectral window/channels
type : str
default: null
selectdata:
info: Other data selection parameters
type: bool
default: true
timerange:
info: Select data based on time range
type: str
default: null
uvrange:
info": Select data within uvrange (default units meters)
dtype: str
default: null
antenna:
info: Select data based on antenna/baseline
type: str
default: null
scan:
info: Scan number range
type: str
default: null
observation:
info: Select by observation ID(s)
type: str
default: null
msselect:
info: Optional complex data selection (ignore for now)
type: str
default: null |
That's almost word-for-word what I wrote for the omegaconf demo: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/cargo/base/casa/casa.yaml |
OK then let's start a branch and get moving. Setting up the infrastructure is very little work, I think most of it is already in my demo. Porting all the Implementing #692 will also be pretty straightforward then. |
Let's continue in your configuratt branch |
OK cool, well I moved all the generic config merging infrastructure into the configuratt module, the init now is as simple as it can be: https://github.com/ratt-ru/Stimela/blob/configuratt/stimela/configuratt/demo_omega.py#L109. If you check the schemas and start working on the image and cab ymls, I volunteer to work on #692. Also, #598 is naturally merged into this... |
This is probably better served by a discussion thread: #696 |
And we should include #598 in these deliberations. |
The cab specification has a few obvious flaws. These ones are particularly bad:
parameters
list was specified as:The text was updated successfully, but these errors were encountered: