[Design] Enable Pointer logic for input/output #80

Closed
Freymaurer opened this issue Jun 6, 2023 · 7 comments

Comments

@Freymaurer
Contributor

@HLWeil has this written out in his fork of ISA specification.

@Freymaurer
Contributor Author

ISA-tools/isa-specs#15

@HLWeil
Member

HLWeil commented Jun 9, 2023

Selection of the selectors

As stated in the PR linked above, the selectors defined by the W3C could be a standardized solution.
In principle, two sets of selectors are defined there:

  1. Fragment selectors, which are the selectors used in URIs and IRIs (the part of the address after the #)

| Name | Fragment Specification | Description |
| --- | --- | --- |
| HTML | http://tools.ietf.org/rfc/rfc3236 [rfc3236] | Example: namedSection |
| PDF | http://tools.ietf.org/rfc/rfc3778 [rfc3778] | Example: page=10&viewrect=50,50,640,480 |
| Plain Text | http://tools.ietf.org/rfc/rfc5147 [rfc5147] | Example: char=0,10 |
| XML | http://tools.ietf.org/rfc/rfc3023 [rfc3023] | Example: xpointer(/a/b/c) |
| RDF/XML | http://tools.ietf.org/rfc/rfc3870 [rfc3870] | Example: namedResource |
| CSV | http://tools.ietf.org/rfc/rfc7111 [rfc7111] | Example: row=5-7 |
| Media | http://www.w3.org/TR/media-frags/ [media-frags] | Example: xywh=50,50,640,480 |
| SVG | http://www.w3.org/TR/SVG/ [SVG11] | Example: svgView(viewBox(50,50,640,480)) |
| EPUB3 | http://www.idpf.org/epub/linking/cfi/epub-cfi.html [cfi] | Example: epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10) |

  2. Various other selectors

Fragment selectors are naturally part of an IRI, so they are all defined to have a single string syntax. The other selectors found in the document can have a more complex syntax with different fields.
As we will have to include the selector in our tabular view (e.g. isa.assay.xlsx file), it would definitely be easier to stick to those which can be described using a single string (So we don't need different alternative columns).
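
To illustrate why a single-string selector fits a tabular view so well, here is a minimal sketch that splits a combined "file plus fragment" string into the file part and the selector part. The file path is made up for illustration, and `split_pointer` is a hypothetical helper name, not part of any ISA or ARC tooling.

```python
from urllib.parse import urldefrag


def split_pointer(pointer: str) -> tuple[str, str]:
    """Split 'path#fragment' into (path, fragment).

    The fragment is an empty string if the pointer has no '#'.
    """
    path, fragment = urldefrag(pointer)
    return path, fragment


# Hypothetical example path, used only for illustration.
print(split_pointer("assays/measurement1/dataset/result.csv#col=2"))
```

Because the selector is just the fragment string, it could equally be stored in its own cell next to the file name and joined with `#` when a full IRI is needed.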

Just from my personal experience, the most important data formats which follow some generic syntax are tabular and XML-based. These would be covered by the selectors shown above. For more complex formats which don't follow any generic logic, the plain text selector could be used. Unfortunately, there is no fragment selector for binary files. For this, the data position selector exists, which does not define a string-based representation. But maybe it could just be handled like the plain-text selector?

Maybe JSON might also be an important target. Not sure if XPointer logic might be recycled for this?

General list

This would lead to the following selection of selectors:

  • CSV (Tabular)
  • XML (via XPointer, maybe also JSON)
  • Plain-Text (for complex formats)
  • Data (for binaries, would be an extension of the specification)

CSV (Tabular) constraints

Unfortunately, the CSV fragment specification only defines index-based selection:

e.g.:

  • row=5-7
  • col=1-2
  • cell=4,1-6,2
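
As a rough sketch of how the three index-based forms above could be read by a tool, the following parser turns a single selection into numeric bounds. It assumes 1-based, inclusive indices and only handles one selection per fragment; `parse_csv_fragment` and the returned dictionary layout are made up for illustration.

```python
import re


def parse_csv_fragment(fragment: str) -> dict:
    """Parse one 'row=', 'col=' or 'cell=' selection into inclusive bounds."""
    kind, _, value = fragment.partition("=")
    if kind in ("row", "col"):
        # Either a single index ("5") or a range ("5-7").
        m = re.fullmatch(r"(\d+)(?:-(\d+))?", value)
        if not m:
            raise ValueError(f"unsupported {kind} selection: {value!r}")
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        return {"kind": kind, "start": start, "end": end}
    if kind == "cell":
        # "row,col" or "row,col-row,col".
        m = re.fullmatch(r"(\d+),(\d+)(?:-(\d+),(\d+))?", value)
        if not m:
            raise ValueError(f"unsupported cell selection: {value!r}")
        r1, c1 = int(m.group(1)), int(m.group(2))
        r2 = int(m.group(3)) if m.group(3) else r1
        c2 = int(m.group(4)) if m.group(4) else c1
        return {"kind": "cell", "rows": (r1, r2), "cols": (c1, c2)}
    raise ValueError(f"unknown selector kind: {kind!r}")


print(parse_csv_fragment("cell=4,1-6,2"))
```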

In many cases, we would probably want to annotate columns and rows using the key strings.

e.g.:

  • row="MySample1"

This would require extending the specification linked above, and documenting the extension thoroughly and accessibly.
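
One way such a key-based selector could be handled is to resolve it to an index-based one when the file is read. The sketch below assumes a hypothetical `row="key"` / `col="key"` syntax (not defined in the CSV fragment specification), takes column keys from the first row and row keys from the first column, and counts the header row as row 1. `resolve_named_selector` is an invented name.

```python
import csv
import io


def resolve_named_selector(fragment: str, csv_text: str) -> str:
    """Translate a hypothetical row="key" / col="key" selector to an index."""
    kind, _, value = fragment.partition("=")
    is_quoted = len(value) >= 2 and value[0] == value[-1] == '"'
    if kind not in ("row", "col") or not is_quoted:
        return fragment  # already index-based (or not handled here)
    key = value[1:-1]
    rows = list(csv.reader(io.StringIO(csv_text)))
    if kind == "col":
        keys = rows[0]  # column keys: the header row
    else:
        keys = [r[0] if r else "" for r in rows]  # row keys: first column
    # list.index raises ValueError if the key is missing.
    return f"{kind}={keys.index(key) + 1}"


table = "Sample,Intensity\nMySample1,10\nMySample2,20\n"
print(resolve_named_selector('row="MySample1"', table))
```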

@HLWeil
Member

HLWeil commented Jun 9, 2023

Format

Recently, we decided on denoting the input and output columns using the Input [Type] and Output [Type] syntax. So in an assay where Raw Data is the output, we would have the column Output [Raw Data].

With the addition of selectors, this could be further qualified with an Output Selector column. E.g. in the case of tabular data, we could use the CSV fragment selector to select a specific column.

This might then look like this:
image

A thing to consider is whether to make a term out of the selector. The problem here is that, in contrast to the ISA headers, the value will always be a string. Kinda odd:
image

@HLWeil
Member

HLWeil commented Jun 9, 2023

Any input would be welcome, @Freymaurer, @omaus, @kMutagene, @muehlhaus, @ZimmerD, @kappe-c, @Brilator

@ZimmerD

ZimmerD commented Jun 12, 2023

I like the proposed solution, the use of the standard defined by W3 seems absolutely reasonable.

I think a typical use case from a data scientist's perspective is to read a file using a data frame library and then group single columns by meta-data information, e.g. grouping the columns of the same biological replicate group to compute the mean of the biological replicates.

For this, it would be nice to retrieve the meta-data based on the column name rather than the column index. I think it would be really convenient to use the "Label" column of the current dataset.xlsx draft to allow a mapping between the column header and the selector specified in the isa-sheet.
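
A small sketch of that use case, in plain Python rather than a specific data frame library: given a mapping from column header to replicate group (which is what the metadata would provide), the columns of each group are averaged per row. The column names, group names, and `mean_by_group` are all invented for the example.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(header, rows, label_to_group):
    """For every data row, average the columns that share a group."""
    cols_by_group = defaultdict(list)
    for idx, label in enumerate(header):
        if label in label_to_group:
            cols_by_group[label_to_group[label]].append(idx)
    return [
        {group: mean(row[i] for i in idxs) for group, idxs in cols_by_group.items()}
        for row in rows
    ]


# Hypothetical column headers and replicate groups.
header = ["WT_1", "WT_2", "KO_1", "KO_2"]
rows = [[1.0, 3.0, 10.0, 20.0]]
groups = {"WT_1": "WT", "WT_2": "WT", "KO_1": "KO", "KO_2": "KO"}
print(mean_by_group(header, rows, groups))
```

With index-only selectors, the `groups` mapping would have to be keyed by column position instead, which breaks as soon as columns are reordered.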

@kappe-c

kappe-c commented Jun 12, 2023

Thanks for the good work. Comments:

  1. (Also regarding the Data object extension, ISA-tools/isa-specs#15.) Does a selector really need a type? It only makes sense with a data file which is given by a filename. I think it would be great if in the ISA/ARC world we could trust the file extension to imply a certain data file type (e.g. .csv, .tab, etc. -> tabular) and thus a fitting selector. Being safe and generic and future-proof is great, but I also like to Keep It Short and Simple.
  2. Taking inspiration from the W3 seems like a very good idea. But I, too, wonder how close we should actually stick to it. See the following.
  3. To pick up Lukas' example, named rows and columns are certainly convenient for human readers and writers of our metadata model.
  4. The W3C is, I think, more concerned with text (maybe image) documents; their examples include stuff like selecting paragraphs of text, while we are more concerned with (numeric/abstract) data. Binary files are one important element here, especially if ISA/ARC finds users outside of biology.
  5. For tabular data (some might say data defined on a grid) we probably want to extend row/column selectors to something n-dimensional, for an arbitrary integer n. An example for n=3 could be a list of tables like in a .xlsx or .ods file. (Here, again, giving the spreadsheet by name instead of numeric index is probably desirable.)
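
Point 1 could look roughly like the following: the selector type is not stored at all but inferred from the file extension, with a fallback for everything unknown. The extension table, the type names, and `infer_selector_type` are assumptions made for this sketch, not an agreed mapping.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to selector type.
SELECTOR_BY_EXTENSION = {
    ".csv": "csv",
    ".tsv": "csv",
    ".tab": "csv",
    ".xml": "xml",
    ".txt": "text",
}


def infer_selector_type(filename: str, default: str = "data") -> str:
    """Guess the selector type from the extension; fall back to `default`."""
    suffix = PurePosixPath(filename).suffix.lower()
    return SELECTOR_BY_EXTENSION.get(suffix, default)


print(infer_selector_type("dataset/result.csv"))
print(infer_selector_type("dataset/run1.bin"))
```

The trade-off is the one raised above: this stays short and simple, but it only works as long as file extensions can be trusted.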

@Freymaurer Freymaurer changed the title [Feature Request] Enable Pointer logic for input/output [Design] Enable Pointer logic for input/output Jun 14, 2023
@HLWeil HLWeil transferred this issue from nfdi4plants/ARCtrl Dec 4, 2023
@Freymaurer
Contributor Author

Two more ideas on how to do pointers in ArcTable:

  1. Handle the selector as a unit, with the selector string as value. I used "pseudo" selectors, as I did not want to write real ones.

| Input | Parameter [mean] | Unit | TSR | TAN | Parameter [quantity] | Unit | TSR | TAN | OUTPUT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| File1 | col=1 | csv selector | DPBO | DPBO:1 | col=2 | csv selector | DPBO | DPBO:1 | Maxquant |
| File2 | col=3 | csv selector | DPBO | DPBO:1 | col=3 | csv selector | DPBO | DPBO:1 | Maxquant |

  2. Row major approach, where the selector could possibly replace the Output column

image
