[Design] Enable Pointer logic for input/output #80

Freymaurer · 2023-06-06T07:47:41Z

@HLWeil has this written out in his fork of the ISA specification.

Freymaurer · 2023-06-06T07:51:19Z

HLWeil · 2023-06-09T12:23:12Z

Selection of the selectors

As stated in the PR linked above, the selectors defined by W3 could be a standardized solution.
In principle, two sets of selectors are defined there:

Fragment selectors, which are the selectors used in URIs and IRIs (the part of the address after the #)

Name	Fragment Specification	Description
HTML	http://tools.ietf.org/rfc/rfc3236	[rfc3236] Example: namedSection
PDF	http://tools.ietf.org/rfc/rfc3778	[rfc3778] Example: page=10&viewrect=50,50,640,480
Plain Text	http://tools.ietf.org/rfc/rfc5147	[rfc5147] Example: char=0,10
XML	http://tools.ietf.org/rfc/rfc3023	[rfc3023] Example: xpointer(/a/b/c)
RDF/XML	http://tools.ietf.org/rfc/rfc3870	[rfc3870] Example: namedResource
CSV	http://tools.ietf.org/rfc/rfc7111	[rfc7111] Example: row=5-7
Media	http://www.w3.org/TR/media-frags/	[media-frags] Example: xywh=50,50,640,480
SVG	http://www.w3.org/TR/SVG/	[SVG11] Example: svgView(viewBox(50,50,640,480))
EPUB3	http://www.idpf.org/epub/linking/cfi/epub-cfi.html	[cfi] Example: epubcfi(/6/4[chap01ref]!/4[body01]/10[para05]/3:10)
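A minimal sketch of how such a fragment selector could be separated from the file reference, using only Python's standard library. The function name `split_selector` and the example path are assumptions for illustration, not part of any ISA or ARC tooling.

```python
from urllib.parse import urldefrag


def split_selector(iri: str) -> tuple[str, str]:
    """Split an IRI into (resource, fragment selector).

    The fragment is everything after the first '#'; it is returned
    as an empty string when the IRI carries no fragment.
    """
    resource, fragment = urldefrag(iri)
    return resource, fragment


# Hypothetical file reference carrying a CSV fragment selector.
resource, selector = split_selector("dataset/data.csv#row=5-7")
print(resource)  # dataset/data.csv
print(selector)  # row=5-7
```

Because every fragment selector in the table is a single string, the same split works regardless of the file type; only the interpretation of the fragment differs.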

Various other selectors

Fragment selectors are naturally part of an IRI, so they are all defined to have a single string syntax. The other selectors found in the document can have a more complex syntax with different fields.
As we will have to include the selector in our tabular view (e.g. isa.assay.xlsx file), it would definitely be easier to stick to those which can be described using a single string (So we don't need different alternative columns).

Just from my personal experience, the most important data formats which follow some generic syntax are tabular and XML-based. These would be covered by the selectors shown above. For more complex formats which don't follow any generic logic, the plain text selector could be used. Unfortunately, there is no fragment selector for binary files. For this, the data position selector exists, but it does not define a string-based representation. Maybe it could just be handled like the plain-text selector?

JSON might also be an important target. Not sure whether XPointer logic could be reused for this.

General list

This would lead to the following selection of selectors:

CSV (Tabular)
XML (via XPointer, maybe also json)
Plain-Text (for complex formats)
Data (for binaries, would be an extension of the specification)

CSV (Tabular) constraints

Unfortunately, the CSV fragment specification only defines index-based selection:

e.g.:

row=5-7
col=1-2
cell=4,1-6,2
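To make the index-based syntax concrete, here is a rough sketch of a parser for the three forms shown above. It is deliberately incomplete: it handles a single `row=`, `col=` or `cell=` selection with an optional range and ignores other features of RFC 7111 such as multiple selections. `CsvSelection` and `parse_csv_fragment` are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CsvSelection:
    kind: str                # "row", "col" or "cell"
    start: tuple[int, ...]   # (index,) for row/col, (row, col) for cell
    end: tuple[int, ...]     # equals start when no range is given


_POSITION = r"\d+(?:,\d+)?"
_PATTERN = re.compile(rf"^(row|col|cell)=({_POSITION})(?:-({_POSITION}))?$")


def parse_csv_fragment(fragment: str) -> CsvSelection:
    """Parse a single index-based CSV fragment such as 'row=5-7'."""
    match = _PATTERN.match(fragment)
    if match is None:
        raise ValueError(f"unsupported CSV fragment: {fragment!r}")
    kind, first, last = match.groups()
    start = tuple(int(part) for part in first.split(","))
    end = tuple(int(part) for part in (last or first).split(","))
    # Cells are addressed as row,col; rows and columns by one index.
    expected = 2 if kind == "cell" else 1
    if len(start) != expected or len(end) != expected:
        raise ValueError(f"malformed {kind} selection: {fragment!r}")
    return CsvSelection(kind, start, end)


print(parse_csv_fragment("row=5-7"))
print(parse_csv_fragment("cell=4,1-6,2"))
```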

In many cases, we would probably want to annotate columns and rows using the key strings.

e.g.:

row="MySample1"

This would require extending the specification linked above with thorough and accessible information.
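One way to picture such an extension is to resolve a key string to an index-based fragment at read time. The sketch below assumes that the row key sits in the first column and that rows are counted from 1 with the header row included; both are assumptions, and `resolve_named_row` is a hypothetical helper, not something the CSV fragment specification defines.

```python
import csv
import io


def resolve_named_row(csv_text: str, key: str) -> str:
    """Translate a row key into an index-based 'row=N' fragment.

    Assumption: the key is the value of the first column, and rows
    are numbered from 1 with the header row counted as row 1.
    """
    reader = csv.reader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        if row and row[0] == key:
            return f"row={index}"
    raise KeyError(key)


example = "Sample,Value\nMySample1,3.2\nMySample2,4.1\n"
print(resolve_named_row(example, "MySample1"))  # row=2
```

A name-based selector stays valid when rows are reordered, which an index-based one does not; the cost is that keys must be unique within the file.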

HLWeil · 2023-06-09T12:54:47Z

Format

Recently, we decided on denoting the input and output columns using the Input [Type] and Output [Type] syntax. So in an assay where Raw Data is the output, we would have the column Output [Raw Data].

With the addition of selectors, this could be qualified further with an Output Selector column. E.g. in the case of tabular data, we could use the CSV fragment selector to select a specific column:

This might then look like this:

One thing to consider is whether to make a term out of the selector. The problem here is that, in contrast to the ISA headers, the value will always be a string. Kinda odd:

HLWeil · 2023-06-09T12:56:39Z

Any input would be welcome, @Freymaurer, @omaus, @kMutagene, @muehlhaus, @ZimmerD, @kappe-c, @Brilator

ZimmerD · 2023-06-12T10:13:15Z

I like the proposed solution, the use of the standard defined by W3 seems absolutely reasonable.

I think a typical use case from a data scientist's perspective is to read a file using a data frame library and then group single columns by metadata information, e.g. grouping the columns of the same biological replicate group to compute the mean of the biological replicates.

For this, it would be nice to retrieve the meta-data based on the column name rather than the column index. I think it would be really convenient to use the "Label" column of the current dataset.xlsx draft to allow a mapping between the column header and the selector specified in the isa-sheet.
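A rough sketch of that use case in plain Python, without a data frame library. The annotation entries, their field names (`label`, `selector`, `group`) and the sample values are all made up for illustration; they are not taken from the dataset.xlsx draft.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical annotations: one entry per data column, mapping the
# column header ("label") to its selector and its replicate group.
annotations = [
    {"label": "WT_rep1", "selector": "col=2", "group": "WT"},
    {"label": "WT_rep2", "selector": "col=3", "group": "WT"},
    {"label": "KO_rep1", "selector": "col=4", "group": "KO"},
]

# Hypothetical data file content, already read into columns by header.
table = {
    "WT_rep1": [1.0, 2.0],
    "WT_rep2": [3.0, 4.0],
    "KO_rep1": [5.0, 6.0],
}

# Lookup from column header to the selector given in the annotation.
selector_by_label = {entry["label"]: entry["selector"] for entry in annotations}


def group_means(table, annotations):
    """Average the columns of each replicate group row by row."""
    columns_by_group = defaultdict(list)
    for entry in annotations:
        columns_by_group[entry["group"]].append(table[entry["label"]])
    return {
        group: [mean(values) for values in zip(*columns)]
        for group, columns in columns_by_group.items()
    }


print(selector_by_label["WT_rep2"])     # col=3
print(group_means(table, annotations))  # {'WT': [2.0, 3.0], 'KO': [5.0, 6.0]}
```

The lookup by header is what the label makes possible: the analysis code never needs to know the column index.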

kappe-c · 2023-06-12T10:52:03Z

Thanks for the good work. Comments:

(Also regarding Data object extension ISA-tools/isa-specs#15 .) Does a selector really need a type? It only makes sense with a data file which is given by a filename. I think it would be great if in the ISA/ARC world we could trust the file extension to imply a certain data file type (e.g. .csv, .tab, etc. -> tabular) and thus a fitting selector. Being safe and generic and future-proof is great but I also like to Keep It Short and Simple.
Taking inspiration from the W3 seems like a very good idea. But I, too, wonder how closely we should actually stick to it. See the following.
To pick up Lukas' example, named rows and columns are certainly convenient for human readers and writers of our metadata model.
The W3 is – I think – more concerned with text (maybe image) documents; their examples include things like selecting paragraphs of text, while we are more concerned with (numeric/abstract) data. Binary files are one important element here, especially if ISA/ARC finds applications outside of biology.
For tabular data (some might say data defined on a grid) we probably want to extend row/column selectors to something n-dimensional, for an arbitrary integer n. An example for n=3 could be a list of tables like in a .xlsx or .ods file. (Here, again, giving the spreadsheet by name instead of numeric index is probably desirable.)
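The first point, trusting the file extension to imply the selector type, could look roughly like this. The mapping and the fallback to the plain-text selector are assumptions for illustration only; nothing here is prescribed by the ISA or ARC specifications.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to selector kind.
SELECTOR_BY_EXTENSION = {
    ".csv": "csv",
    ".tsv": "csv",
    ".tab": "csv",
    ".xml": "xml",
    ".txt": "text",
}


def infer_selector_kind(filename: str, default: str = "text") -> str:
    """Guess the selector kind from the file extension.

    Unknown extensions fall back to `default`; using the plain-text
    selector as the fallback is an assumption of this sketch.
    """
    suffix = PurePosixPath(filename).suffix.lower()
    return SELECTOR_BY_EXTENSION.get(suffix, default)


print(infer_selector_kind("dataset/proteins.tab"))  # csv
print(infer_selector_kind("dataset/spectra.raw"))   # text
```

With such a rule, a separate selector type column would only be needed for files whose extension is missing or ambiguous.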

Freymaurer · 2024-01-25T10:55:56Z

Two more ideas on how to do pointers in ArcTable:

Handle the selector as a unit with the selector string as the value. I used "pseudo" selectors, as I did not want to write real ones.

Input	Parameter [mean]	Unit	TSR	TAN	Parameter [quantity]	Unit	TSR	TAN	OUTPUT
File1	col=1	csv selector	DPBO	DPBO:1	col=2	csv selector	DPBO	DPBO:1	Maxquant
File2	col=3	csv selector	DPBO	DPBO:1	col=3	csv selector	DPBO	DPBO:1	Maxquant

Row major approach, where selector could possibly replace the Output column

Freymaurer changed the title ~~[Feature Request] Enable Pointer logic for input/output~~ [Design] Enable Pointer logic for input/output Jun 14, 2023

HLWeil transferred this issue from nfdi4plants/ARCtrl Dec 4, 2023

HLWeil mentioned this issue Jan 24, 2024

Rework Data Nodes #93

Merged

Freymaurer mentioned this issue Jan 25, 2024

[Feature Request] Add pointer logic nfdi4plants/ARCtrl#300

Closed

HLWeil mentioned this issue Jun 5, 2024

Datamap specification #104

Merged

kMutagene added this to ARCStack Jun 5, 2024

kMutagene added this to the ARC-specification v2.0.0 milestone Jun 6, 2024

HLWeil closed this as completed Jun 6, 2024

github-project-automation bot moved this to Done in ARCStack Jun 6, 2024
