Empower Users to Build More Kinds of Collections, More Intelligently #19377

jmchilton · 2025-01-06T19:51:05Z

Highlights

Support mixed paired and unpaired collections. New collection type single_or_paired #6063
Support creating more kinds of collections (without requiring rules)
Simpler collection creation - just pick datasets and Galaxy will decide what kind of collection to create and set everything up.
New Rule Based Import Activity.
Collection semantics documentation.
Record types - support heterogeneous tuples of datasets.
Abstractions for adapting various model objects to collection inputs. Implement Abstraction for Adapting Various Model Objects to Tool Collection Inputs #19359
Fixing extension handling in the existing list builder. List creation misses remove file extension option #9497

Mixed paired and unpaired collections.

The semantics are discussed in the collection semantics documentation included in this PR.

Scientists running tools and workflows can create lists of mixed paired and unpaired data using new list builder functionality, but it is marked as advanced and not actively encouraged because my understanding is that this is a less common structure for an analysis. For tool developers, however, I think the motivation is much clearer. Tools that consume lists of paired and unpaired datasets can also be passed the homogeneous versions of these lists (list, list:paired) in workflows by users. This should enable workflows that naturally support both paired and unpaired data without the use of conditionals, etc.

The PR includes collection builder UI, workflow editor UI, API, and tool support for this collection type. The PR includes example tools with tool tests. The PR also includes a collection operation tool that converts a list:paired_or_unpaired collection into two homogeneous lists - one of type list and one of type list:paired. If users do create collections in this fashion, the collection operation should serve as an adapter that lets them leverage existing tools and workflows that expect homogeneous lists.
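To make the splitting operation concrete, here is a minimal Python sketch of the idea. The `Element` class and `split_mixed` function are hypothetical names for illustration only, not Galaxy's model classes or the tool's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    """Hypothetical stand-in for one element of a list:paired_or_unpaired."""

    identifier: str
    forward: Optional[str] = None
    reverse: Optional[str] = None
    unpaired: Optional[str] = None

    @property
    def is_paired(self) -> bool:
        return self.forward is not None and self.reverse is not None


def split_mixed(elements):
    """Return (unpaired_list, paired_list), preserving element order."""
    unpaired = [(e.identifier, e.unpaired) for e in elements if not e.is_paired]
    paired = [(e.identifier, (e.forward, e.reverse)) for e in elements if e.is_paired]
    return unpaired, paired
```

The first result plays the role of the plain list and the second the role of the list:paired collection.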

More General Collection Building

This PR replaces the PairedListCollectionCreator with a generalization that I believe is easier to use, easier to understand, and can build more kinds of collections - including the new lists of paired and unpaired (list:paired_or_unpaired) collections but also nested lists (list:list) and nested lists of pairs (list:list:paired).

This separates the auto-pairing from the manual curation and I think as a result it is much easier to document and to grok without documentation.

All three of these collection types are now registered with the form code so they can be used to create collections using the work of @ahmedhamidawan in #18857.

Collection Builders - Wizard

Supporting new advanced options in the UI is great, but we cannot stick all those options in the history multi-select drop down and provide any sort of context for what they do or which options a researcher should pick. This was a good problem to have because I think having to make a choice at that point was already hard. Galaxy can look at the data and tell if it is paired (I think), so why does that need to be a choice? For this reason there are now two options in the history selection drop down: “Auto Build List” and “Advanced Build List”.

Both of these launch the same wizard component in the center panel, but auto places you on the last page with all the options selected and advanced places you on the first page with advanced options expanded. The wizard will check the supplied data and give a paired summary to manually tweak if it looks paired, and just land you in the list builder if it does not.
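As a rough illustration of how a "does this data look paired?" check could work, here is a minimal Python sketch. The suffix convention (`_1`/`_2`, `_R1`/`_R2`), the extension list, and all function names are assumptions made for the example. The wizard's real detection logic lives in the client and may differ.

```python
import re

# Assumed naming convention: mates end in _1/_2 or _R1/_R2 before the extension.
_MATE = re.compile(r"^(?P<base>.+?)[_.]R?(?P<mate>[12])$")
_EXTENSIONS = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def strip_extension(name):
    """Drop one known extension so 'a_R1.fastq.gz' compares as 'a_R1'."""
    for ext in _EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


def auto_pair(names):
    """Return (pairs, unpaired); pairs maps a base name to (forward, reverse)."""
    forward, reverse, unpaired = {}, {}, []
    for name in names:
        match = _MATE.match(strip_extension(name))
        if match is None:
            unpaired.append(name)
        elif match.group("mate") == "1":
            forward[match.group("base")] = name
        else:
            reverse[match.group("base")] = name
    pairs = {b: (forward[b], reverse[b]) for b in forward if b in reverse}
    # Mates without a partner are reported as unpaired.
    unpaired += [n for b, n in forward.items() if b not in reverse]
    unpaired += [n for b, n in reverse.items() if b not in forward]
    return pairs, unpaired


def looks_paired(names):
    """True when at least one pair was found and nothing was left over."""
    pairs, unpaired = auto_pair(names)
    return bool(pairs) and not unpaired
```

In this sketch a selection where `looks_paired` is true would go to the paired summary, and anything else would fall through to the plain list builder.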

Auto Builder Screenshots

The new dropdown option:

If your data look paired, you end up here and can just enter a name and go.

It has expanded help like all the other builders. The help will change based on the collection type being built.

If the data is unpaired, you land on the normal list builder.

If you go back to the first step you get this description of the basic options.

If you click the rule option, you get a builder embedded in the wizard instead of as its own modal as previously done.

There are tailored warnings for duplicate identifiers and un-paired datasets.
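A duplicate-identifier check of the kind behind such a warning can be sketched in a few lines of Python. The function name is hypothetical and this is not the builder's actual code, which is implemented in the client.

```python
from collections import Counter


def find_duplicate_identifiers(identifiers):
    """Return identifiers that occur more than once, in first-seen order."""
    counts = Counter(identifiers)
    return [identifier for identifier, count in counts.items() if count > 1]
```

An empty result means every element identifier is unique and the collection can be created without a warning.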

Advanced Builder Screenshots

The advanced builder option lands here instead and requires the user to select a type. It will still highlight list:paired if the data looks paired.

The configure pairing dialog is new here. It is ugly and could really use some help, but I think it is more intelligible than the old options and, more importantly, it is typically skipped. The simple version just skips this step and brings the user right to the final step. This is a continuation of my ideas from #19253 - hide the advanced options in the typical case.

New Rule Based Import Activity

My work on #19329 is included here and extended. Having more real estate in the central panel worked well for the list-creation wizard, so I thought doing the same for seeding the rule builder could work too. This PR adds an optional activity called “Rule Based Imports” that has a wizard for setting up rule builder uploads.

When using the rule builder from the upload form there is a very tortured upload dialog thing implemented in RulesInput.vue for seeding the rule builder with URIs and metadata for working with. The old component was very compact but also very unintuitive and just hacked together as quickly as possible and never really revisited. The wizard provides more space to explain things.

The new component also implements support for just dropping files containing lists or tables of URIs right onto the “Paste Table” form.

The old component:

Screenshots of the new activity-based approach:

The landing for the new activity. The activity is off by default but can be turned on by adding it when managing activities.

Select how to seed rules.

This version of the dialog now has the option to just drop files right into the box. That is a new feature.

Rather than having a modal close and a new modal open when done, this version just includes the rule builder right in the wizard, for a more consistent experience I think.

Collection semantics documentation

I think the semantics of how paired_or_unpaired works are easy to understand but the implementation is a bit tricky. As I was adding to our already numerous test cases for workflow connections and tool runtime stuff, I realized we had okay test coverage but it was hard to understand how it was organized or how to extend it for the new concepts. terminals.test.ts is just a long list of use cases without explanation, context, or organization. Our API tests are likewise disjointed. I implemented most of that and I still felt lost every time I wanted to go add new tests.

My solution grew organically as I developed this, but I think it is solid and can be extended in useful ways in the future. I have put together some plain English descriptions of how collection mapping and reductions work within workflows and tools in a document called collection_semantics.yml. This is implemented as Markdown doc elements in the YAML. The document is then parsed and rendered into Markdown and included in Galaxy’s source docs.

In between the doc elements, I placed mathematical descriptions of general principles described in the English text that I called “examples”. The examples can correspond to test cases. The test cases can be rendered and included in the docs but they are hidden beneath details collapsibles by default to let the rendered document read more cleanly.

Each example can also have test case references attached. Currently two kinds of test case references are supported: “tool_runtime” (existing tool API test cases) and “workflow_editor” (references to “it” clauses from terminals.test.ts). This serves as a really nice organizing document for our test coverage and makes it clear what still needs to be tested.
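The overall shape of this (doc elements interleaved with examples, examples rendered inside collapsibles, and test references used to spot coverage gaps) can be sketched as follows. The entry keys, the example label, and the test name are all hypothetical. The real collection_semantics.yml schema and renderer may look quite different, and the data here is held in memory rather than parsed from YAML.

```python
# Hypothetical entries; the real collection_semantics.yml may use other keys.
ENTRIES = [
    {"doc": "Mapping a tool with a dataset input over a list produces a list."},
    {
        "example": {
            "label": "MAP_OVER_LIST",
            "statement": "tool(i: dataset) mapped over list -> list",
            "tests": {"tool_runtime": ["test_map_over_list"]},
        }
    },
]


def render_markdown(entries):
    """Render docs as plain text and examples inside collapsed <details>."""
    lines = []
    for entry in entries:
        if "doc" in entry:
            lines.append(entry["doc"])
        else:
            example = entry["example"]
            lines += [
                f"<details><summary>Example: {example['label']}</summary>",
                "",
                example["statement"],
                "",
                "</details>",
            ]
        lines.append("")
    return "\n".join(lines)


def missing_coverage(entries, kind):
    """Labels of examples that have no test reference of the given kind."""
    return [
        e["example"]["label"]
        for e in entries
        if "example" in e and not e["example"].get("tests", {}).get(kind)
    ]
```

Calling `missing_coverage(ENTRIES, "workflow_editor")` lists the examples that still need a workflow editor test, which is the "what still needs to be tested" view described above.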

I’ve done several iterations of turning my math-y pseudo syntax for the test cases into modeled YAML. Each iteration allows me to render the examples in the Markdown better but it takes a lot of time and is sort of a niche activity. Ideally I would get to spend another few weeks to capture all the concepts in YAML, generate cool diagrams of data structures in graphviz, place this information in the UI somewhere, auto-build the test cases in terminals.test.ts, generate new API test cases, extend test cases to include workflow_runtime API tests, etc… But I have to fight falling into this very fun side project. The point is that we have important new documentation and a clear idea of what is covered and what still needs to be covered.

Record types - support heterogeneous tuples of datasets

There is not an included user story here the way there is for paired_or_unpaired collection types - this is laying the foundations for user applications and developer goodies in other PRs. My work on sample sheets became very blocked on Galaxy not having heterogeneous, tuple-style data structures for datasets. The database models support this usage but we didn’t have the tool usage, workflow inputs, collection plugin, etc…

One way to think about records is a generalization of paired collections. A paired collection is a record with field definitions of [{type: File, name: forward}, {type: File, name: reverse}] and paired_or_unpaired is sort of like [{type: File | null, name: forward}, {type: File | null, name: reverse}, {type: File | null, name: unpaired}] (sort of if you squint and ignore some details).
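The field-definition view of records can be sketched in Python as below. `Field`, `PAIRED`, `PAIRED_OR_UNPAIRED`, and `validate` are hypothetical names used only to illustrate the idea, not Galaxy's record implementation. The sketch also ignores the same details the comparison does, for instance that a paired_or_unpaired element holds either a forward/reverse pair or a single unpaired dataset, never an arbitrary mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical record field definition."""

    name: str
    type: str = "File"
    optional: bool = False


PAIRED = (Field("forward"), Field("reverse"))
PAIRED_OR_UNPAIRED = (
    Field("forward", optional=True),
    Field("reverse", optional=True),
    Field("unpaired", optional=True),
)


def validate(fields, element):
    """Return a list of problems; an empty list means the element conforms."""
    problems = []
    known = {field.name for field in fields}
    for field in fields:
        if element.get(field.name) is None and not field.optional:
            problems.append(f"missing required field '{field.name}'")
    for key in element:
        if key not in known:
            problems.append(f"unexpected field '{key}'")
    return problems
```

Other fixed combinations, such as two parents and a child, would just be different tuples of `Field` definitions checked by the same function.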

My sample sheet work went very far just assuming each row of the sample sheet corresponded to a single dataset. But workflow developers will want to collect metadata about pairs of datasets, optionally paired datasets, or other fixed combinations of files (two parents and a child, a control and a treatment, etc…). I think hardening these fixed-length collection types (records, paired_or_unpaired, paired) is needed to do a good job building sample sheets.

Record types are used extensively in the CWL branch and integrating them has been on the todo list for the better part of a decade anyway.

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mschatz · 2025-01-06T20:06:58Z

Yes please! This looks incredibly useful!

bernt-matthias · 2025-01-07T09:10:25Z

Mixed paired and unpaired collections

I'm worried that the tool form would be simplified (which might be great for the user) on the cost of complicating the command block.

Do you have an example of a tool where this would be of use? In particular where all the collection elements go to one CLI parameter (otherwise we need to split the collection elements again in paired and unpaired).

Docs are wonderful. Foldable examples seem to be a good idea, but the sentences after each of the examples seems to refer to the example itself, or?

mvdbeek · 2025-01-07T14:07:56Z

Do you have an example of a tool where this would be of use?

Any tool that can take paired or single inputs, so almost all aligners.

on the cost of complicating the command block.

You're replacing a conditional cheetah block with another if statement that checks the input type. I don't think that changes the complexity. See the included example:

    #if $f1.has_single_item:
        cat $f1.single_item >> $out1;
        echo "Single item"
    #else
        cat $f1.forward $f1['reverse'] >> $out1;
        echo "Paired items"
    #end if

jmchilton · 2025-01-07T14:26:22Z

on the cost of complicating the command block

I would take the trade - the point of tools in some ways is to transfer the burden of knowledge and of managing complexity away from users and onto the expertise of tool authors. But I agree wholeheartedly with Marius that this PR does not actually complicate the command block. I think this eases the tool complexity. It lets you write simpler XML blocks and eliminate certain classes of conditionals. A simpler XML block means simpler nesting when referring to parameters in the command block, and moving an if from outside the for to inside the for is not more complex.

Here is the XML I came up with - I generated a dummy spades wrapper with no domain knowledge, just staring at the spades docs and not understanding any of it - so obviously let me know if this is all bullshit:

Before this PR:

<tool id="spades" name="SPAdes" version="3.15.4">
    <description>SPAdes genome assembler</description>
    <requirements>
        <requirement type="package" version="3.15.4">spades</requirement>
    </requirements>
    <command>
        spades.py
        #if $inputs.is_pooled == "yes":
            #if $inputs.paired.is_paired == "yes":
                #for $input in $inputs.paired.inputs_paired:
                    -pe-1 "${input.forward}"
                    -pe-2 "${input.reverse}"
                #end for
            #else
                #for $input in $inputs.paired.inputs_unpaired:
                    -s "${input}"
                #end for
            #end if
        #end if
        -o ${output}
    </command>
    <inputs>
        <conditional name="inputs" type="select">
            <param name="is_pooled" type="select" label="Are the inputs pooled?">
                <option value="yes">Yes</option>
                <option value="no">No</option>
            </param>
            <when value="yes">
                <conditional name="paired" type="select">
                    <param name="is_paired" type="select" label="Are the inputs paired?">
                        <option value="yes">Yes</option>
                        <option value="no">No</option>
                    </param>
                    <when value="yes">
                        <param name="inputs_paired" type="data_collection" collection_type="list:paired" format="fastqsanger" label="Reads"/>
                    </when>
                    <when value="no">
                        <param name="inputs_unpaired" type="data" multiple="true" format="fastqsanger" label="Reads"/>
                    </when>
                </conditional>
            </when>
            <when value="no">
                <!-- this branch will still be simplified but requires more work of the tool form
                    variety which is very scary. See comments at top of https://github.com/galaxyproject/galaxy/issues/19359. -->
            </when>
        </conditional>
    </inputs>
    <outputs>
        <data name="output" format="txt" label="SPAdes Output"/>
    </outputs>
    <help>
        This tool runs SPAdes genome assembler.
    </help>
</tool>

After this PR:

<tool id="spades" name="SPAdes" version="3.15.4">
    <description>SPAdes genome assembler</description>
    <requirements>
        <requirement type="package" version="3.15.4">spades</requirement>
    </requirements>
    <command>
        spades.py
        #if $inputs.is_pooled == "yes":
            #for $input in $inputs.inputs:
                #if $input.has_single_item:
                    -s "${input.single_item}"
                #else
                    -pe-1 "${input.forward}"
                    -pe-2 "${input.reverse}"
                #end if
            #end for
        #end if
        -o ${output}
    </command>
    <inputs>
        <conditional name="inputs" type="select">
            <param name="is_pooled" type="select" label="Are the inputs pooled?">
                <option value="yes">Yes</option>
                <option value="no">No</option>
            </param>
            <when value="yes">
                <param name="inputs" type="data_collection" collection_type="list:paired_or_unpaired" format="fastqsanger" label="Reads" />
            </when>
            <when value="no">
                <!-- this branch will still be simplified but requires more work of the tool form
                    variety which is very scary. See comments at top of https://github.com/galaxyproject/galaxy/issues/19359. -->
            </when>
        </conditional>
    </inputs>
    <outputs>
        <data name="output" format="txt" label="SPAdes Output"/>
    </outputs>
    <help>
        This tool runs SPAdes genome assembler.
    </help>
</tool>

There are subtle differences here, since you cannot send individual datasets to the second version but you can to the first one. The ability to use individual datasets is in some ways preserved in 24.2 because the tool form has collection creators built in (well, I guess only workflows, but we can make that work with tools also, no problem). But we want this kind of data in collections, I think, and for that the second version is just all win. The tool XML is much more concise, the command block is a bit simpler because the references to variables aren't as complex, and we allow a whole new class of usage where the pooling has some paired and some unpaired data.

This isn't a panacea; there is still too much complexity in these wrappers due to limitations in the tool form and the API that we can and should address. But I think this is a solid step forward in simplifying that complexity while also allowing new kinds of applications that would be even messier to implement using existing abstractions.
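For readers less familiar with Cheetah, the loop in the second command block can be mirrored in plain Python. This is only a sketch: the dictionary-based elements and the `spades_args` function are hypothetical, and the `-s`, `-pe-1`, and `-pe-2` flags are copied from the dummy wrapper above rather than checked against the SPAdes documentation.

```python
def spades_args(elements, output):
    """Build an argument list from a mixed list of paired/unpaired elements."""
    args = ["spades.py"]
    for element in elements:
        if "unpaired" in element:
            # Mirrors the has_single_item branch of the Cheetah block.
            args += ["-s", element["unpaired"]]
        else:
            args += ["-pe-1", element["forward"], "-pe-2", element["reverse"]]
    args += ["-o", output]
    return args
```

A single loop handles all-paired, all-unpaired, and mixed inputs, which is the new class of usage mentioned above.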

jmchilton · 2025-01-07T14:59:34Z

Docs are wonderful. Foldable examples seem to be a good idea, but the sentences after each of the examples seems to refer to the example itself, or?

I didn't actually expand any of the examples - all that appeared in those screenshots was the text - because the formatting is still not great. I think the math is useful but hard to render properly.

bernt-matthias · 2025-01-07T17:44:55Z

Thanks for the detailed explanations. Makes much more sense to me now.

This uses the wizard introduced by David for workflow exports instead of adapting the upload paradigm. The old component was very compact but also very unintuitive and just hacked together as quickly as possible and never really revisited. The wizard provides more space to explain things and is probably the way we want to go but still needs some serious love from someone better at UI than me. I need a component like this... but different for sample sheet seeding so I wanted to do a refresh and get something we're all more comfortable with ahead of that.

jmchilton added area/UI-UX kind/feature area/API area/dataset-collections area/tool-framework highlight Included in user-facing release notes at the top labels Jan 6, 2025

jmchilton force-pushed the fixed_length_collections branch from 4c8add2 to aec3d8c Compare January 7, 2025 14:57

jmchilton force-pushed the fixed_length_collections branch from aec3d8c to 93702a4 Compare January 7, 2025 16:46

jmchilton force-pushed the fixed_length_collections branch 2 times, most recently from 1e435ec to b857aa9 Compare January 8, 2025 15:32

jmchilton mentioned this pull request Jan 9, 2025

[24.2] Fix to not display upload when creating collections from existing datasets. #19373

Merged

2 tasks

jmchilton added 13 commits January 9, 2025 10:46

Refactor collection name input out of CollectionCreator.

02eeb09

Refactor the source options out of collection creator.

bed1d57

Refactor collection creator help out to simplify.

e0fd0ad

Refactor showing of extensions into own component.

d6a19d0

Refactor collection creator footer buttons into a component.

6806a43

Refactor no items message out of collection creator for reuse.

f4a3e57

Don't show upload files option when selecting datasets.

75bb1f5

Remove unused mixin...

265fed5

Comment out unused style...

2742343

Refactor paired list creator for reuse, simplicity.

00c9116

Simplification to paired list creator.

3b78f3e

Migrate more of list collection builder out for reuse.

56d4265

Composable to simplify CollectionCreator, reuse.

4447a86

This typing stuff and the details of the API call used are not details CollectionCreator.vue should be worried about I think.

jmchilton added 24 commits January 9, 2025 10:46

Code de-duplication for element filtering using composable.

1f53ef9

Spelling fix in ListCollectionCreator.

fe60a1f

Cleanup list:paired builder comments...

8571700

bugfix: list collection builder should strip extensions...

8d8658d

useCollectionCreator for de-duplication.

28961a8

Build dataset collection input definition on the client.

9f19f26

Gray out repeat buttons that don't make sense.

2b5ce4b

Reuse in FormData.vue.

e487e5b

Decompose FormRepeat for reuse.

8fa4467

Event name: clicked-create -> on-create

8d46cbc

Reformat list collection creator...

a6271bf

Better typing around collection API calls in client.

0445d06

Allow swapping handsontable with AG Grid Community in Rule Builder.

2deaec9

Implement list building wizard.

346a04e

[WIP] Implement records - heterogenous dataset collections.

1275b58

Existing dataset collection types are meant to be homogeneous - all datasets of the same type. This introduces CWL-style record dataset collections.

record ui.

db4ef5d

Database migration for fixed length collection migraton.

906f674

Migrate doctests to unit tests.

1e5bb8b

More dataset collection unit testing...

f81f8ba

Start trying to formalize dataset collection semantics.

6e097ed

Implement paired_or_unpaired collections...

d109975

Activity for rule builder imports.

416f5a8

Remove individual collection builders from history dropdown.

7f2b5cb

jmchilton force-pushed the fixed_length_collections branch from b857aa9 to 7f2b5cb Compare January 9, 2025 15:56
