Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empower Users to Build More Kinds of Collections, More Intelligently #19377

Draft
wants to merge 37 commits into
base: dev
Choose a base branch
from

Conversation

jmchilton
Copy link
Member

@jmchilton jmchilton commented Jan 6, 2025

Highlights

Mixed paired and unpaired collections.

The semantics are discussed in the collection semantics documentation included in this PR.

Scientists running tools and workflows can create lists of mixed paired and unpaired data using new list builder functionality but it is marked as advanced and not encouraged actively because my understanding is this is a less common structure for an analysis. For tool developers however, I think the motivation is much more clear. Tools that consume list of paired and unpaired datasets can be passed the homogenous versions of these lists (list, list:paired) in workflows by users. This should enable workflows that just naturally support paired and unpaired data without the use of conditionals, etc..

The PR includes collection builder UI, workflow editor UI, API, and tool support for this collection type. The PR includes example tools with tool tests. The PR also includes a collection operation tool that converts a list:paired_or_unpaired collection into two homogeneous lists - one of type list and one of type list:paired. If users do create collections in this fashion - the collection operation should hopefully serve as an adapter that could let them leverage existing tools and workflows that expect homogenous lists.

More General Collection Building

This PR replaces the PairedListCollectionCreator with a generalization that I believe is easier to use, easier to understand, and can build more kinds of collections - including the new lists of paired and unpaired (list:paired_or_unpaired) collections but also nested lists (list:list) and nested lists of pairs (list:list:paired).

This separates the auto-pairing from the manual curation and I think as a result it is much easier to document and to grok without documentation.

All three of these collections types are now registered with the form code so they can be used to create collections using the work of @ahmedhamidawan in #18857.

Collection Builders - Wizard

Supporting new advanced options in the UI is great but we cannot stick all those options in the history multi-select drop down and provide any sort of context for what they do or what options a researcher should pick. This was a good problem because I think already having to make a choice at that point was hard. Galaxy can look at the data and tell if it is paired (I think) so why does that need to be a choice. For this reason there are now two options in the history selection drop down. “Auto Build List” and “Advanced Build List”.

Both of these launch the same wizard component in the center panel but auto places you on the last page with all the options select and advanced places you on the first page with advanced options expanded. The wizard will check the supplied data and give a paired summary to manually tweak if it looks paired and just land you in the list builder if it does not.


Auto Builder Screenshots

The new dropdown option:

1_new_dropdown

If your data look paired, you land up here and can just enter a name and go.

2_auto_builder_paired_landing

It has expanded help like all the other builders. The help will change based on the collection type being built.

3_auto_builder_help

If the data is unpaired, the normal list builder is landed upon.

4_auto_builder_list

If you go back to the first step you get this description of the basic options.

5_list_builder_what

If you click the rule option, you get a builder embedded in the wizard instead of as its own modal as previously done.

6_list_wizard_rules

There are tailored warnings for duplicate identifiers and un-paired datasets.

11_duplicate_identifiers 10_unpaired_warning

Advanced Builder Screenshots

The advanced builder option lands here instead and requires the user to select a type. It will still highlight list:paired if the data looks paired.

7_advanced_list_builder_what

The configure pairing dialog is new here. It is ugly and could really use some help but I think it is more intelligible than the old options but more importantly... typically skipped. The simple version just skips this step and brings the user right into final step. This is a continuation of my ideas from #19253 - hide the advanced options in the typical case.

8_configure_pairing 9_configure_pairing_filters

New Rule Based Import Activity

My work on #19329 is included here and extended on. More real estate in the central panel for the wizard for creating lists worked - so I thought doing the same for the rule builder seeding could work also. This PR adds an optional activity called “Rule Based Imports” that has a wizard for setting up rule builder uploads.

When using the rule builder from the upload form there is a very tortured upload dialog thing implemented in RulesInput.vue for seeding the rule builder with URIs and metadata for working with. The old component was very compact but also very unintuitive and just hacked together as quickly as possible and never really revisited. The wizard provides more space to explain things.

The new component also implements support for just dropping files containing lists or tables of URIs right onto the “Paste Table” form.

The old component:

Screenshot 2024-12-16 at 9 32 00 AM

Screenshots of the new activity-based approach:

The landing for the new activity. The activity is off by default but can be turned on by adding it when managing activities.

12_rule_based_import

Select how to seed rules.

13_rule_based_what

This version of the dialog now has the option to just drop files right into the box. That is a new feature.

14_paste_or_drop

Rather then having a modal... close and open a new modal when done... this version just includes the rule builder right in the wizard for a more consistent experience I think.

15_rules_import

Collection semantics documentation

I think the semantics of how paired_or_unpaired works are easy to understand but the implementation is a bit tricky. As I was adding to our already numerous test cases for workflow connections and tool runtime stuff I realized we had okay test coverage but it was hard to understand or group how to extend that for the new concepts. terminals.test.ts is just a long unorganized list of usage cases without explanation, context, or organization. Our API tests are likewise disjointed. I implemented most of that I still felt lost every time I wanted to go add new tests.

My solution grew organically as I developed this but I think it is solid and can be extended in great ways in useful ways in the future. I have put together some plain English descriptions of collection mapping and reductions work within workflows and tools in a document called collection_semantics.yml. This is implemented as Markdown doc elements in the YAML. The document is then parsed and rendered into Markdown and included in Galaxy’s source docs.

In between the doc elements, I placed mathematical descriptions of general principles described in the English text that I called “examples”. The examples can correspond to test cases. The test cases can be rendered and included in the docs but they are hidden beneath details collapsibles by default to let the rendered document read more cleanly.

Each example also can have test case references attached. Currently two kinds of test case references are supported “tool_runtime” (just existing tool API test cases) and “workflow_editor” (references to it clauses from terminals.test.ts). This serves a really nice organization document for our test coverage and it makes it clear what still needs to be tested.

I’ve done several iterations of turning my math-y pseudo syntax for the test cases into modeled YAML. Each iteration allows me to render the examples in the Markdown better but it takes a lot of time and is sort of a niche activity. Ideally I would get to spend another few weeks to capture all the concepts in YAML, and generate cool diagrams of data structures in graphviz, place this information in the UI somewhere, auto-build the test cases in terminals.test.ts, generate new API test cases, extend test cases to include workflow_runtime API tests, etc… But I have to fight falling into the very fun side project, the point is we have important new documentation and a clear idea of what is covered and what still needs to be covered.

16_docs 17_docs_paired_or_unpaired

Record types - support heterogenous tuples of datasets

There is not an included user story here the way there is for paired_or_unpaired collection types - this is laying the foundations for user applications and developer goodies in other PRs. My work on sample sheets became very blocked on Galaxy not having heterogenous, tuple-style data structures for datasets. The database models support this usage but we didn’t have the tool usage, workflow inputs, collection plugin, etc…

One way to think about records is a generalization of paired collections. A paired collection is a record with field definitions of [{type: File, name: forward}, {type: File, name: reverse}] and paired_or_unpaired is sort of like [{type: File | null, name: forward}, {type: File | null, name: reverse}, {type: File | null, name: unpaired}] (sort of if you squint and ignore some details).

My sample sheet work went very far just assuming each row of the sample sheet corresponded to a single dataset. But workflow developers will want to collect metadata about pairs of datasets, optionally paired datasets, or other fixed combinations of files (two parents and a child, a control and a treatment, etc…). Hardening these fixed length collection types (records, paired_or_unpaired, paired) is I think needed to do a good job on building sample sheets.

Record types are used extensively in the CWL branch and integrating them has been on the todo list for the better part of a decade anyway.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@mschatz
Copy link
Member

mschatz commented Jan 6, 2025

Yes please! This looks incredibly useful!

@bernt-matthias
Copy link
Contributor

Mixed paired and unpaired collections

I'm worried that the tool form would be simplified (which might be great for the user) on the cost of complicating the command block.

Do you have an example of a tool where this would be of use? In particular where all the collection elements go to one CLI parameter (otherwise we need to split the collection elements again in paired and unpaired).

Docs are wonderful. Foldable examples seem to be a good idea, but the sentences after each of the examples seems to refer to the example itself, or?

@mvdbeek
Copy link
Member

mvdbeek commented Jan 7, 2025

Do you have an example of a tool where this would be of use?

Any tool that can take paired or single inputs, so almost all aligners.

on the cost of complicating the command block.

You're replacing a conditional cheetah block with another if statement that checks the input type. I don't think that changes the complexity. See the included example:

    #if $f1.has_single_item:
        cat $f1.single_item >> $out1;
        echo "Single item"
    #else
        cat $f1.forward $f1['reverse'] >> $out1;
        echo "Paired items"
    #end if

@jmchilton
Copy link
Member Author

jmchilton commented Jan 7, 2025

on the cost of complicating the command block

I would take the trade - the point of tools in some ways in to transfer requirements of knowledge and managing complexity from users and offload to the expertise of tool authors. But I think I agree whole heartedly with Marius that this does not do this. I think this eases the tool complexity. This lets you write simpler XML blocks I think and eliminate certain classes of conditionals. A simpler XML block means simpler nested when referring to parameters in the command block and swapping an if to the inside of the for instead of the outside of the for is not more complex.

Here is the the XML I came up with - I had generated a dummy spades wrapper using no knowledge and justing staring at the spades docs and not understanding any of it - so obviously let me know if this is all bullshit:

Before this PR:

<tool id="spades" name="SPAdes" version="3.15.4">
    <description>SPAdes genome assembler</description>
    <requirements>
        <requirement type="package" version="3.15.4">spades</requirement>
    </requirements>
    <command>
        spades.py
        #if $inputs.is_pooled == "yes":
            #if $inputs.paired.is_paired == "yes":
                #for $input in $inputs.paired.inputs_paired:
                    -pe-1 "${input.forward}"
                    -pe-2 "${input.reverse}"
                #end for
            #else
                #for $input in #inputs.paired.inputs_unpaired:
                    -s "${input}"
                #end for
            #end if
        #end if
        -o ${output}
    </command>
    <inputs>
        <conditional name="inputs" type="select">
            <param name="is_pooled" type="select" label="Are the inputs pooled?">
                <option value="yes">Yes</option>
                <option value="no">No</option>
            </param>
            <when value="yes">
                <conditional name="paired" type="select">
                    <param name="is_paired" type="select" label="Are the inputs paired?">
                        <option value="yes">Yes</option>
                        <option value="no">No</option>
                    </param>
                    <when value="yes">
                        <param name="inputs_paired" type="data_collection" collection_type="list:paired" format="fastqsanger" label="Reads"/>
                    </when>
                    <when value="no">
                        <param name="inputs_unpaired" type="data" multiple="true" format="fastqsanger" label="Reads"/>
                    </when>
                </conditional>
            </when>
            <when value="no">
                <!-- this branch will still be simplified but requires more work of the tool form
                    variety which is very scary. See comments at top of https://github.com/galaxyproject/galaxy/issues/19359. -->
            </when>
        </conditional>
    </inputs>
    <outputs>
        <data name="output" format="txt" label="SPAdes Output"/>
    </outputs>
    <help>
        This tool runs SPAdes genome assembler.
    </help>
</tool>

After this PR:

<tool id="spades" name="SPAdes" version="3.15.4">
    <description>SPAdes genome assembler</description>
    <requirements>
        <requirement type="package" version="3.15.4">spades</requirement>
    </requirements>
    <command>
        spades.py
        #if $inputs.is_pooled == "yes":
            #for $input in $inputs.inputs:
                #if $input.has_single_item:
                    -s "${input.single_item}"
                #else                
                    -pe-1 "${input.forward}"
                    -pe-2 "${input.reverse}"
                #end if
            #end if
        #end if
        -o ${output}
    </command>
    <inputs>
        <conditional name="inputs" type="select">
            <param name="is_pooled" type="select" label="Are the inputs pooled?">
                <option value="yes">Yes</option>
                <option value="no">No</option>
            </param>
            <when value="yes">
                <param name="inputs" type="data_collection" collection_type="list:paired_or_unpaired" format="fastqsanger" label="Reads" />
            </when>
            <when value="no">
                <!-- this branch will still be simplified but requires more work of the tool form
                    variety which is very scary. See comments at top of https://github.com/galaxyproject/galaxy/issues/19359. -->
            </when>
        </conditional>
    </inputs>
    <outputs>
        <data name="output" format="txt" label="SPAdes Output"/>
    </outputs>
    <help>
        This tool runs SPAdes genome assembler.
    </help>
</tool>

There are subtle differences here - since you cannot send individual datasets so the second list but you can to the first one. And the ability to use individual datasets in some ways is preserved in 24.2 because the tool has collection creators built in (well I guess only workflows but we can make that work with tools also no problem). But we want this kind of data in collections I think and for that the second version is just all win I think. The tool XML is much more concise, the command block is a bit more simpler because the references to variables aren't as complex, and we allow a whole new class of usage where the pooling has some paired and some unpaired data.

This isn't a panacea there is still too much complexity in these wrappers due to limitations in the tool form and the API that we can and should address - but I think this is a solid step forward at simplifying that complexity while also allowing new kinds of applications that would be even messier to implement using existing abstractions.

@jmchilton jmchilton force-pushed the fixed_length_collections branch from 4c8add2 to aec3d8c Compare January 7, 2025 14:57
@jmchilton
Copy link
Member Author

Docs are wonderful. Foldable examples seem to be a good idea, but the sentences after each of the examples seems to refer to the example itself, or?

I didn't actually expand any of the examples - all that appeared in those screenshots was the text - because the formatting is still not great. I think the math is useful but hard to render properly.

Screenshot 2025-01-07 at 9 57 48 AM

@jmchilton jmchilton force-pushed the fixed_length_collections branch from aec3d8c to 93702a4 Compare January 7, 2025 16:46
@bernt-matthias
Copy link
Contributor

Thanks for the detailed explanations. Makes much more sense to me now.

This uses the wizard introduced by David for workflow exports instead of adapting the upload paradigm. The old component was very compact but also very unintuitive and just hacked together as quickly as possible and never really revisited. The wizard provides more space to explain things and is probably the way we want to go but still needs some serious love from someone better at UI than me.

I need a component like this... but different for sample sheet seeding so I wanted to do a refresh and get something we're all more comfortable with ahead of that.
Existing dataset colleciton types are meant to be homogenous - all datasets of the same time. This introduces CWL-style record dataset collections.
@jmchilton jmchilton force-pushed the fixed_length_collections branch from b857aa9 to 7f2b5cb Compare January 9, 2025 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants