[WIP] Allow dataset inputs via {src: "url", ...} dictionaries. #18797

Closed
wants to merge 10 commits

Conversation

@jmchilton (Member) commented Sep 10, 2024

I'm trying to keep this highly aligned with the data fetch API. Hence {"src": "url", ...} and not {"src": "uri", ...}, as well as "ext" instead of "file_type" and "dbkey" instead of "genome_build". I'll try to implement the rest of the options from that API as well. So far I have implemented deferred: true/false, in both tools and workflows. The implementation differs slightly between deferred and undeferred datasets; both implementations are discussed more below.
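For illustration, a dataset input expressed this way might look like the following. This is a sketch only; the exact set of supported fields beyond those named above (src, url, ext, dbkey, deferred) is an assumption:

```python
# Sketch of a {"src": "url", ...} dataset input, mirroring the data fetch API's
# vocabulary ("ext" rather than "file_type", "dbkey" rather than "genome_build").
dataset_input = {
    "src": "url",
    "url": "https://example.org/data/sample.fastq.gz",
    "ext": "fastqsanger.gz",
    "dbkey": "hg38",
    # When true, the HDA stays deferred and is only materialized when needed.
    "deferred": True,
}
```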

Tool Implementation

  • The tool implementation requires the new/unmerged "tool request" API (xref [WIP] Implement Tool Request API #18745).
  • This can work either by keeping the HDAs deferred until the job runs or by "materializing" them in the Celery task and handing them off as normal ("ok") HDAs to the rest of the job creation/tool execution stack. There are advantages to both approaches, but keeping them deferred isn't going to be super useful for main until we make Pulsar work with deferred datasets.
  • Building on the tool state work was a joy compared to basic.py; everything is very regimented and validated at each step. It is great to see the idea play out so cleanly in a new application.
  • The Celery task is getting a bit bulky. Even with all the new features added, this is still much better than what currently happens in a web thread, because the work runs in Celery workers. Still, a fun future direction might be to use task chaining to break the processing happening in queue_jobs into multiple tasks. A sketch of what such a tool request might look like follows this list.
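Here is a hypothetical sketch of submitting a tool request with a URL-backed input. The route, payload shape, and IDs are assumptions modeled on the unmerged tool request API (#18745), not a documented interface:

```python
import requests

# Hypothetical sketch only: route and payload shape are assumptions based on
# the unmerged tool request API (#18745), not a documented interface.
GALAXY_URL = "https://usegalaxy.example.org"
API_KEY = "<your API key>"

payload = {
    "tool_id": "cat1",  # any tool id; "cat1" is just an example
    "history_id": "<history id>",
    "inputs": {
        "input1": {
            "src": "url",
            "url": "https://example.org/data/sample.txt",
            "ext": "txt",
            # undeferred: materialized in the Celery task, then handed to the
            # job creation/tool execution stack as a normal ("ok") HDA
            "deferred": False,
        }
    },
}

response = requests.post(
    f"{GALAXY_URL}/api/tools",  # assumed route; the tool request API may differ
    headers={"x-api-key": API_KEY},
    json=payload,
)
response.raise_for_status()
print(response.json())
```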

Workflow Implementation

  • The workflow mode just creates the needed HDAs during workflow queueing in the web process. This isn't intrinsically any more computationally intensive than the operations for copying inputs into a new history or copying library datasets into the history, all of which currently happen in the web thread.
  • Deferred datasets are ready to go to the job handlers as-is. If the datasets aren't sent to the API as deferred, a new step in the workflow invocation iteration is responsible for "materializing" them. The workflow invocation is created in a "requires_materialization" state; it will never be set back to this state, so this shouldn't slow down scheduling a workflow invocation after its first iteration. A sketch of a workflow invocation with a URL-backed input follows this list.
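A hypothetical sketch of the workflow side, using the standard invocation route. The input field names here mirror the tool example above and are assumptions until the API is finalized:

```python
import requests

# Hypothetical sketch only: input field names are assumptions; adjust to the
# actual API once this work is merged.
GALAXY_URL = "https://usegalaxy.example.org"
API_KEY = "<your API key>"
WORKFLOW_ID = "<workflow id>"

payload = {
    "history_id": "<history id>",
    "inputs": {
        "0": {  # workflow input index (or label); assumed here
            "src": "url",
            "url": "https://example.org/data/reads.fastq.gz",
            "ext": "fastqsanger.gz",
            # deferred datasets go to the job handlers as-is; undeferred ones
            # are materialized first via the "requires_materialization" step
            "deferred": True,
        }
    },
}

response = requests.post(
    f"{GALAXY_URL}/api/workflows/{WORKFLOW_ID}/invocations",
    headers={"x-api-key": API_KEY},
    json=payload,
)
response.raise_for_status()
print(response.json())
```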

Other Work & Broader Context

Builds on the tool request API (formerly #18745), part of the structured tool state work (#17393).

I've also started work on the tool and workflow "landing" concept at 08b6ed9. If we can create landing requests with URLs supplied for inputs - external sites will be able to provide really nice launch points into tools and workflows with their own hosted data.

That said... being able to just use these APIs without needing to understand how to "fetch" or "upload" data into Galaxy is a really nice win in its own right; this work doesn't depend on the "landing" concept to be useful.

How to test the changes?


License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jmchilton closed this Sep 17, 2024
@jmchilton mentioned this pull request Dec 2, 2024