Let file sources choose a path for uploaded files #19154

kysrpex · 2024-11-18T09:06:31Z

The method write_from() of SingleFileSource and BaseFilesSource reads a local file from native_path and saves it to target_path on a file source.

This PR enables the service backing the file source to choose which will be the path of the saved file, meaning that target_path and the actual path where the file can be recovered later do not have to match. The latter is the return value of write_from().

Therefore, all usages of write_from() have also been refactored to consider the paths chosen by the file source's backing service. In addition, when exporting a history, the URI that the service backing the file source assigns to it will be saved to the history export result metadata object (StoreExportAssociation.export_metadata).

The changes from this PR are meant to address a fundamental design mismatch between Galaxy and eLabFTW that impedes the smooth integration of the latter. Galaxy file sources are designed to save files to a target_uri which is set before the file is saved, partly by the user, who chooses the basename in the URI. On the other hand, files can be retrieved from eLabFTW via an entity_type, entity_id and upload_id, and the latter two are set by eLabFTW. There is no good mapping between target_uri and entity_type, entity_id and upload_id that can work without breaking the assumption that "saving a file on path x guarantees that it can be retrieved later using x", which is precisely what this PR does. The issue #18665 includes a more thorough explanation of this design mismatch.

This is the first PR of a series of PRs that integrate eLabFTW with Galaxy via a file source (together they address issue #18665):

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

The method `write_from()` of `SingleFileSource` and `BaseFilesSource` reads a local file from `native_path` and saves it to `target_path` on a file source. This commit allows the service backing the file source to choose which will be the path of the saved file, meaning that `target_path` and the actual path where the file can be recovered later do not have to match. The latter is the return value of `write_from()`. Therefore, all usages of `write_from()` have also been refactored to consider the paths chosen by the file source's backing service. In addition, when exporting a history, the URI that the service backing the file source assigns to it will be saved to the history export result metadata object.

jmchilton · 2024-11-18T16:38:19Z

I've read through the attached issue and I am really nervous about this approach. It seems like there is a mismatch because the abstractions are pretty different. It seems like a augmenting the model import/export process would make more sense than this lower level abstraction. Do have the details of the plugin worked through that prompted this change? David's comment on that issue seems to vaguely point in this direction as well.

kysrpex · 2024-11-25T11:32:05Z

I've read through the attached issue and I am really nervous about this approach. It seems like there is a mismatch because the abstractions are pretty different. It seems like a augmenting the model import/export process would make more sense than this lower level abstraction. Do have the details of the plugin worked through that prompted this change? David's comment on that issue seems to vaguely point in this direction as well.

Sorry for not answering earlier, I was out of office for most of the week.

Indeed there is a very large mismatch between the abstractions from Galaxy and eLabFTW. Galaxy works with a unique URI that is fixed before uploading to the file source, whose basename is chosen by the user; eLabFTW works with a pair of autoincrementing identifiers that are fixed after creating the experiment or resource or uploading the file and a string enum with two choices (i.e. experiment or item).

About what @davelopez commented on the issue

I need to study the case a bit, but as a first impression, this case clearly will need a new special UI entry here

This UI will have to create the needed entities before the "export" similar to what the RDM file sources are doing. Then, once you have a proper URI that identifies the target entity (something like: elabftw://{elab_url}/entity_type/entity_id/upload_id) perform the upload in the backend. I don't know if that is possible, I haven't checked the eLabFTW API but that could be a potential solution.

You can create entities via eLabFTW, and you can have as many files attached to them as you wish, therefore you do not necessarily need to create an entity to export a file, in fact most of the time you will be interested in exporting files to an existing entity, because they have a lot of metadata that should be filled through the eLabFTW interface.

Do have the details of the plugin worked through that prompted this change?

I am not sure if you are referring to the design of the plugin or the code of the plugin. The latter is almost ready here, although I do not think it makes sense to have a look yet (let's tackle problems step by step, there is another large problem/mismatch to discuss later). What prompts the change when working the plugin through is the moment one has to choose a valid mapping between eLabFTW entity type, entity id and upload id and Galaxy URI that respects their properties:

Galaxy URI
- Pathname fixed by Galaxy, basename chosen by the user, both before exporting the file (the user-set basename is not a problem, it can just be appended to the URI and ignored)
- File can be retrieved later on the same URI
eLabFTW (entity type, entity id and upload id)
- Autoincrementing entity ids fixed by eLabFTW after creating the entity, next id unpredictable (can be known only by the admin, problems with concurrently creating entities)
- Autoincrementing upload ids fixed by eLabFTW after uploading the file, next id unpredictable (problems with concurrent uploads)

In other words, the simple part of the problem is "Construct an injective function from the set of nonnegative integer 2-tuples to the set of strings complying with the definition of a path according to RFC 3986". The impossible part of the problem is that Galaxy needs to apply the inverse function without having anything to apply it to!

Anyway, @davelopez wants to have closer look at this together with me to see if the change from this PR can be avoided. If a better solution can be found then fine, otherwise augmenting the model import/export process would make sense, and I do not mind putting the effort. Let's first have this closer look and then we may continue the discussion.

Change the implementation so that the definitions of `_write_from` in classes that inherit from `BaseFilesSource` do not need to change.

Define a new class `FileSourceModelExportStore` that abstracts the common details of `BcoModelExportStore`, `ROCrateArchiveModelExportStore`, `TarModelExportStore` and `BagArchiveModelExportStore`. This new class manages exports to file sources, from where data can be retrieved later on via a URI. It takes the responsibility of creating a temporary directory to set up the file to export and uploading it to the file source.

kysrpex · 2024-11-28T12:50:40Z

lib/galaxy/managers/model_stores.py

-        export_metadata = self.set_history_export_request_metadata(request)
+
+        exception_exporting_history: Optional[Exception] = None
        try:
-            with model.store.get_export_store_factory(
+            export_store = model.store.get_export_store_factory(
                self._app, model_store_format, export_files=export_files, user_context=user_context
-            )(target_uri) as export_store:
+            )(target_uri)
+            with export_store:
                history = self._history_manager.by_id(request.history_id)
                export_store.export_history(
                    history, include_hidden=request.include_hidden, include_deleted=request.include_deleted
                )
-                self.set_history_export_result_metadata(request.export_association_id, export_metadata, success=True)
+            request.target_uri = str(export_store.file_source_uri) or request.target_uri
        except Exception as e:
+            exception_exporting_history = e
+            raise
+        finally:
+            export_metadata = self.set_history_export_request_metadata(request)
            self.set_history_export_result_metadata(
-                request.export_association_id, export_metadata, success=False, error=str(e)
+                request.export_association_id,
+                export_metadata,
+                success=not bool(exception_exporting_history),
+                error=str(exception_exporting_history) if exception_exporting_history else None,
            )
-            raise


Before applying the set of changes from 53012bf, the method ModelStoreManager.set_history_export_request_metadata() instantiates a ExportObjectMetadata Pydantic model and dumps it to the database in the form of JSON as the field StoreExportAssociation.export_metadata. After the export is complete, the method set_history_export_result_metadata() takes the same instance of ExportObjectMetadata, instantiates a ExportObjectResultMetadata Pydantic model, sets it as the result_data of the ExportObjectMetadata instance, and then saves the ExportObjectMetadata Pydantic model in the form of JSON to the database again.

After applying the set of changes, the call to ModelStoreManager.set_history_export_request_metadata() is delayed until the file has already been saved to the file source, as the actual URI that the file source assigns to the file is otherwise unknown.

The URI assigned by the file source overwrites the original target URI in the request. This involves a slight deviation from the previous behavior: if for example, power gets cut at the right time, StoreExportAssociation.export_metadata may not exist despite the history having been already saved to the file source, because database writes happen within the finally: block.

Moreover, overwriting the original target URI from the request is formally wrong, because the actual URI assigned by the file source should be part of the export result metadata, as it becomes known when the export completes. However, that implies modifying the other parts of the codebase that reference the URI from the request.

Despite the slight deviation in behavior and the formal incorrectness, rather than jumping straight into fixing these issues, I think it makes sense to leave the chance for discussion open, as doing things this way may still be an interesting tradeoff. Let me know what you think.

I was thinking...

After applying the set of changes, the call to ModelStoreManager.set_history_export_request_metadata() is delayed until the file has already been saved to the file source, as the actual URI that the file source assigns to the file is otherwise unknown.

If we merge the PR as it is, then we'd never see an export that is in progress in the list from the UI. If the file is large, that alone would justify attempting to fix

overwriting the original target URI from the request is formally wrong, because the actual URI assigned by the file source should be part of the export result metadata, as it becomes known when the export completes

right? I guess it makes sense to make an attempt.

kysrpex · 2024-11-28T12:58:06Z

@jmchilton After discussing with @davelopez, it seems that it is hard to avoid that write_from() returns the URI assigned by the file source, because this is precisely when this information is provided by the "external system". There was however room for simplification and avoiding to touch the subclasses (see 5926778). It also made sense to remove a few lines of code by introducing a new abstraction FileSourceModelExportStore (see a88ffd0). I still do not like that the URI assigned by the file source gets saved as an attribute of the context manager, imo it would make sense to have a more "functional" approach, but at the same time I do not see how else could it be done.

jmchilton · 2024-12-13T20:20:07Z

I appreciate you taking the time to at least simplify the implementation classes with that change being applied in the base class. I'm still very uneasy but I will live - especially if @davelopez is okay with this approach.

davelopez · 2024-12-16T12:57:36Z

Thanks for the trust @jmchilton 😄 🙏 I know it is somewhat of an API change and we are streching a bit the original implementation, but I think it is safe to assume the normal case would be returning the same target_path from the write_from method in 99.9% of the cases which should be harmless. If it doesn't, the plugin should be responsible for handling it as in this particular case that motivated the change.

kysrpex self-assigned this Nov 18, 2024

github-actions bot added area/testing area/database Galaxy's database or data access layer area/tool-framework labels Nov 18, 2024

kysrpex added area/backend and removed area/testing area/database Galaxy's database or data access layer area/tool-framework labels Nov 18, 2024

kysrpex force-pushed the file_sources_assign_paths branch 3 times, most recently from 4d4b9a6 to bc3b95a Compare November 18, 2024 09:47

kysrpex force-pushed the file_sources_assign_paths branch from bc3b95a to 53012bf Compare November 18, 2024 16:02

kysrpex added 4 commits November 28, 2024 12:04

Keep BaseFilesSource API backwards compatible

5926778

Change the implementation so that the definitions of `_write_from` in classes that inherit from `BaseFilesSource` do not need to change.

Ignore mypy [arg-type] errors on unmodified functions

62c8955

Merge branch 'dev' into file_sources_assign_paths

6fe9264

kysrpex commented Nov 28, 2024

View reviewed changes

kysrpex changed the title ~~[WIP] Let file sources choose a path for uploaded files~~ Let file sources choose a path for uploaded files Nov 28, 2024

kysrpex marked this pull request as ready for review November 28, 2024 13:04

github-actions bot added this to the 25.0 milestone Nov 28, 2024

kysrpex mentioned this pull request Dec 5, 2024

Support listing file sources asynchronously #19256

Closed

4 tasks

kysrpex mentioned this pull request Dec 12, 2024

eLabFTW integration via Galaxy file source #19319

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Let file sources choose a path for uploaded files #19154

Let file sources choose a path for uploaded files #19154

kysrpex commented Nov 18, 2024 •

edited

Loading

jmchilton commented Nov 18, 2024

kysrpex commented Nov 25, 2024

kysrpex Nov 28, 2024 •

edited

Loading

kysrpex Dec 17, 2024

kysrpex commented Nov 28, 2024 •

edited

Loading

jmchilton commented Dec 13, 2024

davelopez commented Dec 16, 2024

Let file sources choose a path for uploaded files #19154

Are you sure you want to change the base?

Let file sources choose a path for uploaded files #19154

Conversation

kysrpex commented Nov 18, 2024 • edited Loading

How to test the changes?

License

jmchilton commented Nov 18, 2024

kysrpex commented Nov 25, 2024

kysrpex Nov 28, 2024 • edited Loading

Choose a reason for hiding this comment

kysrpex Dec 17, 2024

Choose a reason for hiding this comment

kysrpex commented Nov 28, 2024 • edited Loading

jmchilton commented Dec 13, 2024

davelopez commented Dec 16, 2024

kysrpex commented Nov 18, 2024 •

edited

Loading

kysrpex Nov 28, 2024 •

edited

Loading

kysrpex commented Nov 28, 2024 •

edited

Loading