Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let file sources choose a path for uploaded files #19154

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

kysrpex
Copy link
Contributor

@kysrpex kysrpex commented Nov 18, 2024

The method write_from() of SingleFileSource and BaseFilesSource reads a local file from native_path and saves it to target_path on a file source.

This PR enables the service backing the file source to choose which will be the path of the saved file, meaning that target_path and the actual path where the file can be recovered later do not have to match. The latter is the return value of write_from().

Therefore, all usages of write_from() have also been refactored to consider the paths chosen by the file source's backing service. In addition, when exporting a history, the URI that the service backing the file source assigns to it will be saved to the history export result metadata object (StoreExportAssociation.export_metadata).

The changes from this PR are meant to address a fundamental design mismatch between Galaxy and eLabFTW that impedes the smooth integration of the latter. Galaxy file sources are designed to save files to a target_uri which is set before the file is saved, partly by the user, who chooses the basename in the URI. On the other hand, files can be retrieved from eLabFTW via an entity_type, entity_id and upload_id, and the latter two are set by eLabFTW. There is no good mapping between target_uri and entity_type, entity_id and upload_id that can work without breaking the assumption that "saving a file on path x guarantees that it can be retrieved later using x", which is precisely what this PR does. The issue #18665 includes a more thorough explanation of this design mismatch.

This is the first PR of a series of PRs that integrate eLabFTW with Galaxy via a file source (together they address issue #18665):

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@kysrpex kysrpex self-assigned this Nov 18, 2024
@github-actions github-actions bot added area/testing area/database Galaxy's database or data access layer area/tool-framework labels Nov 18, 2024
@kysrpex kysrpex added area/backend and removed area/testing area/database Galaxy's database or data access layer area/tool-framework labels Nov 18, 2024
@kysrpex kysrpex force-pushed the file_sources_assign_paths branch 3 times, most recently from 4d4b9a6 to bc3b95a Compare November 18, 2024 09:47
The method `write_from()` of `SingleFileSource` and `BaseFilesSource` reads a local file from `native_path` and saves it to `target_path` on a file source.

This commit allows the service backing the file source to choose which will be the path of the saved file, meaning that `target_path` and the actual path where the file can be recovered later do not have to match. The latter is the return value of `write_from()`.

Therefore, all usages of `write_from()` have also been refactored to consider the paths chosen by the file source's backing service. In addition, when exporting a history, the URI that the service backing the file source assigns to it will be saved to the history export result metadata object.
@kysrpex kysrpex force-pushed the file_sources_assign_paths branch from bc3b95a to 53012bf Compare November 18, 2024 16:02
@jmchilton
Copy link
Member

I've read through the attached issue and I am really nervous about this approach. It seems like there is a mismatch because the abstractions are pretty different. It seems like a augmenting the model import/export process would make more sense than this lower level abstraction. Do have the details of the plugin worked through that prompted this change? David's comment on that issue seems to vaguely point in this direction as well.

@kysrpex
Copy link
Contributor Author

kysrpex commented Nov 25, 2024

I've read through the attached issue and I am really nervous about this approach. It seems like there is a mismatch because the abstractions are pretty different. It seems like a augmenting the model import/export process would make more sense than this lower level abstraction. Do have the details of the plugin worked through that prompted this change? David's comment on that issue seems to vaguely point in this direction as well.

Sorry for not answering earlier, I was out of office for most of the week.

Indeed there is a very large mismatch between the abstractions from Galaxy and eLabFTW. Galaxy works with a unique URI that is fixed before uploading to the file source, whose basename is chosen by the user; eLabFTW works with a pair of autoincrementing identifiers that are fixed after creating the experiment or resource or uploading the file and a string enum with two choices (i.e. experiment or item).

About what @davelopez commented on the issue

I need to study the case a bit, but as a first impression, this case clearly will need a new special UI entry here
screenshot
This UI will have to create the needed entities before the "export" similar to what the RDM file sources are doing. Then, once you have a proper URI that identifies the target entity (something like: elabftw://{elab_url}/entity_type/entity_id/upload_id) perform the upload in the backend. I don't know if that is possible, I haven't checked the eLabFTW API but that could be a potential solution.

You can create entities via eLabFTW, and you can have as many files attached to them as you wish, therefore you do not necessarily need to create an entity to export a file, in fact most of the time you will be interested in exporting files to an existing entity, because they have a lot of metadata that should be filled through the eLabFTW interface.

Do have the details of the plugin worked through that prompted this change?

I am not sure if you are referring to the design of the plugin or the code of the plugin. The latter is almost ready here, although I do not think it makes sense to have a look yet (let's tackle problems step by step, there is another large problem/mismatch to discuss later). What prompts the change when working the plugin through is the moment one has to choose a valid mapping between eLabFTW entity type, entity id and upload id and Galaxy URI that respects their properties:

  • Galaxy URI
    • Pathname fixed by Galaxy, basename chosen by the user, both before exporting the file (the user-set basename is not a problem, it can just be appended to the URI and ignored)
    • File can be retrieved later on the same URI
  • eLabFTW (entity type, entity id and upload id)
    • Autoincrementing entity ids fixed by eLabFTW after creating the entity, next id unpredictable (can be known only by the admin, problems with concurrently creating entities)
    • Autoincrementing upload ids fixed by eLabFTW after uploading the file, next id unpredictable (problems with concurrent uploads)

In other words, the simple part of the problem is "Construct an injective function from the set of nonnegative integer 2-tuples to the set of strings complying with the definition of a path according to RFC 3986". The impossible part of the problem is that Galaxy needs to apply the inverse function without having anything to apply it to!

Anyway, @davelopez wants to have closer look at this together with me to see if the change from this PR can be avoided. If a better solution can be found then fine, otherwise augmenting the model import/export process would make sense, and I do not mind putting the effort. Let's first have this closer look and then we may continue the discussion.

Change the implementation so that the definitions of `_write_from` in classes that inherit from `BaseFilesSource` do not need to change.
Define a new class `FileSourceModelExportStore` that abstracts the common details of `BcoModelExportStore`, `ROCrateArchiveModelExportStore`, `TarModelExportStore` and `BagArchiveModelExportStore`.

This new class manages exports to file sources, from where data can be retrieved later on via a URI. It takes the responsibility of creating a temporary directory to set up the file to export and uploading it to the file source.
Comment on lines -217 to -231
export_metadata = self.set_history_export_request_metadata(request)

exception_exporting_history: Optional[Exception] = None
try:
with model.store.get_export_store_factory(
export_store = model.store.get_export_store_factory(
self._app, model_store_format, export_files=export_files, user_context=user_context
)(target_uri) as export_store:
)(target_uri)
with export_store:
history = self._history_manager.by_id(request.history_id)
export_store.export_history(
history, include_hidden=request.include_hidden, include_deleted=request.include_deleted
)
self.set_history_export_result_metadata(request.export_association_id, export_metadata, success=True)
request.target_uri = str(export_store.file_source_uri) or request.target_uri
except Exception as e:
exception_exporting_history = e
raise
finally:
export_metadata = self.set_history_export_request_metadata(request)
self.set_history_export_result_metadata(
request.export_association_id, export_metadata, success=False, error=str(e)
request.export_association_id,
export_metadata,
success=not bool(exception_exporting_history),
error=str(exception_exporting_history) if exception_exporting_history else None,
)
raise
Copy link
Contributor Author

@kysrpex kysrpex Nov 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before applying the set of changes from 53012bf, the method ModelStoreManager.set_history_export_request_metadata() instantiates a ExportObjectMetadata Pydantic model and dumps it to the database in the form of JSON as the field StoreExportAssociation.export_metadata. After the export is complete, the method set_history_export_result_metadata() takes the same instance of ExportObjectMetadata, instantiates a ExportObjectResultMetadata Pydantic model, sets it as the result_data of the ExportObjectMetadata instance, and then saves the ExportObjectMetadata Pydantic model in the form of JSON to the database again.

After applying the set of changes, the call to ModelStoreManager.set_history_export_request_metadata() is delayed until the file has already been saved to the file source, as the actual URI that the file source assigns to the file is otherwise unknown.

The URI assigned by the file source overwrites the original target URI in the request. This involves a slight deviation from the previous behavior: if for example, power gets cut at the right time, StoreExportAssociation.export_metadata may not exist despite the history having been already saved to the file source, because database writes happen within the finally: block.

Moreover, overwriting the original target URI from the request is formally wrong, because the actual URI assigned by the file source should be part of the export result metadata, as it becomes known when the export completes. However, that implies modifying the other parts of the codebase that reference the URI from the request.

Despite the slight deviation in behavior and the formal incorrectness, rather than jumping straight into fixing these issues, I think it makes sense to leave the chance for discussion open, as doing things this way may still be an interesting tradeoff. Let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking...

After applying the set of changes, the call to ModelStoreManager.set_history_export_request_metadata() is delayed until the file has already been saved to the file source, as the actual URI that the file source assigns to the file is otherwise unknown.

If we merge the PR as it is, then we'd never see an export that is in progress in the list from the UI. If the file is large, that alone would justify attempting to fix

overwriting the original target URI from the request is formally wrong, because the actual URI assigned by the file source should be part of the export result metadata, as it becomes known when the export completes

right? I guess it makes sense to make an attempt.

@kysrpex
Copy link
Contributor Author

kysrpex commented Nov 28, 2024

@jmchilton After discussing with @davelopez, it seems that it is hard to avoid that write_from() returns the URI assigned by the file source, because this is precisely when this information is provided by the "external system". There was however room for simplification and avoiding to touch the subclasses (see 5926778). It also made sense to remove a few lines of code by introducing a new abstraction FileSourceModelExportStore (see a88ffd0). I still do not like that the URI assigned by the file source gets saved as an attribute of the context manager, imo it would make sense to have a more "functional" approach, but at the same time I do not see how else could it be done.

@kysrpex kysrpex changed the title [WIP] Let file sources choose a path for uploaded files Let file sources choose a path for uploaded files Nov 28, 2024
@kysrpex kysrpex marked this pull request as ready for review November 28, 2024 13:04
@github-actions github-actions bot added this to the 25.0 milestone Nov 28, 2024
@jmchilton
Copy link
Member

I appreciate you taking the time to at least simplify the implementation classes with that change being applied in the base class. I'm still very uneasy but I will live - especially if @davelopez is okay with this approach.

@davelopez
Copy link
Contributor

Thanks for the trust @jmchilton 😄 🙏 I know it is somewhat of an API change and we are streching a bit the original implementation, but I think it is safe to assume the normal case would be returning the same target_path from the write_from method in 99.9% of the cases which should be harmless. If it doesn't, the plugin should be responsible for handling it as in this particular case that motivated the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants