feat: dagster-obstore #27450

Open: wants to merge 10 commits into master

ion-elgreco (Contributor) commented Jan 30, 2025

Summary & Motivation

Obstore is a Python binding for the Rust crate object_store. It is much faster than boto3 and other equivalent implementations for interacting with cloud object stores.

It supports S3, Azure, and GCS from a single library.

How I Tested These Changes

I would need some help here. For AWS, I just copied the tests from dagster_aws. Is this fine?

I've manually tested the Azure and S3 integrations and they work: logs show up in the UI, and I can download the logs with presigned URLs.

Changelog

Insert changelog entry or delete this section.

ion-elgreco (Contributor Author) commented Jan 30, 2025

@dpeng817 would you mind taking a look at this?

It will simplify the ComputeLogManagers a lot, and won't require heavy dependencies. All of these stores can create presigned URLs, which provide the download links to the log files.
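
To illustrate why a presigned URL can serve as a download link without sharing credentials, here is a minimal, standard-library-only sketch of the general idea. It is not the signing scheme of S3, Azure, GCS, or obstore; `SECRET_KEY`, `presign`, `verify`, and the example host are hypothetical names chosen for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"example-secret"  # hypothetical key known only to the store


def _sign(path: str, expires_at: int) -> str:
    # Sign the method, path, and expiry so none of them can be altered.
    message = f"GET\n{path}\n{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def presign(base_url: str, path: str, expires_in: int, now: Optional[int] = None) -> str:
    """Build a URL that grants read access to `path` for `expires_in` seconds."""
    issued_at = int(time.time()) if now is None else now
    expires_at = issued_at + expires_in
    query = urlencode({"expires": expires_at, "signature": _sign(path, expires_at)})
    return f"{base_url}{path}?{query}"


def verify(url: str, now: Optional[int] = None) -> bool:
    """Check the signature and expiry the way a store would on download."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    current = int(time.time()) if now is None else now
    if current > expires_at:
        return False
    return hmac.compare_digest(signature, _sign(parsed.path, expires_at))


# Hypothetical log path; the UI would hand this URL to the browser.
url = presign("https://logs.example.com", "/runs/abc/compute.log", expires_in=60, now=1000)
```

Because the signature covers the path and the expiry, the link works only for that one log file and only until it expires, which is what makes it suitable as a download link in the UI.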

I've tested it for Azure and S3 and it works.

danielgafni self-requested a review January 31, 2025 14:18

dpeng817 (Contributor) commented

Hey @ion-elgreco - I'd like to better understand the pain you were originally facing that led you to make this integration. Obstore seems like a really awesome technology, but it's also new, and it's yet another Python dependency. It scares me a bit to take a dependency on it in our AWS, Azure, and GCP integrations. AWS especially is super high traffic, and introducing a dependency on a relatively green package that might stop being updated at any time makes it, I think, an initial no-go to try to replace the existing functionality in any of Azure, AWS, or GCP with this.

I think to start we'll want to have this live in the community integrations repo, which I understand will make it more annoying to, say, deploy from the Helm chart, but unfortunately that might be the necessary tradeoff for now.

To make that part of things less painful, perhaps we could expose Helm chart config for custom compute log managers (i.e., you can specify the module and config arbitrarily for a custom compute log manager).
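
As a rough sketch of what "specify module and config arbitrarily" could mean in practice, the snippet below resolves a class from a module/class/config mapping and instantiates it. The helper `load_configured_class` is hypothetical, and a standard-library class stands in for a real compute log manager so the example runs on its own; it is an assumption about the shape of such a setting, not the Helm chart's or Dagster's actual loading code.

```python
import importlib
from typing import Any, Mapping


def load_configured_class(spec: Mapping[str, Any]) -> Any:
    """Import spec["module"], look up spec["class"], and call it with spec["config"]."""
    module = importlib.import_module(spec["module"])
    cls = getattr(module, spec["class"])
    return cls(**spec.get("config", {}))


# Stand-in values: a real deployment would name the module and class of a
# custom compute log manager here instead of collections.OrderedDict.
spec = {
    "module": "collections",
    "class": "OrderedDict",
    "config": {"bucket": "my-logs", "prefix": "compute"},
}
manager = load_configured_class(spec)
```

With a setting of this shape, a community-maintained compute log manager could be selected purely through configuration, as long as its package is installed in the image.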

Thoughts about that approach?

ion-elgreco (Contributor Author) commented Jan 31, 2025

I understand, but it's not entirely new. The Rust crate "object store" is battle-tested and production-grade, used in various databases, query engines, and so forth. The Python binding is a fairly minimal layer over this Rust crate.

A lack of updates wouldn't even be that big of an issue: blob APIs are stable, so in theory you could use the same version for years. Beyond that, the maintainer is very active, so I don't think this will happen. And if it does happen, you could fork it and manually bump the crate version to get the latest updates.

Tbh, having it be part of the community integrations would mean more overhead for me with no benefit, since it would take longer to push changes when needed.

My goal is rather to make this part of the core maintained libraries, optional in the Helm chart and already included in the Docker images. The obstore wheel itself is only 4 MB.
