feat: dagster-obstore #27450
base: master
Conversation
@dpeng817 would you mind taking a look at this? It will simplify the ComputeLogManagers a lot and won't require heavy dependencies. All of these stores can create pre-signed URLs, which provide the download links for the log files. I've tested it with Azure and S3, and it works.
Hey @ion-elgreco - I'd like to better understand the pain you were originally facing that led you to build this integration. Obstore seems like a really awesome technology, but it's also new, and it's yet another Python dependency. It scares me a bit to take a dependency on it in our AWS, Azure, and GCP integrations. AWS especially is super high traffic, and introducing a dependency on a relatively green package that might stop being updated at any time makes it, I think, an initial no-go to replace the existing functionality in any of Azure, AWS, or GCP with this. To start, I think we'll want this to live in the community integrations repo. I understand that will make it more annoying to, say, deploy from the helm chart, but unfortunately that might be the necessary tradeoff for now. To make that part less painful, perhaps we could expose helm chart config for custom compute log managers (i.e., you can specify an arbitrary module and config for a custom compute log manager). Thoughts about that approach?
I can understand, but it's not entirely new. The Rust crate "object store" is battle-tested and production-grade, used in various databases, query engines, and so forth. The Python binding is a fairly minimal layer over this Rust crate. Even if it stopped receiving updates, that wouldn't be a big issue: blob APIs are stable, so in theory you could use the same version for years. Beyond that, the maintainer is very active, so I don't think this will happen. And if it does happen, you could fork it and manually bump the crate version to get the latest updates. Tbh, having it be part of the community integrations would be more overhead for me with no benefit, since it would take longer to push changes when needed. My goal is rather to make this part of the core maintained libraries, have it be optional in the helm chart, and have it already included in the Docker images. The obstore wheel itself is only 4MB.
Summary & Motivation
Obstore is a Python binding for the Rust crate Object-store. It is much faster than boto3 and other equivalent implementations for interacting with cloud object stores.
It supports S3, Azure, and GCS from a single library.
How I Tested These Changes
I would need some help here. For AWS, I just copied the tests from dagster_aws. Is this fine?
I've manually tested the Azure and S3 integrations, and they work: logs show in the UI, and I can download the logs with presigned URLs.
Changelog