-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Adds support for scanning parquet from GCP #1056
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andyquinterom, I don't have any experience with interacting directly with cloud providers from R but this looks nice. I also can't compile on Windows for now, I have the same errors as in some of the CI. I'll explore a bit why.
In the meantime, I added a few minor comments. I would also like to know how we can test this. Is there a way to mock the connection to GCP and other services?
Thanks for taking the time to read my submission. I'll fix some details you commented, it was late last night and things could be improved. I'll check the compilation errors in windows as well to see if I find anything. On the mocking side, this functionality is fully on the Rust polars-io crate. I don't know if we could truly mock it or rely on the rust testing implementation. I will see how the Python layer is tested for an idea. |
@etiennebacher I just pushed some commits that should fix the issues you mentioned |
@etiennebacher It seems the CI fail algo happens on |
One failure is due to |
@etiennebacher |
@etiennebacher Seems like the CI issues are solved. |
I still see a similar error here: https://github.com/pola-rs/r-polars/actions/runs/8805357520/job/24168950149?pr=1056#step:9:335
Ideally we want to test our implementation as well. They take care of all the details but we want to ensure that our R binding is correct. I don't know whether they can mock this in python but if they do, it would be great to mimic it. |
I'm looking into the tests it seems like during testing an S3 bucket is created. This would make it pretty difficult to test unless we get some credential for creating and uploading files to S3. |
@etiennebacher it seems like I finally got it to link https://github.com/pola-rs/r-polars/actions/runs/8808439596/job/24177961258 |
Thanks, I can compile too. For testing, I think we can use some S3/GCS buckets that are publicly available, like those in this However, the following doesn't work for me and I don't know why: library(arrow)
arrow_with_gcs()
#> [1] TRUE
arrow::read_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet") |>
head()
#> carat color clarity depth table price x y z
#> 1 0.23 E VS1 56.9 65 327 4.05 4.07 2.31
#> 2 0.31 J SI2 63.3 58 335 4.34 4.35 2.75
#> 3 0.30 J SI1 64.0 55 339 4.25 4.28 2.73
#> 4 0.30 J SI1 63.4 54 351 4.23 4.29 2.70
#> 5 0.30 J SI1 63.8 56 351 4.23 4.26 2.71
#> 6 0.30 I SI2 63.3 56 351 4.26 4.30 2.71
pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")
#> Error: Execution halted with the following contexts
#> 0: In R: in pl$scan_parquet():
#> 0: During function call [pl$scan_parquet("gs://anonymous@voltrondata-labs-datasets/diamonds/cut=Good/part-0.parquet")]
#> 1: Encountered the following error in Rust-Polars:
#> Generic GCS error: Error performing token request: Error after 2 retries in 3.3478343s, max_retries:2, retry_timeout:10s,
#> source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken):
#> error trying to connect: tcp connect error: deadline has elapsed @eitsupi do you have some ideas about this? |
I think Polars does not support such addresses, because I get an error when I try that string in Python Polars. |
This also happens on the Rust side, if we don't have a Service Account given it will error out since it doesn't know where to look for the metadata of the bucket. GCP buckets seem to not be tested in the Python side. The only way to fix this "issue" would be to run the GCP test suite on GCP servers or use a service account. |
I know of a server product that simulates such a GCS, but I don't know if polars can connect to this (and if it is even necessary to test with it). At the moment I do not have the time to test this. |
If I provide a service account it works with the dataset in the arrow vignette.
I guess I could try to create a fake service account that is enough to have the Rust side of things to the request as it needs. I will try to look into the Rust code to see if there are improvements to be made however, that would take quite some time to dig into. |
@etiennebacher I would also be willing to create a dummy service account, but I am not very confident that's outside of what the project would be comfortable doing due to possible security concerns. |
How does the
I don't think that's the way to go. I suppose the ideal solution would be for rust-polars to support this kind of address but I don't know if they plan to do it (and I won't request this there because I never use this type of service). For now, let's just mention that this argument is experimental but eventually we'll need some way to test this. I'll review more properly tomorrow but you can already rename this arg from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second pass of comments. Can you also add a bullet point in NEWS that mentions this is experimental?
@etiennebacher Just pushed the changes. I tried to stay almost as similar as possible to the python documentation. Tell me if you need anything more done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I just added a news item. Actually I see this PR adds this arg only for pl$scan_parquet()
, but it should also be in pl$read_parquet()
. Can you add it too? Sorry, I forgot that earlier
@etiennebacher I did not change the |
@andyquinterom I added the arg in read_parquet so that I can merge once the ci passes. Many thanks for your contribution! |
@etiennebacher I just noticed your comment. I had just pushed the changes as well :/ Tell me if you want me to revert the changes or if they are ok as they are? |
I'll wrap up the PR, you can let it as is (the details section is missing in the docs of read_parquet) |
Related to #1054
This should support GCP parquet scanning.
I tested this snippet and it works perfectly.