feat: Adds support for scanning parquet from GCP #1056

Merged · 16 commits · Apr 26, 2024
3 changes: 3 additions & 0 deletions NEWS.md
@@ -5,6 +5,9 @@
### New features

- `$cut()` and `$qcut()` to bin continuous values into discrete categories (#1057).
+- `pl$scan_parquet()` and `pl$read_parquet()` gain an argument `storage_options`
+  to scan/read data via cloud storage providers (GCP, AWS, Azure). Note that this
+  support is experimental (#1056, @andyquinterom).

### Bug fixes

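As a usage sketch of the entry above (not taken from the PR itself): the bucket, region, and credentials below are placeholders, the key names follow the `object_store` AmazonS3ConfigKey strings linked in the roxygen docs further down, and `storage_options` is assumed to accept a named list of strings.

```r
library(polars)

# Hypothetical S3 scan; every value below is a placeholder.
lf = pl$scan_parquet(
  "s3://my-bucket/data.parquet",
  storage_options = list(
    aws_access_key_id = Sys.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
    aws_region = "us-east-1"
  )
)
df = lf$collect()
```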
2 changes: 1 addition & 1 deletion R/extendr-wrappers.R
@@ -102,7 +102,7 @@ import_arrow_ipc <- function(path, n_rows, cache, rechunk, row_name, row_index,

new_from_ndjson <- function(path, infer_schema_length, batch_size, n_rows, low_memory, rechunk, row_index_name, row_index_offset, ignore_errors) .Call(wrap__new_from_ndjson, path, infer_schema_length, batch_size, n_rows, low_memory, rechunk, row_index_name, row_index_offset, ignore_errors)

-new_from_parquet <- function(path, n_rows, cache, parallel, rechunk, row_name, row_index, use_statistics, low_memory, hive_partitioning) .Call(wrap__new_from_parquet, path, n_rows, cache, parallel, rechunk, row_name, row_index, use_statistics, low_memory, hive_partitioning)
+new_from_parquet <- function(path, n_rows, cache, parallel, rechunk, row_name, row_index, storage_options, use_statistics, low_memory, hive_partitioning) .Call(wrap__new_from_parquet, path, n_rows, cache, parallel, rechunk, row_name, row_index, storage_options, use_statistics, low_memory, hive_partitioning)

test_rpolarserr <- function() .Call(wrap__test_rpolarserr)

5 changes: 3 additions & 2 deletions R/io_csv.R
@@ -216,8 +216,9 @@ check_is_link = function(path, reuse_downloaded, raise_error = FALSE) {
# try download file if valid url
if (!is.null(con)) {
close(con)
-if (is.null(cache_temp_file[[actual_url]]))
-  cache_temp_file[[actual_url]] <- tempfile()
+if (is.null(cache_temp_file[[actual_url]])) {
+  cache_temp_file[[actual_url]] = tempfile()
+}
if (isFALSE(reuse_downloaded) || isFALSE(file.exists(cache_temp_file[[actual_url]]))) {
download.file(url = actual_url, destfile = cache_temp_file[[actual_url]])
message(paste("tmp file placed in \n", cache_temp_file[[actual_url]]))
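For context, the helper touched above memoizes one tempfile per URL so repeated reads can reuse an earlier download. A self-contained sketch of that pattern (`download_once` is an invented name, not the package's actual helper):

```r
# Per-URL tempfile cache, mirroring the hunk above.
cache_temp_file = new.env(parent = emptyenv())

download_once = function(actual_url, reuse_downloaded = TRUE) {
  # Allocate a stable tempfile for this URL on first use.
  if (is.null(cache_temp_file[[actual_url]])) {
    cache_temp_file[[actual_url]] = tempfile()
  }
  # Download when reuse is disabled or the cached copy is gone.
  if (isFALSE(reuse_downloaded) || !file.exists(cache_temp_file[[actual_url]])) {
    download.file(url = actual_url, destfile = cache_temp_file[[actual_url]])
  }
  cache_temp_file[[actual_url]]
}
```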
28 changes: 26 additions & 2 deletions R/io_parquet.R
@@ -14,7 +14,28 @@
#' and use them to prune reads.
#' @param use_statistics Use statistics in the parquet file to determine if pages
#' can be skipped from reading.
+#' @param storage_options Experimental. List of options necessary to scan
+#' parquet files from different cloud storage providers (GCP, AWS, Azure).
+#' See the 'Details' section.
#' @rdname IO_scan_parquet
+#' @details
+#' ## Connecting to cloud providers
+#'
+#' Polars supports scanning parquet files from different cloud providers.
+#' The cloud providers currently supported are AWS, GCP, and Azure.
+#' The supported keys to pass to the `storage_options` argument can be found
+#' here:
+#'
+#' - [aws](https://docs.rs/object_store/latest/object_store/aws/enum.AmazonS3ConfigKey.html)
+#' - [gcp](https://docs.rs/object_store/latest/object_store/gcp/enum.GoogleConfigKey.html)
+#' - [azure](https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html)
+#'
+#' ### Implementation details
+#'
+#' - Currently it is impossible to scan public parquet files from GCP without
+#' a valid service account. Be sure to always include a service account in the
+#' `storage_options` argument.
+#'
#' @examplesIf requireNamespace("arrow", quietly = TRUE) && arrow::arrow_with_dataset() && arrow::arrow_with_parquet()
#' temp_dir = tempfile()
#' # Write a hive-style partitioned parquet dataset
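Per the implementation note in the details section above, a GCP scan must always carry a service account. A minimal sketch, assuming `storage_options` accepts a named list and using the `service_account` key from the GoogleConfigKey page linked above (bucket and key path are placeholders):

```r
lf = pl$scan_parquet(
  "gs://my-bucket/data.parquet",
  storage_options = list(
    service_account = "/path/to/service-account.json" # placeholder path
  )
)
```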
@@ -46,6 +67,7 @@ pl_scan_parquet = function(
hive_partitioning = TRUE,
rechunk = FALSE,
low_memory = FALSE,
+storage_options = NULL,
use_statistics = TRUE,
cache = TRUE) {
new_from_parquet(
@@ -58,15 +80,16 @@
row_index = row_index_offset,
low_memory = low_memory,
use_statistics = use_statistics,
-hive_partitioning = hive_partitioning
+hive_partitioning = hive_partitioning,
+storage_options = storage_options
) |>
unwrap("in pl$scan_parquet():")
}

#' Read a parquet file
#' @rdname IO_read_parquet
#' @inherit pl_read_csv return
-#' @inheritParams pl_scan_parquet
+#' @inherit pl_scan_parquet params details
#' @examplesIf requireNamespace("arrow", quietly = TRUE) && arrow::arrow_with_dataset() && arrow::arrow_with_parquet()
#' temp_dir = tempfile()
#' # Write a hive-style partitioned parquet dataset
@@ -98,6 +121,7 @@ pl_read_parquet = function(
hive_partitioning = TRUE,
rechunk = TRUE,
low_memory = FALSE,
+storage_options = NULL,
use_statistics = TRUE,
cache = TRUE) {
.args = as.list(environment())
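The `.args = as.list(environment())` line suggests the eager reader captures its arguments and forwards them to the lazy scanner before collecting; presumably the two entry points relate roughly like this (paths and credentials are placeholders):

```r
creds = list(service_account = "/path/to/service-account.json") # placeholder

# Lazy: build a query plan against the remote file, collect when ready.
lf = pl$scan_parquet("gs://my-bucket/data.parquet", storage_options = creds)
df_lazy = lf$collect()

# Eager: the same result in one call.
df_eager = pl$read_parquet("gs://my-bucket/data.parquet", storage_options = creds)
```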
27 changes: 27 additions & 0 deletions man/IO_read_parquet.Rd

(Generated file; diff not rendered by default.)

27 changes: 27 additions & 0 deletions man/IO_scan_parquet.Rd

(Generated file; diff not rendered by default.)

2 changes: 1 addition & 1 deletion src/Makevars.win
@@ -6,7 +6,7 @@ LIBNAME = libr_polars.a
TARGET_DIR = $(CURDIR)/rust/target
LIBDIR = $(TARGET_DIR)/$(TARGET)/$(LIBR_POLARS_PROFILE)
STATLIB = $(LIBDIR)/$(LIBNAME)
-PKG_LIBS = -L$(LIBDIR) -lr_polars -lws2_32 -ladvapi32 -luserenv -lbcrypt -lole32 -lntdll -lpsapi -liphlpapi -lpdh -lpowrprof -loleaut32 -lnetapi32 -lsecur32 -lsynchronization -t
+PKG_LIBS = -L$(LIBDIR) -lr_polars -lws2_32 -lncrypt -lcrypt32 -ladvapi32 -luserenv -lbcrypt -lole32 -lntdll -lpsapi -liphlpapi -lpdh -lpowrprof -loleaut32 -lnetapi32 -lsecur32 -lsynchronization -t

# Rtools42 doesn't have the linker in the location that cargo expects, so we
# need to overwrite it via configuration.
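The two link flags added here, `-lncrypt` and `-lcrypt32`, are Windows cryptography system libraries; presumably the new cloud-storage dependencies pull in a TLS stack that needs them when linking on Windows.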