Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spatial joins #16

Merged
merged 3 commits into from
Dec 9, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/R-CMD-check.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ jobs:
- {os: windows-latest, r: 'release'}
- {os: ubuntu-latest, r: 'devel', http-user-agent: 'release'}
- {os: ubuntu-latest, r: 'release'}
- {os: ubuntu-latest, r: 'oldrel-1'}
# - {os: ubuntu-latest, r: 'oldrel-1'}

env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
Expand Down
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,8 @@ Imports:
DBI,
dbplyr,
dplyr,
duckdb (>= 0.8.1)
duckdb (>= 0.8.1),
fs
Suggests:
curl,
sf,
Expand Down
2 changes: 2 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
# Generated by roxygen2: do not edit by hand

export(as_view)
export(cached_connection)
export(close_connection)
export(duckdb_s3_config)
export(load_spatial)
export(open_dataset)
export(spatial_join)
export(to_sf)
export(write_dataset)
9 changes: 7 additions & 2 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,14 @@
# duckdbfs 0.0.4

* `open_dataset()` gains option `sf` to format, allowing users to parse
spatial vector data in simple features standard (objects read by `sf`)
* `open_dataset()` gains the ability to read spatial vector data formats
(objects read by `sf`) using `format="sf"`
* default geometry column in `to_sf()` is now termed `geom`, to match the default
used in `duckdb`'s `st_read()` function.
* `open_dataset()` now tries to guess the data format instead of defaulting to
parquet when no format is explicitly provided.

* new function, `spatial_join()`, allows a variety of spatial joins.
* new helper function, `as_view()`, creates a temporary view of a query.

# duckdbfs 0.0.3

Expand Down
77 changes: 56 additions & 21 deletions R/open_dataset.R
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,9 @@
#' column name across all files (NOTE: this can add considerably to
#' the initial execution time)
#' @param format The format of the dataset files. One of `"parquet"`, `"csv"`,
#' `"tsv"`, `"text"` or `"sf"` (for any Simple Features spatial dataset supported
#' by the sf package).
#' `"tsv"`, or `"sf"` (spatial vector files supported by the sf package / GDAL).
#' if no argument is provided, the function will try to guess the type based
#' on minimal heuristics.
#' @param conn A connection to a database.
#' @param tblname The name of the table to create in the database.
#' @param mode The mode to create the table in. One of `"VIEW"` or `"TABLE"`.
Expand Down Expand Up @@ -60,14 +61,16 @@ open_dataset <- function(sources,
schema = NULL,
hive_style = TRUE,
unify_schemas = FALSE,
format = c("parquet", "csv", "tsv", "text", "sf"),
format = c("parquet", "csv", "tsv", "sf"),
conn = cached_connection(),
tblname = tmp_tbl_name(),
mode = "VIEW",
filename = FALSE,
recursive = TRUE,
...) {

format <- select_format(sources, format)

sources <- parse_uri(sources, conn = conn, recursive = recursive)

if(length(list(...)) > 0) { # can also be specified in URI query notation
Expand All @@ -77,8 +80,6 @@ open_dataset <- function(sources,
# ensure active connection
version <- DBI::dbExecute(conn, "PRAGMA version;")


format <- match.arg(format)
if(format == "sf") {
load_spatial(conn = conn)
}
Expand All @@ -95,6 +96,46 @@ open_dataset <- function(sources,
dplyr::tbl(conn, tblname)
}

select_format <- function(sources, format) {
## does not guess file types in s3 buckets.

if(length(format) == 1) {
return(format)
}

# format for vector sources always based on first element
sources <- sources[[1]]

# default to parquet for S3 addresses
if(grepl("^s3://", sources)) {
return("parquet")
}

if( fs::is_dir(sources) ) {
sources <- fs::dir_ls(sources, recurse = TRUE, type="file")
sources <- sources[[1]]
}
format <- tools::file_ext(sources)

# detect spatial types
if(grepl("^/vsi", sources)) {
return("sf")
}
if(format %in% c("fgb", "shp", "json", "geojson", "gdb", "gpkg",
"kml", "gmt")) {
return("sf")
}


# default
if (format == "") {
return("parquet")
}

format
}


use_recursive <- function(sources) {
!all(identical(tools::file_ext(sources), ""))
}
Expand All @@ -112,22 +153,21 @@ query_string <- function(tblname,
union_by_name = FALSE,
filename = FALSE) {

format <- match.arg(format)
# format <- match.arg(format)
scanner <- switch(format,
"parquet" = "parquet_scan(",
"csv" = "read_csv_auto(",
"sf" = "st_read(",
"read_csv_auto("
)

source_uris <- vec_as_str(sources)

## Allow overwrites on VIEW
mode <- switch(mode,
"VIEW" = "OR REPLACE TEMPORARY VIEW",
"TABLE" = "TABLE")

scanner <- switch(format,
"parquet" = "parquet_scan(",
"csv" = "read_csv_auto(",
"tsv" = "read_csv_auto(",
"text" = "read_csv_auto(",
"sf" = "st_read("
)

tabular_options <- paste0(
", HIVE_PARTITIONING=",hive_partitioning,
", UNION_BY_NAME=",union_by_name,
Expand All @@ -136,12 +176,9 @@ query_string <- function(tblname,
options <- switch(format,
"parquet" = tabular_options,
"csv" = tabular_options,
"tsv" = tabular_options,
"text" = tabular_options,
"sf" = ""
"sf" = "",
tabular_options
)


paste0(
paste("CREATE", mode, tblname, "AS SELECT * FROM "),
paste0(scanner, source_uris, options,
Expand All @@ -152,8 +189,6 @@ query_string <- function(tblname,
tmp_tbl_name <- function(n = 15) {
paste0(sample(letters, n, replace = TRUE), collapse = "")
}


remote_src <- function(conn) {
dbplyr::remote_src(conn)
}
129 changes: 129 additions & 0 deletions R/spatial_join.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,129 @@
#' spatial_join
#'
#' @param x a duckdb table with a spatial geometry column called "geom"
#' @param y a duckdb table with a spatial geometry column called "geom"
#' @param by A spatial join function, see details.
#' @param join JOIN type (left, right, inner, full)
#' @param args additional arguments to join function (e.g. distance for st_dwithin)
#' @param tblname name for the temporary view
#' @param conn the duckdb connection (imputed by duckdbfs by default,
#' must be shared across both tables)
#' @return a (lazy) view of the resulting table. Users can continue to operate
#' on using dplyr operations and call to_st() to collect this as an sf object.
#' @details
#'
#' Possible [spatial joins](https://postgis.net/workshops/postgis-intro/spatial_relationships.html) include:
#'
#' Function | Description
#' -------------------- | --------------------------------------------------------------------------------------------
#' st_intersects | Geometry A intersects with geometry B
#' st_disjoint | The complement of intersects
#' st_within | Geometry A is within geometry B (complement of contains)
#' st_dwithin | Geometries are within a specified distance, expressed in the same units as the coordinate reference system.
#' st_touches | Two polygons touch if the that have at least one point in common, even if their interiors do not touch.
#' st_contains | Geometry A entirely contains to geometry B. (complement of within)
#' st_containsproperly | stricter version of `st_contains` (boundary counts as external)
#' st_covers | geometry B is inside or on boundary of A. (A polygon covers a point on its boundary but does not contain it.)
#' st_overlaps | geometry A intersects but does not completely contain geometry B
#' st_equals | geometry A is equal to geometry B
#' st_crosses | Lines or points in geometry A cross geometry B.
#'
#' All though SQL is not case sensitive, this function expects only
#' lower case names for "by" functions.
#'
#' @examplesIf interactive()
#'
#' # note we can read in remote data in a variety of vector formats:
#' countries <-
#' paste0("/vsicurl/",
#' "https://github.com/cboettig/duckdbfs/",
#' "raw/spatial-read/inst/extdata/world.gpkg") |>
#' open_dataset(format = "sf")
#'
#' cities <-
#' paste0("/vsicurl/https://github.com/cboettig/duckdbfs/raw/",
#' "spatial-read/inst/extdata/metro.fgb") |>
#' open_dataset(format = "sf")
#'
#' countries |>
#' dplyr::filter(iso_a3 == "AUS") |>
#' spatial_join(cities)
#'
#' @export
spatial_join <- function(x,
y,
by=c("st_intersects", "st_within",
"st_dwithin", "st_touches",
"st_contains", "st_containsproperly",
"st_covers", "st_overlaps",
"st_crosses", "st_equals",
"st_disjoint"),
args = "",
join="left",
tblname = tmp_tbl_name(),
conn = cached_connection()) {

by <- match.arg(by)
## x,y may be promised queries
x <- as_view(x)
y <- as_view(y)

# buil spatial join query
x.name <- remote_name(x, conn)
y.name <- remote_name(y, conn)
x.geom <- paste0(x.name, ".geom")
y.geom <- paste0(y.name, ".geom")

if(args != ""){
args <- paste(",", args)
}

# be more careful than SELECT *

# x.geom becomes the "geom" column, y.geom becomes geom:1
query <- paste(
"SELECT *",
"FROM", x.name,
join, "JOIN", y.name,
"ON", paste0(by, "(", x.geom, ", ", y.geom, args, ")")
)
query_to_view(query, tblname, conn)

}


#' as_view
#'
#' Create a View of the current query. This can be an effective way to allow
#' a query chain to remain lazy
#' @param x a duckdb spatial dataset
#' @inheritParams open_dataset
#' @examplesIf interactive()
#' path <- system.file("extdata/spatial-test.csv", package="duckdbfs")
#' df <- open_dataset(path)
#' library(dplyr)
#'
#' df |> filter(latitude > 5) |> as_view()
#'
#' @export
as_view <- function(x, tblname = tmp_tbl_name(), conn = cached_connection()) {

# assert x is a tbl_lazy, a tbl_sql, and a tbl_duckdb_connection

## lazy_base_query objects are good to go.
if(inherits(x$lazy_query, "lazy_base_query")) {
return(x)
}
## lazy_select_query objects are unnamed,
## convert to named views so we can re-use them in queries
q <- dbplyr::sql_render(x)
query_to_view(q, tblname, conn)
}

query_to_view <- function(query,
tblname = tmp_tbl_name(),
conn = cached_connection()) {
q <- paste("CREATE OR REPLACE TEMPORARY VIEW", tblname, "AS", query)
DBI::dbSendQuery(conn, q)
dplyr::tbl(conn, tblname)
}
2 changes: 0 additions & 2 deletions R/write_dataset.R
Original file line number Diff line number Diff line change
Expand Up @@ -32,9 +32,7 @@ write_dataset <- function(dataset,
DBI::dbWriteTable(conn, name = tblname, value = dataset)

} else {

tblname <- as.character(remote_name(dataset, conn))

}

path <- parse_uri(path, conn = conn, recursive = FALSE)
Expand Down
36 changes: 36 additions & 0 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,42 @@ sf_obj <- countries |> filter(continent == "Africa") |> to_sf()
plot(sf_obj["name"])
```

## Spatial joins

One very common operation are spatial joins, which can be a very powerful way to subset large data.
For instance, we can return all points (cities) within a set of polygons

```{r}
cities <-
paste0("/vsicurl/https://github.com/cboettig/duckdbfs/raw/",
"spatial-read/inst/extdata/metro.fgb") |>
open_dataset(format = "sf")

countries |>
dplyr::filter(continent == "Oceania") |>
spatial_join(cities, by = "st_intersects", join="inner") |>
select(name_long, sovereignt, pop2020)

```


Possible [spatial joins](https://postgis.net/workshops/postgis-intro/spatial_relationships.html) include:

Function | Description
-------------------- | --------------------------------------------------------------------------------------------
st_intersects | Geometry A intersects with geometry B
st_disjoint | The complement of intersects
st_within | Geometry A is within geometry B (complement of contains)
st_dwithin | Geometries are within a specified distance, expressed in the same units as the coordinate reference system.
st_touches | Two polygons touch if the that have at least one point in common, even if their interiors do not touch.
st_contains | Geometry A entirely contains to geometry B. (complement of within)
st_containsproperly | stricter version of `st_contains` (boundary counts as external)
st_covers | geometry B is inside or on boundary of A. (A polygon covers a point on its boundary but does not contain it.)
st_overlaps | geometry A intersects but does not completely contain geometry B
st_equals | geometry A is equal to geometry B
st_crosses | Lines or points in geometry A cross geometry B.

Note that while SQL functions are not case-sensitive, `spatial_join` expects lower-case names.

## Writing datasets

Expand Down
Loading