Replies: 8 comments 14 replies
-
That depends on the circumstances... In the general case, no, but a special case was added in GDAL 3.8.0 to use the bbox from Overture Maps to skip row groups: OSGeo/gdal@c2ffc70
-
As I mentioned in yesterday's meeting, I made a dataset that might be useful for testing here. I put it up on source.coop: https://beta.source.coop/repositories/jwasserman/geoparquet-spatial-query-testing/description/. It is the November Overture Places release, split into spatially separated files, with a bounding box column in both the nested struct format and as top-level columns. A few more details are on the source.coop page. It also gives a nice demonstration of duckdb's lack of support for predicate pushdown in nested columns - which we already knew, but it is nice to see with real results on a real-world dataset. These are two otherwise identical remote queries for a small region near the Boston area: the first filters on the nested bbox struct, the second on the top-level columns, and the difference in data scanned shows the effect of row-group filtering just by switching which columns the predicate references.
Try it yourself. I'm hoping this dataset is useful for testing the other query engines and seeing how they perform reading the struct or the top-level columns.
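For anyone who wants to try a similar comparison, here is a minimal sketch in duckdb SQL. The file URL and the top-level column names (minx/miny/maxx/maxy outside the struct) are placeholders, not the actual names in the dataset:

```sql
-- Sketch only: the URL and the top-level column names are placeholders.
INSTALL httpfs;
LOAD httpfs;

-- Predicate on the nested bbox struct (no row-group skipping before duckdb 0.10.0).
SELECT count(*)
FROM read_parquet('https://example.com/overture-places-part.parquet')
WHERE bbox.minx > -71.17 AND bbox.miny > 42.32
  AND bbox.maxx < -71.05 AND bbox.maxy < 42.41;

-- Same predicate on hypothetical top-level columns, where row groups can be
-- skipped using the column statistics.
SELECT count(*)
FROM read_parquet('https://example.com/overture-places-part.parquet')
WHERE minx > -71.17 AND miny > 42.32
  AND maxx < -71.05 AND maxy < 42.41;
```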
-
I'll look for more explicit verification on Athena, but it looks like it is supported. I'm not an expert at interpreting these query plans, but it scans about 29 MB, which is in a similar range to the pushdown-supported duckdb query and far less than the ~2 GB a full scan of the bbox columns would read.
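If it helps anyone reproduce this, a rough sketch of one way to look at the plan and per-operator input sizes in Athena; 'overture_places_test' is a placeholder for whatever external table is registered over the files:

```sql
-- Sketch only: 'overture_places_test' is a placeholder external table
-- over the GeoParquet files.
EXPLAIN ANALYZE
SELECT count(*)
FROM overture_places_test
WHERE bbox.minx > -71.17 AND bbox.miny > 42.32
  AND bbox.maxx < -71.05 AND bbox.maxy < 42.41;
```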
-
@jwass We confirmed that Spark/Databricks supports predicate pushdown on the nested bbox struct fields.
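For reference, a minimal sketch of the kind of Spark SQL query this covers, reading the Parquet files directly; the path is a placeholder:

```sql
-- Sketch only: the path is a placeholder for wherever the test files live.
SELECT count(*)
FROM parquet.`s3://example-bucket/overture-places/*.parquet`
WHERE bbox.minx > -71.17 AND bbox.miny > 42.32
  AND bbox.maxx < -71.05 AND bbox.maxy < 42.41;
```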
-
Just checked on BigQuery, and it seems predicate pushdown works in this context: select count(1), 1 as cachebuster from cartobq.maps.overtureplaces
-- When using this filter it read 1.7 GB of data from GCS:
where bbox.minx > -180 and bbox.miny > -90 and bbox.maxx < 180 and bbox.maxy < 90
-- When using this filter it read 15 MB:
where bbox.minx > -71.166345 and bbox.miny > 42.325606 and bbox.maxx < -71.050209 and bbox.maxy < 42.40206;
-
Yes, I just copied it to Google Cloud Storage.
…On Thu, Jan 18, 2024 at 3:05 PM Jacob Wasserman wrote:
Were the files you were using (cartobq.maps.overtureplaces) the data I posted on source.coop (from #192 (comment))?
-
According to my communication with the Snowflake team, Snowflake supports filter pushdown on grouped columns of Parquet files, but the support is only partial and comes with some limitations; for example, the column extract needs to be cast, e.g. t.value:bbox:minX::double > 41.123.
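To illustrate that cast requirement, a minimal sketch assuming a hypothetical external table whose VALUE variant column exposes each Parquet record:

```sql
-- Sketch only: 'overture_places_ext' is a hypothetical external table over the
-- Parquet files; VALUE is the variant column Snowflake exposes for external tables.
SELECT count(*)
FROM overture_places_ext t
WHERE t.value:bbox:minx::double > -71.17
  AND t.value:bbox:miny::double > 42.32
  AND t.value:bbox:maxx::double < -71.05
  AND t.value:bbox:maxy::double < 42.41;
```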
-
FYI, Apache Impala supports using Parquet row-group and page-level min/max stats for predicates on struct members (the detailed query plan can be inspected with set explain_level=2;). One issue, though, is that at the moment it cannot write Parquet files with complex types, so using separate double columns would be more convenient.
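A rough sketch of what checking this in impala-shell might look like; 'overture_places' is a placeholder table with a bbox struct column:

```sql
-- Sketch only: 'overture_places' is a placeholder table with a bbox struct column.
set explain_level=2;
explain
select count(*)
from overture_places
where bbox.minx > -71.17 and bbox.miny > 42.32
  and bbox.maxx < -71.05 and bbox.maxy < 42.41;
```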
-
To help drive the decision of the bounding box specification pull request and discussion, we want to survey existing parquet readers and query engines to understand whether they support predicate pushdown of the nested bbox definition fields (struct with member minx/miny fields).
"Predicate pushdown" is a loaded phrase that could mean different things in different contexts. For the purposes of this discussion we'll define it as the ability to use the row group summary statistics to filter out whole row groups when the proposed bbox group/struct definition is used in the filter predicate. e.g.
SELECT * FROM parquet_backed_table WHERE bbox.xmin > -71
Please add more information or other missing systems in the discussion comments. I'll keep the list here updated as we get more information. For the ones that are "no", we might want to understand the level of effort to gain support.
duckdb - Yes (as of 0.10.0) (@Maxxen)
#191 (comment)
#192 (comment)
When the PR referenced in #191 (comment) lands, this should be supported in the next duckdb release.
gdal - Yes as of 3.8.0 (@rouault)
Per comment at the GeoParquet meetup yesterday, gdal properly filters on nested fields, but does not skip whole row groups in the general case. Support for the Overture bbox definition (to skip row groups) is included as of 3.8.0. #192 (comment)
arrow-cpp - Yes as of 15.0 (@jorisvandenbossche)
Gains support when apache/arrow#39065 is merged. (Update: Merged)
As of Arrow 15.0 per discussion at GeoParquet meetup
Presto / Trino / Athena - Yes (@jwasserman)
#192 (comment)
Spark / Databricks - Yes (@jiayuasu)
Per comment #192 (comment)
Sedona - Yes as of 1.5.1 (@jiayuasu)
Per comment #192 (comment). Should be supported in the next release.
Supported as of Sedona 1.5.1
Snowflake - Yes (@jiayuasu)
#192 (comment)
The column extract needs to be cast as e.g.
t.value:bbox:minX::double > 41.123
BigQuery - Yes (@jatorre)
Comment: #192 (comment)
Impala - Yes (@csringhofer)
Comment: #192 (comment)
Supported on read. Impala cannot yet write complex/nested types.
Synapse - ??
DataFusion - No (@kylebarron)
#191 (comment)
cudf - No (@paleolimbot)
#191 (comment)