Geoparquet for STAC items #114
-
Love it! I do think it'd be good to properly specify the mapping from STAC JSON to GeoParquet. Way back at the beginning we had a notion of an 'abstract spec' (see https://github.com/radiantearth/stac-spec), with the idea that we'd have instantiations other than JSON. It was overkill, so we abandoned it. But it could be nice to have one place where people can see how the two translate. I could also see things like stactools being able to output GeoParquet (perhaps just by including your library, but giving it a bit more visibility).
-
This is awesome! Thanks for working on this!

### Spatial Partitioning

For a dataset like Sentinel-2, my first thought is to agree with you that the first level of partitioning should be time-based, because it's hard to append to a Parquet dataset when the primary sort is spatial. It looks like you're currently partitioning by week? I can't remember how many items that is per partition, but it could be worthwhile to apply a secondary spatial sort to the data within each partition (one possible approach is sketched below)? In that case it would also be worthwhile to bring
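A secondary spatial sort within each time partition might look something like this. This is a minimal sketch (the `morton_key` helper is illustrative, not part of any library); dask-geopandas also ships a `hilbert_distance` method that computes a similar space-filling-curve key.

```python
import numpy as np
import geopandas


def morton_key(x: float, y: float, bits: int = 16) -> int:
    """Interleave the bits of normalized lon/lat into a Z-order (Morton) key."""
    # Assumes geographic coordinates: lon in [-180, 180], lat in [-90, 90].
    xi = int((x + 180.0) / 360.0 * (2**bits - 1))
    yi = int((y + 90.0) / 180.0 * (2**bits - 1))
    key = 0
    for i in range(bits):
        key |= ((xi >> i) & 1) << (2 * i) | ((yi >> i) & 1) << (2 * i + 1)
    return key


def spatially_sort(gdf: geopandas.GeoDataFrame) -> geopandas.GeoDataFrame:
    # Sort the rows of one time partition by the Morton key of each item's
    # centroid, so spatially close items land in nearby row groups.
    centroids = gdf.geometry.centroid
    keys = np.array([morton_key(pt.x, pt.y) for pt in centroids])
    return gdf.iloc[np.argsort(keys)]
```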
-
Interesting idea! I'm wondering why you've chosen to include highly repetitive columns such as `type`? Is the overhead negligible due to good compression?
-
This is amazing, Tom! I am wondering if there's any tool to do interoperable operations between the Parquet rows and
-
Hi all,
I wanted to share how the Planetary Computer is using GeoParquet to support bulk operations on STAC items. The documentation and examples are at https://planetarycomputer.microsoft.com/docs/quickstarts/stac-geoparquet/, but I'll pull out a few highlights here.
The primary motivation is to support workloads that need to return a ton of STAC items. For this kind of bulk workload, paging through a REST API might be too slow, so for these use cases we direct users to the Parquet files instead.
We have one GeoParquet dataset per STAC collection. These STAC collections now include a "geoparquet-items" collection-level asset (e.g. https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a).
This links to the root of the Parquet dataset in Azure Blob Storage. For example:
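As a rough sketch of how a client might consume that asset (the `table:storage_options` key and the use of `dask_geopandas` here are assumptions for illustration, not a documented contract):

```python
import dask_geopandas
import pystac

# Load the collection metadata, then grab the collection-level Parquet asset.
collection = pystac.Collection.from_file(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a"
)
asset = collection.assets["geoparquet-items"]

# The asset's href points at the root of the (possibly partitioned) Parquet
# dataset in Azure Blob Storage; fsspec storage options ride along with it.
df = dask_geopandas.read_parquet(
    asset.href,
    storage_options=asset.extra_fields.get("table:storage_options", {}),
)
```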
### Columns / schemas
We chose to elevate all the keys under the item's `.properties` to top-level columns. This matches the behavior of `geopandas.GeoDataFrame.from_features`. Each collection has its own Parquet schema, reflecting the unique properties available on each collection's items.
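For instance, with a toy item (just to show the flattening):

```python
import geopandas

item = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [10.0, 50.0]},
    "properties": {"datetime": "2021-01-01T00:00:00Z", "eo:cloud_cover": 5.0},
}

gdf = geopandas.GeoDataFrame.from_features([item])
print(list(gdf.columns))
# ['geometry', 'datetime', 'eo:cloud_cover'] -- each property becomes its own column
```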
### Partitioning
Some datasets (like Sentinel-2 L2A) have a very large number of STAC items. Putting all of those into a single Parquet file would make it awkward to work with and update, so we chose to partition large collections by time into Parquet datasets with many files.
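For the write side, here is a minimal sketch of time-based partitioning using plain pyarrow (the `year`/`week` columns and paths are illustrative; the real pipeline writes GeoParquet with geometry metadata):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table with partition columns derived from each item's datetime.
table = pa.table({
    "id": ["item-001", "item-002", "item-003"],
    "year": [2021, 2021, 2021],
    "week": [1, 1, 2],
})

# Writes items/sentinel-2-l2a/year=2021/week=1/... and .../week=2/...
pq.write_to_dataset(
    table, root_path="items/sentinel-2-l2a", partition_cols=["year", "week"]
)
```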
For now, we've put the partition information under an `msft:` prefix. I'm hoping we can standardize that as a formal STAC extension.

### Parquet dataset generation
I have a small library at https://github.com/TomAugspurger/stac-geoparquet for generating these Parquet datasets. It's pretty rough right now, but there are some building blocks that might be useful for others. In particular, there's logic for injecting dynamic assets and links that might be set by an API, and for translating the internal storage format used by pgstac to a valid STAC item. There are a few Planetary Computer-specific assumptions made right now, but those should be generalizable without too much effort.
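A minimal usage sketch, assuming the `to_geodataframe` helper described in that repository's README:

```python
import stac_geoparquet

# A single toy STAC item; real inputs would come from a STAC API or pgstac.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "geometry": {"type": "Point", "coordinates": [10.0, 50.0]},
    "bbox": [10.0, 50.0, 10.0, 50.0],
    "properties": {"datetime": "2021-01-01T00:00:00Z"},
    "assets": {},
    "links": [],
}

df = stac_geoparquet.to_geodataframe([item])  # one row per item
df.to_parquet("items.parquet")  # GeoDataFrame.to_parquet writes GeoParquet
```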