Geoparquet for STAC items #114

TomAugspurger · 2022-06-30T18:51:46Z

TomAugspurger
Jun 30, 2022
Maintainer

Hi all,

I wanted to share how the Planetary Computer is using GeoParquet to support bulk operations on STAC items. The documentation and examples are at https://planetarycomputer.microsoft.com/docs/quickstarts/stac-geoparquet/, but I'll pull out a few highlights here.

The primary motivation is to support workloads that need to return a ton of STAC items. For this kind of bulk workload, accessing pages of a REST API might be too slow. For these use-cases we'll direct the user to the Parquet files.

We have one GeoParquet dataset per STAC collection. These STAC collections now include a "geoparquet-items" collection-level asset (e.g. https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a).

{
  "geoparquet-items": {
    "href": "abfs://items/sentinel-2-l2a.parquet",
    "type": "application/x-parquet",
    "roles": [
      "stac-items"
    ],
    "title": "GeoParquet STAC items",
    "description": "Snapshot of the collection's STAC items exported to GeoParquet format.",
    "msft:partition_info": {
      "is_partitioned": true,
      "partition_frequency": "W-MON"
    },
    "table:storage_options": {
      "account_name": "pcstacitems"
    }
  }
}

This links to the root of the Parquet dataset in Azure Blob Storage. For example:

In [1]: import dask.dataframe as dd
   ...: import geopandas
   ...: import planetary_computer
   ...: import pystac_client
   ...: import pandas as pd

In [2]: catalog = pystac_client.Client.open(
   ...:     "https://planetarycomputer.microsoft.com/api/stac/v1/"
   ...: )
   ...: collection = catalog.get_collection("io-lulc-9-class")
   ...:
   ...: asset = planetary_computer.sign(collection.assets["geoparquet-items"])
   ...:
   ...: df = geopandas.read_parquet(
   ...:     asset.href, storage_options=asset.extra_fields["table:storage_options"]
   ...: )
   ...: df.head()
Out[2]:
      type stac_version                                    stac_extensions        id  ...              end_datetime                                     proj:transform            start_datetime io:supercell_id
0  Feature        1.0.0  [https://stac-extensions.github.io/projection/...  05L-2017  ... 2018-01-01 00:00:00+00:00       [10.0, 0.0, 169260.0, 0.0, -10.0, 9115700.0] 2017-01-01 00:00:00+00:00             05L
1  Feature        1.0.0  [https://stac-extensions.github.io/projection/...  05M-2017  ... 2018-01-01 00:00:00+00:00  [10.0, 0.0, 166024.76778650173, 0.0, -10.0, 99... 2017-01-01 00:00:00+00:00             05M
2  Feature        1.0.0  [https://stac-extensions.github.io/projection/...  05Q-2017  ... 2018-01-01 00:00:00+00:00  [10.0, 0.0, 178900.28857004497, 0.0, -10.0, 26... 2017-01-01 00:00:00+00:00             05Q
3  Feature        1.0.0  [https://stac-extensions.github.io/projection/...  12Q-2017  ... 2018-01-01 00:00:00+00:00       [10.0, 0.0, 178910.0, 0.0, -10.0, 2657470.0] 2017-01-01 00:00:00+00:00             12Q
4  Feature        1.0.0  [https://stac-extensions.github.io/projection/...  16M-2017  ... 2018-01-01 00:00:00+00:00  [10.0, 0.0, 166023.6435927535, 0.0, -10.0, 999... 2017-01-01 00:00:00+00:00             16M

[5 rows x 18 columns]

Columns / schemas

We chose elevate all the keys under the item's .properties to top-level columns. This matches the behavior of geopandas.GeoDataFrame.from_features.

Each collection has its own parquet schema, reflecting the unique properties available on each collection's items.

Partitioning

Some datasets (like Sentinel-2-L2A) have very many STAC items. Putting all those into a single parquet file would make it awkward to work with and update, so we chose to partition large collections by time into Parquet datasets with many files.

For now, we've put the partition information under an msft: prefix. I'm hoping we can standardize that as a formal STAC extension.

Parquet dataset generation

I have a small library at https://github.com/TomAugspurger/stac-geoparquet for generating these Parquet datasets. It's pretty rough right now, but there are some building blocks that might be useful for others. In particular, there's logic for injecting dynamic assets and links that might be set by an API, and for translating the internal storage format used by pgstac to a valid STAC item. There are a few Planetary Computer specific assumptions made right now, but those should be generalizable without too much effort.

cholmes · 2022-06-30T19:01:00Z

cholmes
Jun 30, 2022
Maintainer

Love it! I do think it'd be good to properly specify the mapping from STAC JSON to geoparquet. Way back at the beginning we had a notion of an 'abstract spec', see https://github.com/radiantearth/stam-spec and that we'd have instantiations other than JSON. It was overkill so we abandoned it. But could be nice to have one place where people can see how the two translate. Could also see things like stactools being able to output geoparquet (perhaps just including your library, but giving it a bit more visibility).

0 replies

kylebarron · 2022-06-30T20:51:46Z

kylebarron
Jun 30, 2022
Maintainer

This is awesome! Thanks for working on this!

Spatial Partitioning

For a dataset like Sentinel 2, my first thought is to agree with you that the first level of partitioning should be time-based, because it's hard to append to a parquet dataset when the primary sort is spatial.

It looks like you're currently partitioning by week? I can't remember how many items that is per partition, but it could be worthwhile to apply a secondary spatial sort to the data within each partition?

In that case it would also be worthwhile to bring bbox elements into individual parquet columns so that the parquet statistics can filter out row groups that don't overlap with the bbox of interest.

2 replies

TomAugspurger Jul 1, 2022
Maintainer Author

It looks like you're currently partitioning by week? I can't remember how many items that is per partition, but it could be worthwhile to apply a secondary spatial sort to the data within each partition?

It looks like most of the recent weeks have ~80,000 items per partition. I just used the defaults from PyArrow for all the compression and type inference, and the recent files are ~90-100 MiB on disk.

catalog = pystac_client.Client.open("https://planetarycomputer-staging.microsoft.com/api/stac/v1/")
collection = catalog.get_collection("sentinel-2-l2a")
asset = planetary_computer.sign(collection.assets["geoparquet-items"])
df = dd.read_parquet(asset.href, storage_options=asset.extra_fields["table:storage_options"])
lengths = df["type"].map_partitions(len).compute()
sns.displot(lengths)

Agreed that enabling efficient spatial access within the time-based partitions would be great. IIUC, there's still work to be done on the metadata and software side for that (#13 and other issues)?

kylebarron Jul 2, 2022
Maintainer

and the recent files are ~90-100 MiB on disk

Ok good to know. So depending on your row group size, spatial sorting might still be valuable if a user wanted a relatively local area.

Details

Interesting.. I didn't know how to actually read the abfs:// link

IIUC, there's still work to be done on the metadata and software side for that

I think there's still work to do before anything is adopted as part of the spec, but prototypes would be valuable to discuss.

I started to explore spatial partitions in this repo (one day will become a blog post). I think especially for a dataset like sentinel 2 where the geometries are relatively small and generally spread out, a hilbert sort might work well. If you bring out each of the STAC Item's bbox columns as one Parquet column, instead of a nested list, then you get the bbox statistics out of the box for each row group for free.

m-mohr · 2022-10-25T22:22:15Z

m-mohr
Oct 25, 2022
Collaborator

Interesting idea! I'm wondering why you've chosen to include the highly repetive columns such as type? Is it negligible due to good compression?

4 replies

kylebarron Oct 25, 2022
Maintainer

With the right encoding (e.g. dictionary or maybe rle) a column of constant string values should take up very few bytes.

TomAugspurger Oct 26, 2022
Maintainer Author

I think my hope was that including type would make it easier to convert from this format back to STAC items. In practice, ends up being somewhat complicated to go from a geopandas.GeoDataFrame of these to STAC anyway, so it might make sense to just leave these static columns out of the parquet files.

As Kyle mentioned these columns should be small on disk. But if you load them into a DataFrame they'll (by default) take up much more memory.

kylebarron Oct 26, 2022
Maintainer

But if you load them into a DataFrame they'll (by default) take up much more memory.

Does that depend on the encoding of how they're stored? I would think that if they're stored in parquet with dictionary encoding, then they'd likely be loaded into dataframe libraries like pandas as categoricals?

kylebarron Oct 26, 2022
Maintainer

See this pandas example where a categorical is saved and re-loaded as a categorical (so in memory 0 is stored for each row, not 'Feature').

import pandas as pd
stac_types = ['Feature'] * 100
df = pd.DataFrame({'type': pd.Categorical(stac_types)})
df.to_parquet('categorical.parquet')
pd.read_parquet('categorical.parquet')['type'].dtype
# CategoricalDtype(categories=['Feature'], ordered=False)

This isn't just due to pandas' parquet metadata; it's actually stored as a dictionary type (actually this might be the arrow metadata; would need to double check):

import pyarrow.parquet as pq
table = pq.read_table('categorical.parquet')
table.schema.field('type')
# pyarrow.Field<type: dictionary<values=string, indices=int32, ordered=0>>

So even if you load it into another dataframe library (that supports categoricals) like polars, it's still loaded correctly (without expanding into a huge string column):

import polars as pl
df = pl.read_parquet('categorical.parquet')
df['type'].dtype
# polars.datatypes.Categorical
df.estimated_size()
# 423

ivanhigueram · 2023-03-17T21:58:51Z

ivanhigueram
Mar 17, 2023

This is amazing Tom!

I am wondering if there's any tool to do interoperable operations between the parquet rows and pystac.Item objects. I created a function of my own following the Pystac tutorial but was wondering if there's something more robust.

1 reply

TomAugspurger Mar 19, 2023
Maintainer Author

Some discussion at microsoft/PlanetaryComputer#200 (comment), for those following along.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Geoparquet for STAC items #114

{{title}}

Replies: 4 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Geoparquet for STAC items #114

TomAugspurger Jun 30, 2022 Maintainer

Columns / schemas

Partitioning

Parquet dataset generation

Replies: 4 comments · 7 replies

cholmes Jun 30, 2022 Maintainer

kylebarron Jun 30, 2022 Maintainer

Spatial Partitioning

TomAugspurger Jul 1, 2022 Maintainer Author

kylebarron Jul 2, 2022 Maintainer

m-mohr Oct 25, 2022 Collaborator

kylebarron Oct 25, 2022 Maintainer

TomAugspurger Oct 26, 2022 Maintainer Author

kylebarron Oct 26, 2022 Maintainer

kylebarron Oct 26, 2022 Maintainer

ivanhigueram Mar 17, 2023

TomAugspurger Mar 19, 2023 Maintainer Author

TomAugspurger
Jun 30, 2022
Maintainer

Replies: 4 comments 7 replies

cholmes
Jun 30, 2022
Maintainer

kylebarron
Jun 30, 2022
Maintainer

TomAugspurger Jul 1, 2022
Maintainer Author

kylebarron Jul 2, 2022
Maintainer

m-mohr
Oct 25, 2022
Collaborator

kylebarron Oct 25, 2022
Maintainer

TomAugspurger Oct 26, 2022
Maintainer Author

kylebarron Oct 26, 2022
Maintainer

kylebarron Oct 26, 2022
Maintainer

ivanhigueram
Mar 17, 2023

TomAugspurger Mar 19, 2023
Maintainer Author