Bs/fix target date schema/265 #267

Open · wants to merge 2 commits into main
Conversation

@bsweger bsweger commented Jan 10, 2025

Resolves #265

Background

We want Arrow to read the nowcast_date and sequence_as_of columns as the string data type rather than large_string (see the issue linked above for more details).

This PR changes how both the oracle and time-series target data parquet files are written, to ensure that Arrow reads all text columns as string.

Testing

The tests in get_target_data.py now read the target data files with PyArrow to confirm the schema.

Some manual poking around confirmed this in R.

Reading a single time-series file before the change (partitioned reads didn't work, which is the reason for this PR):

FileSystemDataset with 1 Parquet file
7 columns
target_date: date32[day]
location: large_string
clade: large_string
observation: uint32
nowcast_date: large_string
sequence_as_of: large_string
tree_as_of: large_string

Reading partitioned time series files after the change:

> ds <- arrow::open_dataset("/Users/rsweger/code/variant-nowcast-hub/target-data/time-series/", format="parquet")
> ds
FileSystemDataset with 2 Parquet files
7 columns
target_date: date32[day]
location: string
clade: string
observation: int32
nowcast_date: string
sequence_as_of: string
tree_as_of: string

Prior to this change, target data files were created using the Polars
write_parquet method. However, that operation results in column datatypes
that Arrow interprets as large_string. Because we include the partition
fields in the parquet file itself, the large_string type causes a mismatch
when R packages use arrow::open_dataset to read the files (the value in the
partition key is read as a string, not a large_string).
Successfully merging this pull request may close these issues.

Resolve string vs large_string type error when opening target data files in R