Bs/fix target date schema/265 #267

Open · wants to merge 2 commits into main
Conversation

@bsweger bsweger commented Jan 10, 2025

Resolves #265

Background

We want Arrow to read the nowcast_date and sequence_as_of columns as the string data type rather than large_string (see the issue linked above for more details).

This PR changes how both the oracle and time-series target data parquet files are written, to ensure that Arrow reads all text columns as string.

Testing

The tests in get_target_data.py now read the target data files with PyArrow to confirm the schema.

Some manual poking around confirmed this in R.

Reading a single time-series file before the change (partitioned reads didn't work, which is the reason for this PR):

FileSystemDataset with 1 Parquet file
7 columns
target_date: date32[day]
location: large_string
clade: large_string
observation: uint32
nowcast_date: large_string
sequence_as_of: large_string
tree_as_of: large_string

Reading partitioned time series files after the change:

> ds <- arrow::open_dataset("/Users/rsweger/code/variant-nowcast-hub/target-data/time-series/", format="parquet")
> ds
FileSystemDataset with 2 Parquet files
7 columns
target_date: date32[day]
location: string
clade: string
observation: int32
nowcast_date: string
sequence_as_of: string
tree_as_of: string

Prior to this change, target data files were created using the Polars
write_parquet method. However, that operation results in column datatypes
that Arrow interprets as large_string. Because we include the partition
fields in the parquet file itself, the large_string type causes a mismatch
when R packages use arrow::open_dataset to read the files (the value in the
partition key is read as a string, not a large_string).
Successfully merging this pull request may close these issues.

Resolve string vs large_string type error when opening target data files in R