
Decision post on what to use for pre-processing / cleaning of data before storing as Parquet files or as raw data files #143

Closed
lwjohnst86 opened this issue Nov 25, 2024 · 2 comments · Fixed by #145

lwjohnst86 commented Nov 25, 2024

For instance, which package should we use internally as well as in the example repos? Packages to look into include, for example, Polars and DuckDB. Do this after starting the example repos, so that both can also be tested on the data chosen in seedcase-project/data#22.

@lwjohnst86 lwjohnst86 converted this from a draft issue Nov 25, 2024

K-Beicher commented Nov 26, 2024

@lwjohnst86 First thoughts.

Pre-cleaning tool:
This will mainly check that the data the user is trying to upload matches the stated definitions in the data package file. The tool won't necessarily be able to clean data, but it will be able to point out if a column contains values that are incompatible with the metadata-defined data type (e.g. TBC being uploaded against the numerical Height).

Processing:
Can include pre-cleaning, but is more concerned with how to get the data into the Parquet files, including whether to assign standard values to missing items, streamlining date fields, etc. Unless of course we don't consider transformation of data to be part of what Sprout will do on inclusion of data?

@lwjohnst86
Member Author

Hmm, they aren't really two things. I'll revise the title.

The cleaning / pre-processing would be the standard steps people usually do to tidy up data, especially data from non-database sources like CSV. For instance, the column headers might actually start on row 5 because there are comments in the first 4 lines, or some of the data might initially get classified as character when it is actually an integer. These are activities I assume you've done a lot of when working with data.

To clarify the purposes of using these tools:

  • Before entering data into Sprout, the user will almost certainly have to do some pre-processing. So this decision record is partly about deciding which tool we show in our examples/docs and recommend using.
  • For building the Sprout extensions for pre-processing certain types of standard data (e.g. from a specific lab machine), we will need to use a tool for doing that work. This decision post would inform us on which tool to use when adding to our extensions.

These tools are not directly related to data packages; they are only related because we use data packages. They will also not be directly responsible for getting the data into the Parquet files, but they will be partly involved (maybe mainly for the saving-to-Parquet functionality). But yes, there will be some functions we may create to solve common things, like fixing dates or standardizing missing values.

@lwjohnst86 lwjohnst86 changed the title from "Decision post on what to use for processing and pre-cleaning data" to "Decision post on what to use for pre-processing / cleaning of data before storing as Parquet files or as raw data files" Nov 26, 2024
@K-Beicher K-Beicher linked a pull request Dec 4, 2024 that will close this issue
@K-Beicher K-Beicher moved this from In Progress to In Review in Team project planning Dec 4, 2024
@K-Beicher K-Beicher moved this from In Review to In Progress in Team project planning Dec 6, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Team project planning Dec 12, 2024