
Decision post on what to use for pre-processing / cleaning of data before storing as Parquet files or as raw data files #143

Closed
lwjohnst86 opened this issue Nov 25, 2024 · 2 comments · Fixed by #145

lwjohnst86 commented Nov 25, 2024

For instance, which package should we use internally as well as in the example repos? Packages to look into include, for example, Polars and DuckDB. Do this after starting the example repos, so that both can also be tested on the data chosen in seedcase-project/data#22.

@lwjohnst86 lwjohnst86 converted this from a draft issue Nov 25, 2024

K-Beicher commented Nov 26, 2024

@lwjohnst86 First thoughts.

Pre-cleaning tool:
This will mainly check that the data the user is trying to upload matches the stated definitions in the data package file. The tool won't necessarily be able to clean data, but it will be able to point out if a column contains values that are incompatible with the metadata-defined data type (e.g. TBC being uploaded against the numerical Height).

Processing:
Can include pre-cleaning, but is more concerned with how to get the data into the Parquet files, including whether to assign standard values to missing items, streamlining date fields, etc. Unless of course we don't consider transformation of data to be part of what Sprout will do on inclusion of data?

@lwjohnst86
Member Author

Hmm, they aren't really two things. I'll revise the title.

The cleaning / pre-processing would be the standard steps people usually do to tidy up data, especially data from non-database sources like CSV. For instance, the column headers might actually start on row 5 because there are comments in the first 4 lines, or some of the data might initially get classified as character when it is actually an integer. These are activities I assume you've done a lot of when working with data.

To clarify the purposes of using these tools:

  • Before entering data into Sprout, the user will almost certainly have to do some pre-processing. So this decision record is partly about deciding which tool we show in our examples/docs and recommend using.
  • For building the Sprout extensions for pre-processing certain types of standard data (e.g. from a specific lab machine), we will need to use a tool for doing that work. This decision post would inform us on which tool to use when adding to our extensions.

These tools are not directly related to data packages; they are only related because we use data packages. They will also not be directly responsible for getting the data into the Parquet files, but they will be partly involved (maybe mainly for the saving-to-Parquet functionality). But yes, there will be some functions we may create to solve common things, like fixing dates or standardizing missing values.

@lwjohnst86 lwjohnst86 changed the title from "Decision post on what to use for processing and pre-cleaning data" to "Decision post on what to use for pre-processing / cleaning of data before storing as Parquet files or as raw data files" Nov 26, 2024
@K-Beicher K-Beicher linked a pull request Dec 4, 2024 that will close this issue
@K-Beicher K-Beicher moved this from In Progress to In Review in Team project planning Dec 4, 2024
@K-Beicher K-Beicher moved this from In Review to In Progress in Team project planning Dec 6, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Team project planning Dec 12, 2024