Decision post on what to use for pre-processing / cleaning of data before storing as Parquet files or as raw data files #143
> @lwjohnst86 First thoughts. Pre-cleaning tool: Processing:
Hmm, they aren't really two things. I'll revise the title. The cleaning / pre-processing would be the standard steps people usually do to tidy up data, especially data from non-database sources like CSV. For instance, the column headers might actually start on row 5 because there are comments in the first 4 lines, or some of the data might initially be classified as character when it is actually an integer, and so on. These are activities I assume you've done a lot of when working with data. To clarify the purpose of using these tools:
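The two problems mentioned above (headers buried below comment lines, and text columns that are really integers) can be sketched with nothing but the standard library. The file contents and column names here are made up for illustration:

```python
import csv
import io

# Hypothetical raw export: four comment lines sit above the real header,
# and the "age" column arrives as text even though it is an integer.
raw = """# exported 2024-01-01
# source: survey
# contact: (redacted)
# do not edit
name,age
Alice,34
Bob,29
"""

# Skip the leading comment lines so the header on row 5 is used.
data_lines = [line for line in raw.splitlines() if not line.startswith("#")]
rows = list(csv.DictReader(io.StringIO("\n".join(data_lines))))

# Fix the column type: cast "age" from str to int.
for row in rows:
    row["age"] = int(row["age"])

print(rows[0])  # {'name': 'Alice', 'age': 34}
```

In practice a library like Polars handles both steps declaratively (e.g. a skip-rows argument and an explicit dtype/cast), but the underlying cleaning logic is the same.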
These tools are not directly related to data packages; they are only related because we use data packages. Nor will they be directly responsible for getting the data into the Parquet files, though they will be partly involved (maybe mainly in the save-to-Parquet functionality). But yes, we may create some functions to solve common problems, like fixing dates or standardizing missing values.
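As a sketch of the "standardizing missing values" idea mentioned above: raw CSVs often encode missingness in several inconsistent ways, and a small helper can normalize them to a single sentinel before the data is saved. The token set and function name here are hypothetical:

```python
# Hypothetical set of tokens that should all be treated as missing.
MISSING_TOKENS = {"", "NA", "N/A", "null", "-", "?"}

def standardize_missing(value: str):
    """Return None for any common missing-value token, else the value unchanged."""
    return None if value.strip() in MISSING_TOKENS else value

cleaned = [standardize_missing(v) for v in ["3.2", "NA", " ", "N/A", "7"]]
print(cleaned)  # ['3.2', None, None, None, '7']
```

A dataframe library would express this as a column-wise replace/cast, but the contract is the same: one canonical missing marker going into the Parquet file.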
For instance, which package should we use internally as well as in the example repos? Packages to look into include, for example, Polars and DuckDB. Do this after starting the example repos, so that you can also test both on the data chosen in seedcase-project/data#22.
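To make the comparison concrete: DuckDB can do the read-clean-write step as a single SQL statement, which is one axis to evaluate it on. A minimal sketch, assuming a local `data.csv` (the file names are placeholders):

```sql
-- Read a CSV with inferred types and write it straight to Parquet.
COPY (SELECT * FROM read_csv_auto('data.csv'))
TO 'data.parquet' (FORMAT PARQUET);
```

Polars covers the same path through its dataframe API (roughly a `read_csv` call followed by `write_parquet`), so the decision is likely to hinge on ergonomics for the cleaning steps in between rather than on the Parquet output itself.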