Allow parquet and compressed csv files in DB->Snowflake replication #480

Open · nixent opened this issue Jan 7, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

nixent commented Jan 7, 2025

Feature Description

DB->Snowflake replication processes data in 3 steps:

  • export from the DB to local CSV files
  • PUT the CSV files to an internal stage in Snowflake
  • insert/upsert of the records from the staged CSV files into the target table

This works fine; however, the CSV files can get pretty large, and the file transfer can take significant time. I suggest adding an option to compress the files and/or store them in parquet format.
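
For illustration, here is a rough sketch of what the current CSV flow looks like on the Snowflake side (stage, table, and file names here are made up, not what sling actually generates):

```sql
-- Step 2: upload the exported CSV to an internal stage.
-- AUTO_COMPRESS = TRUE gzips the file during the PUT, which already reduces transfer size.
PUT file:///tmp/sling/public_orders.csv @~/sling_staging AUTO_COMPRESS = TRUE;

-- Step 3: load the staged file into the target table.
COPY INTO public.orders
FROM @~/sling_staging/public_orders.csv.gz
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1 COMPRESSION = GZIP);
```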

nixent added the enhancement (New feature or request) label on Jan 7, 2025
flarco (Collaborator) commented Jan 7, 2025

The issue with parquet is that it can take a lot of memory, while CSV streams. It's worth an experiment.
The temp CSVs should be compressed, though (with zstd). Can you confirm that?
Do you have any non-parquet suggestions? This is the fastest approach I can think of for loading into Snowflake. You can use S3 as temp storage, but it's not going to speed things up. The other route could be Snowpipe Streaming, but that's a non-starter as it requires a lot of setup.
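
As a rough sketch (hypothetical file and stage names), a temp CSV that is already zstd-compressed on disk could be uploaded like this, skipping the extra gzip pass that AUTO_COMPRESS would otherwise apply:

```sql
-- Tell PUT the file is already zstd-compressed instead of gzipping it again.
PUT file:///tmp/sling/public_orders.csv.zst @~/sling_staging
  SOURCE_COMPRESSION = ZSTD
  AUTO_COMPRESS = FALSE;
```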

nixent (Author) commented Jan 8, 2025

Parquet vs CSV streaming: it would be nice to let the user decide whether they are willing to allocate more resources to the EL process.

Compression:
Snowflake supports multiple COMPRESSION formatTypeOptions:
COMPRESSION = AUTO | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | NONE
So either GZIP or ZSTD would work.
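
For example (hypothetical table/stage names), the COPY side would just declare the matching compression in the file format:

```sql
-- Load a zstd-compressed CSV from the internal stage.
COPY INTO public.orders
FROM @~/sling_staging/public_orders.csv.zst
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 COMPRESSION = ZSTD);
```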

Resource utilization:
I'm wondering whether total resource utilization would actually differ between writing a CSV to disk and compressing it vs writing parquet. Parquet is essentially data compressed column by column, so producing a parquet file should take roughly the same resources as producing a compressed CSV. Limiting the number of records per parquet file would help cap memory/CPU utilization, the same way the parquet file target does.

flarco (Collaborator) commented Jan 8, 2025

> it would be nice to let the user decide whether they are willing to allocate more resources to the EL process.

Agreed.

> I'm wondering whether total resource utilization would actually differ between writing a CSV to disk and compressing it vs writing parquet.

Perhaps... But yeah, I just read a thread on this: once the files are in the internal stage, parquet will be faster for the Snowflake engine to load due to its columnar/binary nature.
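
A hypothetical parquet variant of the load (made-up names again, not sling's generated SQL) would look roughly like:

```sql
-- Upload the parquet file as-is (it is already compressed internally).
PUT file:///tmp/sling/public_orders.parquet @~/sling_staging AUTO_COMPRESS = FALSE;

-- Load it, mapping parquet columns to table columns by name.
COPY INTO public.orders
FROM @~/sling_staging/public_orders.parquet
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```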
