
Stream metadata from S3, filter, and compress #92

Merged: 1 commit into main from stream-metadata-from-s3 on Mar 22, 2024

Conversation

@huddlej (Contributor) commented on Mar 21, 2024

Description of proposed changes

Instead of downloading the complete metadata files from S3, uncompressing them to disk, filtering the uncompressed files down to the subset of columns we want (written to another uncompressed file on disk), and then deleting the full metadata, this commit proposes streaming the original metadata from S3 through the filter step and writing the subset of metadata directly to a compressed file.

This change replaces the vendored `download-from-s3` script with `aws s3 cp`, since the latter can stream to stdout while the former cannot because it also writes log messages to stdout.

Testing with the "open" ncov dataset, this change avoids storing an 11GB full metadata file locally and produces a 71MB compressed subset metadata file in roughly the same amount of processing time.

Related issue

Closes #80

Checklist

  • Checks pass

@huddlej requested a review from @joverlee521 on March 21, 2024 23:07
@huddlej (Contributor, Author) commented on Mar 21, 2024

@joverlee521 This is just a suggestion PR after I tried and failed to run the ingest workflow on my work laptop because I didn't have enough disk space to even process the open data. If this seems like a terrible idea for other reasons I'm naive about, don't hesitate to close this.

@joverlee521 (Contributor) left a comment

Thanks @huddlej for digging into this!

The changes look good to me, feel free to merge if the trial run completes successfully.


I do wonder whether this subsetting of metadata would be useful to add to the shared ingest/download-from-s3, but as of right now I can't think of any other workflow that would use this feature.

@huddlej (Contributor, Author) commented on Mar 22, 2024

Thanks, @joverlee521! The trial run looks like it ran as expected, so I will merge this. I don't see a notification in the #nextstrain-counts-updates Slack channel, but maybe that's ok?

@huddlej merged commit a2fa5de into main on Mar 22, 2024 (11 checks passed)
@huddlej deleted the stream-metadata-from-s3 branch on March 22, 2024 23:09
@joverlee521 (Contributor) commented

> I don't see a notification in the #nextstrain-counts-updates Slack channel, but maybe that's ok?

Ah yeah, I used the #scratch channel when I triggered the trial run so the notifications wouldn't mix in with the "real" ones.
I see they came through just fine!

Merging this pull request closes: Reduce disk write of 20+GB metadata file by filtering on the fly (#80)