
Stream metadata from S3, filter, and compress #92

Merged: 1 commit into main from stream-metadata-from-s3 on Mar 22, 2024

Conversation

@huddlej (Contributor) commented on Mar 21, 2024

Description of proposed changes

Instead of downloading the complete metadata files from S3, uncompressing them to disk, filtering the uncompressed files down to the subset of columns we want (written to another uncompressed file on disk), and then deleting the full metadata, this commit proposes streaming the original metadata from S3 through the filter step and writing the subset of metadata directly to a compressed file.

This change replaces the vendored `download-from-s3` script with `aws s3 cp`, since the latter can stream to stdout while the former cannot because it also writes log messages to stdout.

Testing with the "open" ncov dataset, this change avoids storing an 11GB full metadata file locally and produces a 71MB compressed subset metadata file in roughly the same amount of processing time.

Related issue

Closes #80

Checklist

  • Checks pass

@huddlej requested a review from @joverlee521 on March 21, 2024 23:07
@huddlej (Contributor, Author) commented on Mar 21, 2024

@joverlee521 This is just a suggestion PR after I tried and failed to run the ingest workflow on my work laptop because I didn't have enough disk space to even process the open data. If this seems like a terrible idea for other reasons I'm naive about, don't hesitate to close this.

@joverlee521 (Contributor) left a comment

Thanks @huddlej for digging into this!

The changes look good to me, feel free to merge if the trial run completes successfully.


I do wonder whether this subsetting of metadata would be useful to add to the shared ingest/download-from-s3, but as of right now I can't think of any other workflow that would use this feature.

@huddlej (Contributor, Author) commented on Mar 22, 2024

Thanks, @joverlee521! The trial run looks like it ran as expected, so I will merge this. I don't see a notification in the #nextstrain-counts-updates Slack channel, but maybe that's ok?

@huddlej merged commit a2fa5de into main on Mar 22, 2024 (11 checks passed)
@huddlej deleted the stream-metadata-from-s3 branch on March 22, 2024 23:09
@joverlee521 (Contributor) commented

> I don't see a notification in the #nextstrain-counts-updates Slack channel, but maybe that's ok?

Ah yeah, I used the #scratch channel when I triggered the trial run so the notifications wouldn't mix in with the "real" ones.
I see they came through just fine!

Merging this pull request closes: Reduce disk write of 20+GB metadata file by filtering on the fly (#80)