Stream metadata from S3, filter, and compress #92
Description of proposed changes
Previously, the workflow downloaded the complete metadata files from S3, uncompressed them to disk, filtered the uncompressed file down to the subset of columns we want (written as another uncompressed file), and then deleted the full metadata. This commit proposes instead streaming the original metadata from S3 through the filter step and writing the subset of metadata directly to a compressed file.
This change replaces the vendored `download-from-s3` script with `aws s3 cp`, since the latter can stream to stdout while the former cannot (it also writes log messages to stdout).

Testing with the "open" ncov dataset, this change avoids storing an 11 GB full metadata file locally and produces a 71 MB compressed subset metadata file in roughly the same amount of processing time.
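For illustration, the streaming approach amounts to a pipeline along these lines. This is a minimal sketch, not the exact commands in this commit: the bucket path, column names, and the choice of `tsv-select` (from tsv-utils) as the filter tool are all stand-in assumptions.

```sh
# Hypothetical sketch: stream compressed metadata from S3, filter to a column
# subset, and recompress, without ever writing the full file to disk.
# The bucket path, columns, and filter tool below are illustrative assumptions.
aws s3 cp s3://example-bucket/metadata.tsv.gz - \
    | gunzip -c \
    | tsv-select -H -f strain,date,region \
    | gzip -c \
    > subset_metadata.tsv.gz
```

The key detail is `aws s3 cp ... -`, which writes the object to stdout so each stage of the pipeline can consume it incrementally.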
Related issue
Closes #80
Checklist