Application for streaming contents from s3 objects while hashing to generate md5 checksum and AWS etag

obenshaindw/S3_Stream

####Motivation: s3cmd and aws-cli do not calculate the etag value for a download. If either tool errors out, you are left with a truncated file and possibly a message on stderr. If you’re running large jobs (i.e., accessing tens of thousands of objects) and consuming the objects as streams (i.e., everything in memory), these errors are easy to miss. This script calculates the etag and md5 checksum directly on the file stream, producing .etag and .md5 files for the s3 object. It also compares the calculated etag against Amazon’s etag to confirm that the downloaded data is consistent with the stored object.

####Features:

  • makes requests of the s3 API in chunks, with the chunk size based on the number of parts the object has (typically 15 MB each),
  • hashes each chunk in order to reproduce the amazon etag value,
  • hashes the full stream to generate the file’s md5 checksum,
  • downloads the file to disk (with -o, --output-file),
  • streams the file (with - or --stdin).
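The chunked hashing described in the list above can be sketched as follows. This is a minimal illustration and not the script's actual code: the function names `byte_ranges` and `etag_and_md5` are hypothetical, and it assumes the widely observed (but not formally guaranteed) convention that a multipart etag is the md5 of the concatenated per-part md5 digests followed by `-<number of parts>`, while an object uploaded in a single request has an etag equal to its plain md5. Objects encrypted with some server-side encryption modes do not follow this rule.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def byte_ranges(total_size: int, part_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte offsets, one pair per part.

    Each pair can be used to build an HTTP Range header such as
    "bytes=start-end" for a ranged GET request.
    """
    for start in range(0, total_size, part_size):
        yield start, min(start + part_size, total_size) - 1


def etag_and_md5(parts: Iterable[bytes]) -> Tuple[str, str]:
    """Return (etag, md5_hex) for a stream of parts, reading each part once.

    Assumption: parts arrive in order and with the same boundaries that were
    used when the object was uploaded; otherwise the etag will not match.
    """
    full = hashlib.md5()   # running md5 over the whole stream
    part_digests = []      # raw 16-byte md5 digest of each part
    for part in parts:
        full.update(part)
        part_digests.append(hashlib.md5(part).digest())

    if len(part_digests) <= 1:
        # Assumed non-multipart upload: the etag is the plain md5.
        return full.hexdigest(), full.hexdigest()

    # Multipart convention: md5 over the concatenated part digests,
    # suffixed with the part count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}", full.hexdigest()
```

Comparing the returned etag with the one reported by S3 (with the surrounding quotes stripped) gives the consistency check; a mismatch usually means either a truncated stream or a part size that differs from the one used at upload time.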

####Requirements:

  • You will need some extra Python modules (boto and progressbar; the rest should come with Python).
  • You will need a .boto file in your home directory, with your AWS keys.
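For reference, a `.boto` file in the standard boto configuration format looks roughly like the fragment below. The two values are placeholders to be replaced with your own keys.

```ini
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

Since the file holds secrets, it is sensible to make it readable only by your own user.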

####Could use some additional improvements:

#####FIXME:

  • Fix retry mechanism for handling errors when making requests of the API
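One possible direction for the retry item, shown only as a generic sketch and not as the script's current behavior: wrap each chunk request in a helper that retries with exponential backoff. The helper name `with_retries` and its parameters are hypothetical. It catches every `Exception` for brevity; real code would more likely retry only on the specific network or API errors that are known to be transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts are used up.

    After the n-th failure (counting from 0) the helper waits
    base_delay * 2**n seconds; the final failure is re-raised.
    """
    for n in range(attempts):
        try:
            return fetch()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** n)
    raise ValueError("attempts must be at least 1")
```

Because a failed chunk is requested again as a whole, the per-chunk hash should only be updated once the chunk has been received successfully, so that retries do not feed partial data into the etag or md5 calculation.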
