Context
At the conclusion of Epic #1026, the OriginalFile model will be created and integrated into the codebase.
Computed files comprise the archive of data and metadata files of a variety of samples inside of a project and a dataset. Over time, data and metadata files are updated, forcing computed files to be regenerated as the old ones become inaccurate.
Problem or idea
We would like to know when computed files must be regenerated. Here are two possible approaches to handling this.
First Option: Collective File Hashing
One way to detect whether data files have been altered is to compute a collective hash of all data files associated with a project or dataset, and compare that stored value to a newly computed hash every time original files are synced with the s3 input bucket. Should the hashes not match, then some computed files would need to be regenerated.
How AWS calculates a collective hash [1]
It seems like the simplest way to compute these collective hashes would be to do what AWS does. AWS creates a 32 character MD5 hash, referred to as an ETag, for every file uploaded to s3 so that it can tell when a file has been modified. When a file's size is above a certain threshold (see the linked article for details), AWS splits the file into chunks and uploads those chunks one at a time. Each chunk is hashed as it's received, and once all chunks are received AWS concatenates the hash values of all of the file's chunks and generates a new hash for the entire file from that concatenation (appending a -# to the end of the hash to indicate how many chunks the file was uploaded in). Concatenating the hash values of all files inside a certain computed file, hashing that value, and then comparing the new hash to the old hash would be a quick way for us to ascertain whether or not computed files need to be regenerated.
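To make the mechanism concrete, here is a minimal sketch of an ETag-style per-file hash plus a collective hash over a set of files. The chunk size and function names are illustrative assumptions, not AWS's exact algorithm, which varies by upload client and part size.

```python
import hashlib

# Illustrative chunk size; the real multipart part size depends on the uploading client.
CHUNK_SIZE = 8 * 1024 * 1024


def etag_style_hash(file_path: str) -> str:
    """Plain MD5 for single-chunk files; otherwise MD5 of the concatenated
    per-chunk MD5 digests, suffixed with "-<number of chunks>"."""
    chunk_digests = []
    with open(file_path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            chunk_digests.append(hashlib.md5(chunk).digest())

    if len(chunk_digests) <= 1:
        # Small (or empty) file: the hash is just the file's MD5 hex digest.
        return chunk_digests[0].hex() if chunk_digests else hashlib.md5(b"").hexdigest()

    combined = hashlib.md5(b"".join(chunk_digests)).hexdigest()
    return f"{combined}-{len(chunk_digests)}"


def collective_hash(file_hashes: list[str]) -> str:
    """Concatenate all per-file hashes in a stable order and hash the result."""
    return hashlib.md5("".join(sorted(file_hashes)).encode("utf-8")).hexdigest()
```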
When to calculate the collective hashes
As touched upon above, the most logical time for this seems to be at the conclusion of syncing the OriginalFile table with the s3 input bucket. The most efficient way to do this would be to hash all of a project's original files and compare the result to the current hash stored on the project instance. If a project's files have been altered, then we should regenerate the project's computed files. We could also query all datasets associated with the project, and either regenerate their computed files as well or hash their file contents and compare the new hash to their stored hash to see if they were affected.
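A rough sketch of what that post-sync check could look like, assuming hypothetical fields and helpers (Project.collective_hash, project.original_files, regenerate_computed_files, and a collective_hash() helper like the one sketched above); the real names would come out of Epic #1026.

```python
def check_project_after_sync(project):
    """Recompute the project's collective hash after sync_original_files and
    regenerate computed files only when it has changed. All model fields and
    helpers referenced here are hypothetical."""
    new_hash = collective_hash([of.hash for of in project.original_files.all()])
    if new_hash == project.collective_hash:
        return  # nothing changed, computed files are still accurate

    project.collective_hash = new_hash
    project.save(update_fields=["collective_hash"])
    project.regenerate_computed_files()

    # Check each associated dataset the same way.
    for dataset in project.datasets.all():
        dataset_hash = collective_hash([of.hash for of in dataset.original_files.all()])
        if dataset_hash != dataset.collective_hash:
            dataset.collective_hash = dataset_hash
            dataset.save(update_fields=["collective_hash"])
            dataset.regenerate_computed_files()
```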
Another Approach: Leveraging OriginalFile relations
An alternative approach would be to add an attribute called "tainted" to the OriginalFile model, which would be updated when we run the sync_original_files management command, setting OriginalFile::tainted=True if the hash of a file in s3 doesn't match the hash of that file in the DB. After the syncing of original files is complete, we would query all Projects and Datasets associated with tainted original files and regenerate all affected computed files. After the last computed file has been regenerated, we would come back and set OriginalFile::tainted=False.
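For illustration, a sketch of both halves of this flow, again with assumed model and field names (OriginalFile.tainted, s3_key, hash, and regenerate_computed_files()), written in Django ORM style:

```python
def mark_tainted_files(bucket_files: dict[str, str]):
    """During sync_original_files: flag files whose s3 hash differs from the DB.
    bucket_files maps s3 key -> current s3 hash; all names here are assumptions."""
    for s3_key, s3_hash in bucket_files.items():
        original_file = OriginalFile.objects.filter(s3_key=s3_key).first()
        if original_file and original_file.hash != s3_hash:
            original_file.hash = s3_hash
            original_file.tainted = True
            original_file.save(update_fields=["hash", "tainted"])


def regenerate_for_tainted_files():
    """After syncing: regenerate computed files for every Project and Dataset
    touched by a tainted OriginalFile, then clear the flags."""
    tainted = OriginalFile.objects.filter(tainted=True)

    for project in Project.objects.filter(original_files__in=tainted).distinct():
        project.regenerate_computed_files()
    for dataset in Dataset.objects.filter(original_files__in=tainted).distinct():
        dataset.regenerate_computed_files()

    tainted.update(tainted=False)
```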
Footnotes
[1] https://www.linkedin.com/pulse/how-aws-s3-e-tags-work-marco-rizk-iefwf