
Discussion - Clarify Hashing Mechanism with OriginalFile, Project and Dataset Models #1027

Closed
avrohomgottlieb opened this issue Dec 12, 2024 · 1 comment

@avrohomgottlieb
Contributor

avrohomgottlieb commented Dec 12, 2024

Context

At the conclusion of Epic #1026, the OriginalFile model will be created and integrated into the codebase.

Computed files comprise the archives of data and metadata files for the samples within a project or dataset. Over time, data and metadata files are updated, forcing computed files to be regenerated as the old ones become inaccurate.

Problem or idea

We would like to know when computed files must be regenerated. Here are two possible approaches to handling this.

First Option: Collective File Hashing

One way to verify whether data files have been altered is to compute a collective hash of all data files associated with a project or dataset, and compare that stored value to a newly computed hash every time original files are synced with the s3 input bucket. Should the hashes not match, some computed files would need to be regenerated.

How AWS calculates a collective hash 1

It seems like the simplest way to compute these collective hashes would be to do what AWS does. AWS creates a 32-character hash, called an ETag, for every file uploaded to s3 so that it can tell when a file has been modified. When a file's size is above a certain threshold (see the linked article for details), AWS splits the file into chunks and uploads the chunks one at a time. Each chunk is hashed as it is received, and once all chunks are received AWS concatenates the chunk hashes and generates a new hash for the entire file from that concatenation (appending a -# to the end of the hash to indicate how many chunks the file was uploaded in). Concatenating the hash values of all files inside a given computed file, hashing that value, and then comparing the new hash to the old hash would be a quick way for us to ascertain whether or not computed files need to be regenerated.
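
As a rough illustration, here is a minimal Python sketch of how an S3-style multipart ETag and a collective hash over per-file hashes could be computed. The chunk size and helper names are assumptions for illustration, not the portal's actual implementation:

```python
import hashlib

# Assumed chunk size for illustration; AWS's actual part size varies by upload.
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB


def multipart_etag(file_path: str, chunk_size: int = CHUNK_SIZE) -> str:
    """Reproduce an S3-style ETag: the MD5 of the file for single-part uploads,
    or the MD5 of the concatenated per-chunk MD5 digests with a "-N" suffix."""
    chunk_digests = []
    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            chunk_digests.append(hashlib.md5(chunk).digest())
    if not chunk_digests:
        return hashlib.md5(b"").hexdigest()  # empty file
    if len(chunk_digests) == 1:
        return chunk_digests[0].hex()
    combined = hashlib.md5(b"".join(chunk_digests)).hexdigest()
    return f"{combined}-{len(chunk_digests)}"


def collective_hash(file_hashes: list[str]) -> str:
    """Concatenate per-file hashes in a stable order and hash the result."""
    return hashlib.md5("".join(sorted(file_hashes)).encode()).hexdigest()
```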

When to calculate the collective hashes

As touched upon above, the most logical time to do this seems to be at the conclusion of syncing the OriginalFile table with the s3 input bucket. The most efficient approach would be to hash all of a project's original files and compare the result to the hash currently stored on the project instance. If a project's files have been altered, then we should regenerate the project's computed files. We could also query all datasets associated with the project and either regenerate them as well, or hash their file contents and compare the new hash to their stored hash to see whether they were affected.
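
For illustration, a minimal sketch of what that post-sync check could look like, reusing the collective_hash helper sketched above. The Django-style relation and field names here (Project.combined_hash, OriginalFile.hash, project.original_files, project.datasets) and the regenerate_computed_files helper are hypothetical, not the actual models:

```python
def check_projects_after_sync(projects):
    """Hypothetical post-sync check: recompute each project's collective hash
    and regenerate computed files when it no longer matches the stored value."""
    for project in projects:
        new_hash = collective_hash(
            [original_file.hash for original_file in project.original_files.all()]
        )
        if new_hash != project.combined_hash:
            # The project's files changed since the last sync
            regenerate_computed_files(project)  # hypothetical regeneration helper
            project.combined_hash = new_hash
            project.save(update_fields=["combined_hash"])

            # Datasets built from the project may be affected as well; they could
            # either be regenerated outright or re-hashed and compared the same way
            for dataset in project.datasets.all():
                regenerate_computed_files(dataset)
```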

Another Approach: Leveraging OriginalFile relations

An alternative approach would be to add a "tainted" attribute to the OriginalFile model, updated when we run the sync_original_files management command: if the hash of a file in s3 doesn't match the hash of that file in the DB, we would set OriginalFile::tainted=True. After the syncing of original files is complete, we would query all Projects and Datasets associated with the tainted original files and regenerate all affected computed files. After the last computed file has been regenerated, we would come back and set OriginalFile::tainted=False.
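
A minimal sketch of this flag-based flow, again with hypothetical Django field, relation, and helper names (OriginalFile.tainted, original_files reverse relations, regenerate_computed_files), assuming the OriginalFile, Project, and Dataset models are importable:

```python
def mark_tainted(original_file, s3_etag):
    """During sync_original_files: flag a file whose s3 hash differs from the DB hash."""
    if original_file.hash != s3_etag:
        original_file.hash = s3_etag
        original_file.tainted = True
        original_file.save(update_fields=["hash", "tainted"])


def regenerate_affected():
    """After syncing: regenerate computed files for every Project and Dataset
    touched by a tainted file, then clear the flag."""
    tainted_files = OriginalFile.objects.filter(tainted=True)
    affected_projects = Project.objects.filter(original_files__in=tainted_files).distinct()
    affected_datasets = Dataset.objects.filter(original_files__in=tainted_files).distinct()

    for obj in [*affected_projects, *affected_datasets]:
        regenerate_computed_files(obj)  # hypothetical regeneration helper

    # Only after the last computed file has been regenerated is the flag reset
    tainted_files.update(tainted=False)
```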

Footnotes

  1. https://www.linkedin.com/pulse/how-aws-s3-e-tags-work-marco-rizk-iefwf

@avrohomgottlieb avrohomgottlieb changed the title Clarify Hashing Mechanism with OriginalFile, Project and Dataset Models Discussion - Clarify Hashing Mechanism with OriginalFile, Project and Dataset Models Dec 16, 2024
@avrohomgottlieb
Contributor Author

This issue is being closed as the discussion has been concluded. Implementation steps are described in issue #1030.
