
Discussion - Clarify Hashing Mechanism with OriginalFile, Project and Dataset Models #1027

Closed
avrohomgottlieb opened this issue Dec 12, 2024 · 1 comment

@avrohomgottlieb
Contributor

avrohomgottlieb commented Dec 12, 2024

Context

At the conclusion of Epic #1026, the OriginalFile model will be created and integrated into the codebase.

Computed files comprise the archives of data and metadata files for the samples within a project or dataset. Over time, data and metadata files are updated, forcing computed files to be regenerated as the old ones become inaccurate.

Problem or idea

We would like to know when computed files must be regenerated. Here are two possible approaches to handling this.

First Option: Collective File Hashing

One way to verify whether data files have been altered is to compute a collective hash of all data files associated with a project or dataset, and compare that stored value to a newly computed hash every time original files are synced with the s3 input bucket. Should the hashes not match, some computed files would need to be regenerated.

How AWS calculates a collective hash 1

It seems like the simplest way to compute these collective hashes would be to do what AWS does. AWS creates a 32-character hash, called an ETag, for every file uploaded to s3 so that it can tell when a file has been modified. When a file's size is above a certain threshold (see the linked article for details), AWS splits the file into chunks and uploads the chunks one at a time. Each chunk is hashed as it is received, and once all chunks are received AWS concatenates the chunk hashes and generates a new hash for the entire file from that concatenation (appending a -# to the end of the hash to indicate how many chunks the file was uploaded in). Concatenating the hash values of all files inside a given computed file, hashing that value, and then comparing the new hash to the old hash would be a quick way for us to ascertain whether or not computed files need to be regenerated.
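
As a rough illustration, here is a minimal Python sketch of how an S3-style multipart ETag and a collective hash over per-file hashes could be computed. The chunk size and helper names are assumptions for illustration, not the portal's actual implementation:

```python
import hashlib

# Assumed chunk size for illustration; AWS's actual part size varies by upload.
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB


def multipart_etag(file_path: str, chunk_size: int = CHUNK_SIZE) -> str:
    """Reproduce an S3-style ETag: the MD5 of the file for single-part uploads,
    or the MD5 of the concatenated per-chunk MD5 digests with a "-N" suffix."""
    chunk_digests = []
    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            chunk_digests.append(hashlib.md5(chunk).digest())
    if not chunk_digests:
        return hashlib.md5(b"").hexdigest()  # empty file
    if len(chunk_digests) == 1:
        return chunk_digests[0].hex()
    combined = hashlib.md5(b"".join(chunk_digests)).hexdigest()
    return f"{combined}-{len(chunk_digests)}"


def collective_hash(file_hashes: list[str]) -> str:
    """Concatenate per-file hashes in a stable order and hash the result."""
    return hashlib.md5("".join(sorted(file_hashes)).encode()).hexdigest()
```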

When to calculate the collective hashes

As touched upon above, the most logical time to do this seems to be at the conclusion of syncing the OriginalFile table with the s3 input bucket. The most efficient approach would be to hash all of a project's original files and compare the result to the hash currently stored on the project instance. If a project's files have been altered, then we should regenerate the project's computed files. We could also query all datasets associated with the project and either regenerate them as well, or hash their file contents and compare the new hash to their stored hash to see whether they were affected.
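
For illustration, a minimal sketch of what that post-sync check could look like, reusing the collective_hash helper sketched above. The Django-style relation and field names here (Project.combined_hash, OriginalFile.hash, project.original_files, project.datasets) and the regenerate_computed_files helper are hypothetical, not the actual models:

```python
def check_projects_after_sync(projects):
    """Hypothetical post-sync check: recompute each project's collective hash
    and regenerate computed files when it no longer matches the stored value."""
    for project in projects:
        new_hash = collective_hash(
            [original_file.hash for original_file in project.original_files.all()]
        )
        if new_hash != project.combined_hash:
            # The project's files changed since the last sync
            regenerate_computed_files(project)  # hypothetical regeneration helper
            project.combined_hash = new_hash
            project.save(update_fields=["combined_hash"])

            # Datasets built from the project may be affected as well; they could
            # either be regenerated outright or re-hashed and compared the same way
            for dataset in project.datasets.all():
                regenerate_computed_files(dataset)
```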

Another Approach: Leveraging OriginalFile relations

An alternative approach would be to add a "tainted" attribute to the OriginalFile model, updated when we run the sync_original_files management command: if the hash of a file in s3 doesn't match the hash of that file in the DB, we would set OriginalFile::tainted=True. After the syncing of original files is complete, we would query all Projects and Datasets associated with the tainted original files and regenerate all affected computed files. After the last computed file has been regenerated, we would come back and set OriginalFile::tainted=False.
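
A minimal sketch of this flag-based flow, again with hypothetical Django field, relation, and helper names (OriginalFile.tainted, original_files reverse relations, regenerate_computed_files), assuming the OriginalFile, Project, and Dataset models are importable:

```python
def mark_tainted(original_file, s3_etag):
    """During sync_original_files: flag a file whose s3 hash differs from the DB hash."""
    if original_file.hash != s3_etag:
        original_file.hash = s3_etag
        original_file.tainted = True
        original_file.save(update_fields=["hash", "tainted"])


def regenerate_affected():
    """After syncing: regenerate computed files for every Project and Dataset
    touched by a tainted file, then clear the flag."""
    tainted_files = OriginalFile.objects.filter(tainted=True)
    affected_projects = Project.objects.filter(original_files__in=tainted_files).distinct()
    affected_datasets = Dataset.objects.filter(original_files__in=tainted_files).distinct()

    for obj in [*affected_projects, *affected_datasets]:
        regenerate_computed_files(obj)  # hypothetical regeneration helper

    # Only after the last computed file has been regenerated is the flag reset
    tainted_files.update(tainted=False)
```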

Footnotes

  1. https://www.linkedin.com/pulse/how-aws-s3-e-tags-work-marco-rizk-iefwf

@avrohomgottlieb avrohomgottlieb changed the title Clarify Hashing Mechanism with OriginalFile, Project and Dataset Models Discussion - Clarify Hashing Mechanism with OriginalFile, Project and Dataset Models Dec 16, 2024
@avrohomgottlieb
Contributor Author

This issue is being closed as the discussion has been concluded. Implementation steps are described in issue #1030.
