Add sampling task #3657
Codecov Report

Attention: patch coverage is 0.00% -- none of the 59 new lines are covered by tests (hits are unchanged while misses grow by 59).

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #3657     +/-   ##
===========================================
- Coverage    69.64%   69.08%    -0.56%
===========================================
  Files           54       54
  Lines         7339     7398       +59
===========================================
  Hits          5111     5111
- Misses        2228     2287       +59
```

View full report in Codecov by Sentry.
```python
).e_tag.strip('"')
```

```python
# obtain a sample of size n
links = Link.objects.filter(cached_can_play_back=True).order_by("?")[:n]
```
I know we talked about `cached_can_play_back=True` offline, but in working on #3656, I realized that we DO in fact have a set of links where `cached_can_play_back` is wrong. Those links need their metadata fixed. They will cause problems here.
Ah, thanks! I'll take a look at other options for filtering.
I think we may have a few WACZ files where `wacz_size` is zero or not set?
> not set

If this is true, I can't account for it, and I think it's a bug.

> zero

When uploading a replacement WARC, the `wacz_size` field is zeroed, since the WACZ has been superseded.

Hmmm, this is an interesting challenge.
So the mirror might well have a WACZ file that the application doesn't consider a thing anymore. I think the WACZ in question would remain in the bucket as well? I'm not sure this makes a difference for this task; probably Perma's sense of what exists should be the ground truth, not what's in the bucket.
Right, when things are replaced, the object remains in the bucket, I think on the principle that deleting things from the bucket is dangerous.
This is amazing, thanks for the thorough explanation of the math!

As mentioned, I think either this doesn't want to be run until we fix the Perma Links that have an incorrect value for `cached_can_play_back`, or this code wants to be switched to instead check any links where either `warc_size` or `wacz_size` is truthy. That may be more correct: `cached_can_play_back` is also `False` if `link.user_deleted`... and I'm not sure, but I think deleted archives will still be backed up to the mirror, in the normal course of business.
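A minimal sketch of that alternative filter, assuming `warc_size` and `wacz_size` are integer fields on `Link` (field names taken from this thread; `n` as in the task), with `__gt=0` standing in for "truthy":

```python
from django.db.models import Q

from perma.models import Link

# Sample from links that have (or have had) a WARC or WACZ, rather
# than relying on cached_can_play_back, which is also False for
# user-deleted links.
links = (
    Link.objects
    .filter(Q(warc_size__gt=0) | Q(wacz_size__gt=0))
    .order_by("?")[:n]
)
```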
Code looks great! I mentioned a utility function a person could use for calculating etags.
I have no opinion on the propriety or impropriety of generating a script to run.... Could be a fun topic for discussion! Since the script in question would be read-only, strikes me as low-risk.
```python
if o[archive]["path"]:
    for d in directories:
        full_path = Path(d) / "generated" / o[archive]["path"]
        if full_path.exists():
```
We have a utility function for calculating etags now, if you want: `calculate_s3_etag(fp, chunk_size, multipart_format=False)` in `perma/perma_web/perma/utils.py` (line 574 at 06d7f73).
It handles multipart uploads, which... I don't think we have any of, but I am not sure I remember.
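For reference, an S3-style etag can be computed roughly like this (a sketch of the general technique, not the actual utility; the default chunk size is an assumption):

```python
import hashlib

def s3_etag_sketch(fp, chunk_size=8 * 1024 * 1024, multipart=False):
    # Single-part uploads: the etag is simply the hex MD5 of the object.
    if not multipart:
        digest = hashlib.md5()
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
        return digest.hexdigest()
    # Multipart uploads: MD5 each part, MD5 the concatenation of the
    # parts' binary digests, and append "-<number of parts>".
    part_digests = [
        hashlib.md5(chunk).digest()
        for chunk in iter(lambda: fp.read(chunk_size), b"")
    ]
    combined = hashlib.md5(b"".join(part_digests))
    return f"{combined.hexdigest()}-{len(part_digests)}"
```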
More on that: see `perma/perma_web/perma/celery_tasks.py`, line 1403 at 06d7f73: `# retrieve S3's md5-hash-like "etag" of file it has`
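Retrieving an etag from S3 looks something like the following (a hedged sketch with boto3; the bucket name and key are placeholders, not Perma's actual values):

```python
import boto3

s3 = boto3.client("s3")

# HEAD the object and read its etag; S3 returns the value wrapped in
# double quotes, hence the strip, as in the diff above.
head = s3.head_object(Bucket="example-bucket", Key="generated/example.wacz")
etag = head["ETag"].strip('"')
```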
Yes, I meant to say something in the PR description about the possibility of multipart uploads, but forgot. Thanks. The reason I'm doing this inline and not using the function in utils is that this code will not run in Perma's context; though it strikes me that in `sample_objects` I might be able to:

```python
import inspect

from perma.utils import calculate_s3_etag

# and then when outputting the script
# ...
f.write(inspect.getsource(calculate_s3_etag))
# ...
```

Let me try that, I think....
Ah, of course...
Separately, I've been considering removing the HistoricalLink model, as it is enormous, never gets used, and is unwieldy for backups and database upgrades -- but it strikes me that if the question is not "What links does Perma think have WARC or WACZ files" but "What links does Perma think have ever had WARC or WACZ files", the answer might be in HistoricalLink.
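If that becomes the question, the query might look something like this (purely hypothetical: it assumes `HistoricalLink` is a standard Django history table mirroring `Link`'s fields):

```python
from django.db.models import Q

from perma.models import HistoricalLink  # assumed location

# GUIDs of links whose history ever recorded a nonzero WARC or WACZ
# size; field names are assumptions carried over from Link.
guids = (
    HistoricalLink.objects
    .filter(Q(warc_size__gt=0) | Q(wacz_size__gt=0))
    .values_list("guid", flat=True)
    .distinct()
)
```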
After discussion on Slack, I'm hoping to get links and objects in the same way, but additionally get replaced objects by getting...
Ah yes, @rebeccacremona has already pointed out...
f5bb51f to 6ccf8aa (compare)
This is an experiment in asserting completeness of a backup of Perma WARC and WACZ files. We've done
something like this before, by enumeration; one brute-force approach is to enumerate all objects in the S3 buckets and record their etags, then see if any are not present on the mirror, or present but not with a matching etag.
Another approach is to enumerate, from Perma's database, all the artifacts that should be present, get their etags from S3, etc. Enumeration obviously becomes a bigger and bigger task as the collection grows, involves redoing work, and I think is somewhat fragile.
My idea is to assert completeness statistically. My first basic notion was something like "The null hypothesis is that at least 1 in x files do not make it to the mirror. To disprove the null hypothesis, you'd sample n in x captured files and, not seeing a failure, show that the chance of not seeing a failure in the sample, given the null hypothesis, falls below the typical value, a little under two percent."
On rereading part of my intro stats book, I think a better way of putting this is in terms of a sampling distribution model for a proportion. The proportion here is the proportion of artifacts that are not present on the mirror; the sampling distribution is the distribution of that proportion across all possible samples (which you could approximate by simulation). The basic idea is that it is normally distributed and so can be characterized like this:
"Provided that the sampled values are independent and the sample size (n) is large enough, the sampling distribution model of p̂ (observed proportion in the sample) is modeled by a Normal model with mean(p̂) = p (model parameter aka hypothesized proportion) and standard deviation of p̂ = the square root of (p * (1 - p)) / n."
The provisos (independent, large enough) translate to the randomization condition (sample is chosen randomly), the 10% condition (sample size is less than 10% of the population), and the success/failure condition (the sample size has to be big enough to expect at least ten successes and ten failures). That is, we can't test for a failure rate of one in five million (roughly, the population size), as we couldn't generate a sample size large enough in any case, let alone keeping it under 10% of the population.
A tenth of five million is 500,000 -- that seems like way too many to test. Let's say n is 1,000, well under 10% of the population. Then, let's pick proportions for success and failure that will produce ten failures: a 99% success rate and a 1% failure rate would produce ten failures and 990 successes in a sample of 1,000. The standard deviation is √((0.99 × 0.01) / 1000) ≈ 0.00314642654451, or about a third of a percentage point (our unit of measure).
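That arithmetic is easy to double-check in a few lines of Python:

```python
from math import sqrt

p = 0.01   # hypothesized failure proportion
n = 1000   # sample size

sd = sqrt(p * (1 - p) / n)
print(sd)                # ~0.00314642654451

# z-score for, say, one observed failure in the sample of 1,000:
p_hat = 1 / n
print((p_hat - p) / sd)  # ~ -2.86
```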
I expect that in a random sample of 1,000 links, the proportion of failures will be at least two standard deviations below the nominal 1% rate we're assuming here -- that is, that less than 2.5% of the time would a sample of this size, drawn from a population with a true failure proportion of 1%, show fewer than four failures. In fact, I'm expecting our random sample to have one or zero failures, an outcome with only about a one percent chance of occurring if the population actually had a 1% failure rate.
This PR adds an invoke task to generate a sample of Links, outputting object paths and etags. It generates a Python script to be run on the mirror, including the sample. That script compares the items in the sample with what is on disk, and reports successes, failures, standard deviation, and z-score; I think a second cut at this will add the probability of seeing the proportion in the sample, probably using `NormalDist` from `statistics`, but I want to sanity-check this with a probability table first. :)

I am not sure about the pattern of generating a Python script as output; if that seems weird, I could generate the sample as JSON and keep the script somewhere else, maybe in `services/` in this repo. I'm not even that sure about generating the data and putting it in `/tmp`, but we do that in other tasks.

It is quite possible this is wrong. If any of this is confusing, please let me know!
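For what it's worth, that second cut might look something like this sketch using only the standard library (the numbers match the worked example above):

```python
from statistics import NormalDist

p, n = 0.01, 1000
sd = (p * (1 - p) / n) ** 0.5

# Probability, under the Normal model for the sampling distribution,
# of observing a proportion at or below p_hat (one failure in 1,000):
p_hat = 1 / n
prob = NormalDist(mu=p, sigma=sd).cdf(p_hat)
print(prob)  # ~0.002
```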