Add sampling task #3657
Codecov Report

Attention: patch coverage is 0.00% -- none of the 59 new lines are covered by tests (hits are unchanged while misses grow by 59).

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #3657     +/-   ##
===========================================
- Coverage    69.64%   69.08%    -0.56%
===========================================
  Files           54       54
  Lines         7339     7398       +59
===========================================
  Hits          5111     5111
- Misses        2228     2287       +59
```

View full report in Codecov by Sentry.
```python
).e_tag.strip('"')
```

```python
# obtain a sample of size n
links = Link.objects.filter(cached_can_play_back=True).order_by("?")[:n]
```
I know we talked about `cached_can_play_back=True` offline, but in working on #3656, I realized that we DO in fact have a set of links where `cached_can_play_back` is wrong. Those links need their metadata fixed. They will cause problems here.
Ah, thanks! I'll take a look at other options for filtering.
I think we may have a few WACZ files where `wacz_size` is zero or not set?
> not set

If this is true, I can't account for it, and I think it's a bug.

> zero

When uploading a replacement WARC, the `wacz_size` field is zeroed, since the WACZ has been superseded.

Hmmm, this is an interesting challenge.
So the mirror might well have a WACZ file that the application doesn't consider a thing anymore. I think the WACZ in question would remain in the bucket as well? I'm not sure this makes a difference for this task; probably Perma's sense of what exists should be the ground truth, not what's in the bucket.
Right, when things are replaced, the object remains in the bucket, I think on the principle that deleting things from the bucket is dangerous.
This is amazing, thanks for the thorough explanation of the math!

As mentioned, I think either this doesn't want to be run until we fix the Perma Links that have an incorrect value for `cached_can_play_back`, or this code wants to be switched to instead check any links where either `warc_size` or `wacz_size` is truthy. That may be more correct: `cached_can_play_back` is also `False` if `link.user_deleted`... and I'm not sure, but I think deleted archives will still be backed up to the mirror, in the normal course of business.
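A minimal sketch of that alternative filter, assuming `warc_size` and `wacz_size` are integer fields on `Link` (field names taken from this thread; `n` as in the task), with `__gt=0` standing in for "truthy":

```python
from django.db.models import Q

from perma.models import Link

# Sample from links that have (or have had) a WARC or WACZ, rather
# than relying on cached_can_play_back, which is also False for
# user-deleted links.
links = (
    Link.objects
    .filter(Q(warc_size__gt=0) | Q(wacz_size__gt=0))
    .order_by("?")[:n]
)
```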
Code looks great! I mentioned a utility function a person could use for calculating etags.
I have no opinion on the propriety or impropriety of generating a script to run.... Could be a fun topic for discussion! Since the script in question would be read-only, strikes me as low-risk.
```python
if o[archive]["path"]:
    for d in directories:
        full_path = Path(d) / "generated" / o[archive]["path"]
        if full_path.exists():
```
We have a utility function for calculating etags now, if you want: `calculate_s3_etag(fp, chunk_size, multipart_format=False)` in `perma/perma_web/perma/utils.py` (line 574 at 06d7f73).
It handles multipart uploads, which... I don't think we have any of, but I am not sure I remember.
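For reference, an S3-style etag can be computed roughly like this (a sketch of the general technique, not the actual utility; the default chunk size is an assumption):

```python
import hashlib

def s3_etag_sketch(fp, chunk_size=8 * 1024 * 1024, multipart=False):
    # Single-part uploads: the etag is simply the hex MD5 of the object.
    if not multipart:
        digest = hashlib.md5()
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
        return digest.hexdigest()
    # Multipart uploads: MD5 each part, MD5 the concatenation of the
    # parts' binary digests, and append "-<number of parts>".
    part_digests = [
        hashlib.md5(chunk).digest()
        for chunk in iter(lambda: fp.read(chunk_size), b"")
    ]
    combined = hashlib.md5(b"".join(part_digests))
    return f"{combined.hexdigest()}-{len(part_digests)}"
```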
More on that: see `perma/perma_web/perma/celery_tasks.py`, line 1403 at 06d7f73: `# retrieve S3's md5-hash-like "etag" of file it has`
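Retrieving an etag from S3 looks something like the following (a hedged sketch with boto3; the bucket name and key are placeholders, not Perma's actual values):

```python
import boto3

s3 = boto3.client("s3")

# HEAD the object and read its etag; S3 returns the value wrapped in
# double quotes, hence the strip, as in the diff above.
head = s3.head_object(Bucket="example-bucket", Key="generated/example.wacz")
etag = head["ETag"].strip('"')
```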
Yes, I meant to say something in the PR description about the possibility of multipart uploads, but forgot. Thanks. The reason I'm doing this inline and not using the function in utils is that this code will not run in Perma's context; though it strikes me that in `sample_objects` I might be able to:

```python
import inspect

from perma.utils import calculate_s3_etag

# and then when outputting the script
# ...
f.write(inspect.getsource(calculate_s3_etag))
# ...
```

Let me try that, I think....
Ah, of course...
Separately, I've been considering removing the HistoricalLink model, as it is enormous, never gets used, and is unwieldy for backups and database upgrades -- but it strikes me that if the question is not "What links does Perma think have WARC or WACZ files" but "What links does Perma think have ever had WARC or WACZ files", the answer might be in HistoricalLink.
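If that becomes the question, the query might look something like this (purely hypothetical: it assumes `HistoricalLink` is a standard Django history table mirroring `Link`'s fields):

```python
from django.db.models import Q

from perma.models import HistoricalLink  # assumed location

# GUIDs of links whose history ever recorded a nonzero WARC or WACZ
# size; field names are assumptions carried over from Link.
guids = (
    HistoricalLink.objects
    .filter(Q(warc_size__gt=0) | Q(wacz_size__gt=0))
    .values_list("guid", flat=True)
    .distinct()
)
```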
After discussion on Slack, I'm hoping to get links and objects in the same way, but additionally get replaced objects by getting...
Ah yes, @rebeccacremona has already pointed out...
f5bb51f to 6ccf8aa (compare)
This is an experiment in asserting completeness of a backup of Perma WARC and WACZ files. We've done
something like this before, by enumeration; one brute-force approach is to enumerate all objects in the S3 buckets and record their etags, then see if any are not present on the mirror, or present but not with a matching etag.
Another approach is to enumerate, from Perma's database, all the artifacts that should be present, get their etags from S3, etc. Enumeration obviously becomes a bigger and bigger task as the collection grows, involves redoing work, and I think is somewhat fragile.
My idea is to assert completeness statistically. My first basic notion was something like "The null hypothesis is that at least 1 in x files do not make it to the mirror. To disprove the null hypothesis, you'd sample n in x captured files and, not seeing a failure, show that the chance of not seeing a failure in the sample, given the null hypothesis, falls below the typical value, a little under two percent."
On rereading part of my intro stats book, I think a better way of putting this is in terms of a sampling distribution model for a proportion. The proportion here is the proportion of artifacts that are not present on the mirror; the sampling distribution is the distribution of that proportion across all possible samples (which you could approximate by simulation). The basic idea is that it is normally distributed and so can be characterized like this:
"Provided that the sampled values are independent and the sample size (n) is large enough, the sampling distribution model of p̂ (observed proportion in the sample) is modeled by a Normal model with mean(p̂) = p (model parameter aka hypothesized proportion) and standard deviation of p̂ = the square root of (p * (1 - p)) / n."
The provisos (independent, large enough) translate to the randomization condition (sample is chosen randomly), the 10% condition (sample size is less than 10% of the population), and the success/failure condition (the sample size has to be big enough to expect at least ten successes and ten failures). That is, we can't test for a failure rate of one in five million (roughly, the population size), as we couldn't generate a sample size large enough in any case, let alone keeping it under 10% of the population.
A tenth of five million is 500,000 -- that seems like way too many to test. Let's say n is 1,000, well under 10% of the population. Then, let's pick proportions for success and failure that will produce ten failures: a 99% success rate and a 1% failure rate would produce ten failures and 990 successes in a sample of 1,000. The standard deviation is √((0.99 × 0.01) / 1000) ≈ 0.00314642654451, or about a third of a percentage point (our unit of measure).
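That arithmetic is easy to double-check in a few lines of Python:

```python
from math import sqrt

p = 0.01   # hypothesized failure proportion
n = 1000   # sample size

sd = sqrt(p * (1 - p) / n)
print(sd)                # ~0.00314642654451

# z-score for, say, one observed failure in the sample of 1,000:
p_hat = 1 / n
print((p_hat - p) / sd)  # ~ -2.86
```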
I expect that in a random sample of 1,000 links, the proportion of failures will be at least two standard deviations below the nominal 1% rate we're assuming here -- that is, that less than 2.5% of the time would a sample of this size, drawn from a population with a true failure proportion of 1%, show fewer than four failures. In fact, I'm expecting our random sample to have one or zero failures, an outcome with only about a one percent chance of occurring if the population actually had a 1% failure rate.
This PR adds an invoke task to generate a sample of Links, outputting object paths and etags. It generates a Python script to be run on the mirror, including the sample. That script compares the items in the sample with what is on disk, and reports successes, failures, standard deviation, and z-score; I think a second cut at this will add the probability of seeing the proportion in the sample, probably using `NormalDist` from `statistics`, but I want to sanity-check this with a probability table first. :)

I am not sure about the pattern of generating a Python script as output; if that seems weird, I could generate the sample as JSON and keep the script somewhere else, maybe in `services/` in this repo. I'm not even that sure about generating the data and putting it in `/tmp`, but we do that in other tasks.

It is quite possible this is wrong. If any of this is confusing, please let me know!
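For what it's worth, that second cut might look something like this sketch using only the standard library (the numbers match the worked example above):

```python
from statistics import NormalDist

p, n = 0.01, 1000
sd = (p * (1 - p) / n) ** 0.5

# Probability, under the Normal model for the sampling distribution,
# of observing a proportion at or below p_hat (one failure in 1,000):
p_hat = 1 / n
prob = NormalDist(mu=p, sigma=sd).cdf(p_hat)
print(prob)  # ~0.002
```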