First round of spot checks on SPs to validate Spark scores #226
An example problem we spotted: in Spark, we request only the root block, by adding […]. We noticed that some SPs are serving only requests including […].
Downloading whole files for spot checks would work, but what do you think about using range requests for spot checks too? I experimented today with different providers and ways of retrieving content from them, and found out that some of them do offer range retrievals. The great thing about these range retrievals was that I was still able to verify the root CID even though I had retrieved only a range of bytes. Of course, some providers returned only the root block even though I had requested a range of bytes. I think doing both would help us map which retrieval methods storage providers offer apart from serving the root block. Also, I am a bit uncertain whether this should be a separate project or whether we could add it as a new script to spark (similar to how we already have manual-check.js).
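For illustration, here is a minimal sketch of what such a verifiable range retrieval could look like, assuming the provider's HTTP endpoint supports the `dag-scope` and `entity-bytes` parameters from the IPFS Trustless Gateway spec (the provider URL and root CID are placeholders, and provider support for these parameters is exactly what the experiment above found to be inconsistent):

```js
// Sketch: fetch a byte range of a file as a CAR slice from a provider's
// trustless-gateway endpoint. The rootCid and provider values below are
// hypothetical placeholders.
const rootCid = 'bafybei...' // root CID of the deal payload under test
const provider = 'http://provider.example' // provider's HTTP address

const url = `${provider}/ipfs/${rootCid}?format=car&dag-scope=entity&entity-bytes=0:1048575`
const res = await fetch(url, {
  headers: { Accept: 'application/vnd.ipld.car' }
})
if (!res.ok) throw new Error(`retrieval failed: ${res.status}`)

// The response is a CAR stream: every block in it can be hashed and checked
// against the CIDs in the DAG, which is why even a partial byte range stays
// verifiable against the root CID.
const car = new Uint8Array(await res.arrayBuffer())
console.log(`received ${car.byteLength} bytes of CAR data`)
```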
Great idea, and this is really the end goal of Spark, IMO. We don't want to retrieve the entire file when doing checks, but we do need to check that you are able to retrieve more than just the root block. Range requests allow us to retrieve small chunks of files while preserving verifiability and without inflicting huge amounts of egress on the SP.
Cool, this could also be a test flight for retrieving byte ranges that could potentially make its way up to the Spark protocol.
Are we concerned that spot checking with range requests makes the data less comparable to regular Spark data, which does not use range requests? E.g., could one provider offer full file download but not range requests? The counter-argument: Spark checks the first block only, so this retrieval behavior is already different. Also, @pyropy argues that if they offer full-body retrieval, they also offer range retrieval (they want ranges because they lead to less egress, but ranges are harder to cache).
I'm going to lay out my implementation plan and reasoning here.

**Why a new project?**

I had doubts about whether the Spark spot check should be part of the spark project or live as its own thing. I decided on the latter because I wanted to keep the Spark project nice and simple and not cause too many changes there, as it's running on the client side.

**Why Golang?**

This utility started off as a simple shell script that used lassie to perform the retrieval. At some point I figured out that […]. Note that the original shell script started as a small experiment and this is just an iteration on top of it. This is not a final thing, and if needed we can rewrite it in a different language (JS or other).

**Implementation**

The implementation is rather simple and similar to what we are already doing in Spark. We fetch the specified (or current) Spark round and pick a number of random checks to execute. For each check we fetch the miner peer info from cid.contact, and then we use […].

**How to run**

Make sure you have Go installed on your machine. You can download it from here. Once you have Go installed, you can run the following commands to build and run the project:

```sh
go build -o spark-spot-check
```

This will create a binary called `spark-spot-check`, which you can run with:

```sh
./spark-spot-check check --checks <number_of_checks> --round <round_number>
```

Full list of available options: […]
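For reference, here is a minimal sketch of the cid.contact lookup step described above, written in JavaScript rather than Go for consistency with the rest of the thread. The response field names follow the public IPNI find API, but treat them as assumptions rather than a contract:

```js
// Sketch: look up retrieval providers for a payload CID via cid.contact,
// the IPNI instance mentioned above. Field names (MultihashResults,
// ProviderResults) follow the public IPNI find API but are assumptions here.
async function findProviders (cid) {
  const res = await fetch(`https://cid.contact/cid/${cid}`)
  if (res.status === 404) return [] // no providers indexed for this CID
  if (!res.ok) throw new Error(`IPNI query failed with status ${res.status}`)
  const body = await res.json()
  // Each provider result carries the peer ID, its multiaddrs, and metadata
  // describing the supported transfer protocols (bitswap/graphsync/http).
  return body.MultihashResults.flatMap(r => r.ProviderResults)
}
```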
As discussed with @pyropy in our 1:1: other reasons not to add this code to the spark repo are that we would have to change a few things: […]
As discussed with @pyropy in our 1:1 as well: Srdjan wasn't aware of Zinnia's Lassie integration, without which it wouldn't really have made sense to write it in JS/Zinnia (no GraphSync support, for example). Now that we know we can use Zinnia, Srdjan's plan is to recreate this PR on top of it.

Question: should this implementation live in a separate repo, or in the spark repo?

If separate: […]

But: […]

If same repo: […]

My personal vote: […]
**Intro**

After a call with @juliangruber we decided to move away from the Go implementation for the sake of keeping our codebase simple and reusing the parts we have already written. This led to a second implementation of the spot checker that is based on the existing Spark checker module.

**Implementation**

The new implementation relies heavily on the existing Spark implementation. Most of the codebase remains the same (IPNI and miner info lookup, helper functions, etc.), with some changes to the core modules of Spark, namely the Tasker and the Spark module itself.

The Tasker module is now used to fetch retrieval tasks for a specified round (or the current round if no round number is explicitly given). The Spark module is renamed to SpotChecker. Similarly to how Spark works, this module requests retrieval tasks from the Tasker and performs retrieval checks on those tasks. The difference between Spark and SpotChecker retrieval is that Spark only tries to retrieve the root block of the file, while SpotChecker tries to retrieve the whole file (which can be up to 32GB in size).

**Program flow**

The main difference between the Spark and SpotChecker program flow is that Spark runs continuously, while SpotChecker exits once the execution of all retrieval tasks has finished. Another difference is that for spot checks the Tasker randomly samples tasks, while the Tasker in Spark filters for the tasks that are closest to the station id. And the final difference is that Spark only retrieves the root block, while the spot check tries to perform a full download of the file. Below you can find a program flow diagram describing the execution of Spark Spot Check.
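A rough sketch of that flow in JavaScript; `getRound`, `sampleTasks`, and `fullRetrieval` are hypothetical stand-ins for the real Spark/SpotChecker modules, not their actual API:

```js
// Sketch of the SpotChecker flow described above. The helper names are
// hypothetical placeholders, not the actual Spark module API.
async function runSpotCheck ({ roundNumber, checks }) {
  const round = await getRound(roundNumber) // specified or current round
  const tasks = sampleTasks(round.retrievalTasks, checks) // random sample

  const results = []
  for (const task of tasks) {
    // Unlike Spark, which fetches only the root block, the spot check
    // attempts to download the whole file behind the root CID.
    const result = await fullRetrieval(task.cid, task.minerId)
    results.push({ ...task, ...result })
  }

  console.table(results) // print and exit; SpotChecker does not loop
}
```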
Nice that both are supported 👍
I thought SpotChecker would perform a range request? Sorry if I'm missing some current decisions.
Would it make sense for the spot checker to perform all the tasks instead of randomly selecting them?
I opted to do full retrieval in the first iteration for simplicity, as I am not familiar with graphsync capabilities (and we still somewhat depend on graphsync). I will probably try range requests again as part of the spot check, maybe even in this current iteration.
I think it would; at the least, we could do spot checks on all tasks by default and have an option to run some number of checks randomly.
Let's keep it as simple as possible for the first iteration and just do a full retrieval of all tasks. If we are concerned about the spot check taking too long, what about adding an option […]?
I think that's a better idea, we'll go with that.
I have since changed the spot checks to perform a byte range check by default. The reason for doing so is that we avoid holding the whole piece file in memory (which can be almost 32GB in size). We can further improve this by getting the piece size and randomising the range we fetch while keeping its size to a sane amount (right now the range is capped at 200MB, starting at byte 0 and ending at byte 200MB).
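A small sketch of that suggested improvement, assuming the piece size is known beforehand; the 200MB cap comes from the comment above, while the helper name is hypothetical:

```js
// Sketch: pick a random, capped byte range within a piece, per the idea
// above. MAX_RANGE mirrors the 200MB cap mentioned in this thread.
const MAX_RANGE = 200 * 1024 * 1024 // 200MB

function pickRandomRange (pieceSize) {
  const length = Math.min(MAX_RANGE, pieceSize)
  // Choose a random start offset so the range still fits inside the piece.
  const maxStart = pieceSize - length
  const start = Math.floor(Math.random() * (maxStart + 1))
  return { start, end: start + length - 1 } // inclusive, Range-header style
}

// Example: a hypothetical 8GB piece
console.log(pickRandomRange(8 * 1024 ** 3))
// => e.g. { start: 4412098311, end: 4621813510 }
```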
Currently we only print the results of spot checks without actually evaluating them. Should there be some mechanism that evaluates the results of spot checks so that storage providers' RSR scores are actually affected by them? Maybe introducing some other score would also be an option.
Since we trust the spot-checking node, I believe evaluation isn't possible. Also, we perform just one retrieval per deal, so the RSR per SP would most likely be either 100% or 0%.
Let's keep the first iteration simple. As far as I am concerned, printing the results of spot checks is good enough 👍🏻 If you want to make this more useful, format the output to make it easy to process further - e.g. copy&pasted to Slack/GitHub discussions.
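One possible shape for that output, as a hypothetical sketch rather than an agreed format: render results as a markdown table that pastes cleanly into Slack or a GitHub discussion (the result fields shown are invented for the example):

```js
// Sketch: format spot-check results as a markdown table for easy pasting
// into GitHub/Slack. The result fields below are hypothetical.
function formatResults (results) {
  const header = '| Miner | CID | Outcome | Bytes | Duration (ms) |'
  const divider = '| --- | --- | --- | --- | --- |'
  const rows = results.map(r =>
    `| ${r.minerId} | ${r.cid} | ${r.outcome} | ${r.bytes} | ${r.durationMs} |`
  )
  return [header, divider, ...rows].join('\n')
}

console.log(formatResults([
  { minerId: 'f0example', cid: 'bafy...', outcome: 'OK', bytes: 209715200, durationMs: 8123 }
]))
```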
We need to be able to (meta-)check the validity of the Spark scores to ensure that SPs and/or Checkers are not exploiting some known or unknown attack vector.
We need a retrieval client that randomly selects a CID or set of CIDs from previous rounds and attempts to retrieve the entire file.
We can even use the complete set of CIDs from a certain round if it helps with the comparison.
We can then compare the results of the spot checks, which are full retrievals, with the Spark results. The closer the Spark score is to the spot-check result, the better Spark is as a heuristic for Filecoin retrievability.
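As a rough illustration of that comparison (the input shapes and metric are invented for the example, not part of the proposal):

```js
// Sketch: compare per-SP spot-check success rates against Spark RSR scores.
// Both inputs are hypothetical maps of minerId -> success ratio in [0, 1].
function compareScores (sparkRsr, spotCheckRsr) {
  const report = []
  for (const [minerId, spark] of Object.entries(sparkRsr)) {
    const spot = spotCheckRsr[minerId]
    if (spot === undefined) continue // no spot check ran for this SP
    report.push({ minerId, spark, spot, delta: Math.abs(spark - spot) })
  }
  // The smaller the deltas, the better Spark works as a retrievability
  // heuristic, per the reasoning above. Sort worst agreement first.
  return report.sort((a, b) => b.delta - a.delta)
}

console.log(compareScores(
  { f0001: 0.95, f0002: 0.10 },
  { f0001: 1.0, f0002: 0.0 }
))
```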
This client can be run locally by the Space Meridian team or deployed to a compute provider. However, if it is deployed somewhere with a fixed IP, SPs may be able to figure out which requests are the spot checks. Running locally may be a better option initially.