First round of spot checks on SPs to validate Spark scores #226

Open
bajtos opened this issue Jan 30, 2025 · 17 comments · May be fixed by CheckerNetwork/spark-spot-check#2
@bajtos
Member

bajtos commented Jan 30, 2025

We need to be able to (meta-)check the validity of the Spark scores to ensure that SPs and/or Checkers are not exploiting some known or unknown attack vector.

We need a retrieval client that randomly selects a CID or set of CIDs from previous rounds and attempts to retrieve the entire file.

We can even use the complete set of CIDs from a certain round if it helps with the comparison.

We can then compare the results of the spot checks (which are full retrievals) with the Spark results. The closer the Spark score is to the spot-check result, the better Spark is as a heuristic for Filecoin retrievability.

This client can be run locally by the Space Meridian team or deployed to a compute provider. However, if it is deployed somewhere with a fixed IP, SPs may be able to figure out which requests are the spot checks. Running locally may be a better option initially.

@bajtos
Member Author

bajtos commented Jan 31, 2025

An example problem we spotted:

In Spark, we request only the root block, by adding ?dag-scope=block to the request.

We noticed that some SPs are serving only requests including ?dag-scope=block. When you remove that parameter and request the entire content, the HTTP response is abruptly aborted after ~100 bytes in a way that triggers a transport-level error in curl.
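For illustration, here is a minimal sketch of the two request shapes being compared, assuming a trustless HTTP gateway; the provider URL and CID below are placeholders, not values from an actual report:

// Hypothetical provider endpoint and root CID, for illustration only.
const providerBaseUrl = 'http://provider.example.com'
const rootCid = 'bafy-root-cid-placeholder'

// Spark-style check: request only the root block, returned as a CAR.
const rootBlockRes = await fetch(
  `${providerBaseUrl}/ipfs/${rootCid}?dag-scope=block`,
  { headers: { Accept: 'application/vnd.ipld.car' } }
)

// Full retrieval: the same URL without dag-scope=block.
// Some SPs abort this response after ~100 bytes with a transport-level error.
const fullRes = await fetch(
  `${providerBaseUrl}/ipfs/${rootCid}`,
  { headers: { Accept: 'application/vnd.ipld.car' } }
)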

@pyropy pyropy self-assigned this Feb 4, 2025
@pyropy

pyropy commented Feb 4, 2025

Downloading whole files for spot check would work, but what do you think about using range requests for spot checks too?

I experimented today with different providers and ways of retrieving content from them, and found that some of them do offer range retrievals. The great thing about these range retrievals is that I was still able to verify the root CID even though I had retrieved only a range of bytes. Of course, some providers returned only the root block even though I had requested a range of bytes.

I think doing both would help us map which retrieval methods storage providers offer apart from serving the root block.
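For reference, one way to express such a verifiable range retrieval against a trustless HTTP gateway is the entity-bytes query parameter (IPIP-402); this is a hedged sketch with placeholder values, and not every provider supports it:

// Hypothetical placeholders, for illustration only.
const providerBaseUrl = 'http://provider.example.com'
const rootCid = 'bafy-root-cid-placeholder'

// Ask for the first 1 MiB of the entity, returned as a CAR so the retrieved
// blocks can still be verified against the root CID.
const rangeRes = await fetch(
  `${providerBaseUrl}/ipfs/${rootCid}?dag-scope=entity&entity-bytes=0:1048575`,
  { headers: { Accept: 'application/vnd.ipld.car' } }
)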

Also, I am a bit uncertain whether this should be a separate project or whether we could add it as a new script to spark (similar to how we already have manual-check.js).

@patrickwoodhead

Downloading whole files for spot check would work, but what do you think about using range requests for spot checks too?

Great idea and this is really the end goal of Spark IMO. We don't want to retrieve the entire file when doing checks but we do need to check that you are able to retrieve more than just the root block. Range requests allow us to retrieve small chunks of files while preserving verifiability but not inflicting huge amounts of egress on the SP.

@pyropy

pyropy commented Feb 5, 2025

Downloading whole files for spot check would work, but what do you think about using range requests for spot checks too?

Great idea and this is really the end goal of Spark IMO. We don't want to retrieve the entire file when doing checks but we do need to check that you are able to retrieve more than just the root block. Range requests allow us to retrieve small chunks of files while preserving verifiability but not inflicting huge amounts of egress on the SP.

Cool, this could also be a test flight for retrieving byte ranges that could potentially make its way up to the spark protocol.

@juliangruber
Member

juliangruber commented Feb 6, 2025

Are we concerned that spot checking with range requests makes the data less comparable to regular SPARK data, which does not use range requests? E.g. could one provider offer full file downloads but not range requests?

The counterargument: Spark checks the first block only, so this retrieval behavior is already different. Also, @pyropy argues that if they offer full-body retrieval, they also offer range retrieval (they want ranges because they lead to less egress, but they are harder to cache).

@pyropy

pyropy commented Feb 6, 2025

I'm going to lay out the implementation plan and my reasoning here.

Why new project?

I was unsure whether I should create the Spark spot check as part of the spark project or whether it should live as its own thing. I decided on the latter because I wanted to keep the Spark project nice and simple, and not cause too many changes there, as it is running on the client side.

Why Golang?

This utility started off as a simple shell script that used lassie to perform the retrieval. At some point I figured out that lassie was failing to resolve the miner peer info (ID and multiaddress) because the heyfil service is no longer hosted by the cid.contact team. Given that, I decided to rewrite the utility in Golang so it can use lassie as a library rather than a CLI tool.

Note that the original shell script started as a small experiment and this is just an iteration on top of it. It is not final, and if needed we can rewrite it in a different language (JS or other).

Implementation

The implementation is rather simple and similar to what we are already doing in Spark. We fetch the specified (or current) Spark round and pick a number of random checks to execute.

For each check we fetch the miner peer info from cid.contact and then use lassie to perform the retrieval (note that all retrievals are done concurrently). All results are then stored in a JSON file that can later be used to analyze the results.

How to run

Make sure you have Go installed on your machine; you can download it from go.dev. Once you have Go installed, you can run the following commands to build and run the project:

go build -o spark-spot-check

This will create a binary called spark-spot-check in the current directory. You can run checks by executing:

./spark-spot-check check --checks <number_of_checks> --round <round_number>

Full list of available options:

NAME:
   spark-spot-check check - Fetches content from the IPFS and Filecoin network

USAGE:
   spark-spot-check check [command options]

OPTIONS:
   --output value, -o value            output file (default: "results.json")
   --checks value, -c value            number of checks to perform per round; if set to -1 it will perform all checks (default: 10)
   --round value, -r value             round number, -1 for current (latest) round (default: -1)
   --meridian-address value, -m value  address of the Meridian smart contract (default: "0x8460766edc62b525fc1fa4d628fc79229dc73031")
   --dag-scope value                   scope of the checks to perform, one of: all, block, range (default: "all")
   --range-start value                 start of the range to check, only used if scope is range (default: 0)
   --range-end value                   end of the range to check, only used if scope is range (default: 0)
   --rpc value                         Filecoin RPC node endpoint (default: "https://api.node.glif.io/rpc/v1")
   --ipni value, -i value              IPNI API endpoint (default: "https://cid.contact")
   --help, -h                          show help

@juliangruber
Member

juliangruber commented Feb 6, 2025

Why new project?

As discussed with @pyropy in our 1:1: another reason not to add this code to the spark repo is that we would have to change a few things:

  • retrieval URL generation
  • a new tasker
  • options added to all relevant methods so that both settings above are passed down to all relevant functions

@juliangruber
Member

juliangruber commented Feb 6, 2025

Why Golang?

As discussed with @pyropy in our 1:1 as well: Srdjan wasn't aware of Zinnia's Lassie integration, without which it wouldn't really have made sense to write it in JS/Zinnia (no GraphSync support for example). Now that we know we can use Zinnia, Srdjan's plan is to recreate this PR on top of it.

Question: Should this implementation live in a separate repo, or in the spark repo?

If separate:

  • The Spark code base stays clean, it doesn't get more complicated to accommodate the spot checker
  • The spot checker implementation is independent

But:

If same repo:

  • It should be easier to keep both implementations in sync, so that the retrieval checking itself is as similar as possible.

My personal vote:
Keep it separate, as the spark code base is production code, and the spot checker currently is more in PoC phase. Therefore reuse the critical parts, which are Zinnia/Lassie, but keep the rest of the logic independent.

@pyropy

pyropy commented Feb 7, 2025

Intro

After a call with @juliangruber we have decided to move away from the Go implementation for the sake of keeping our codebase simple and reusing the parts we have already written. This has led to a second implementation of the spot checker that is based on the existing Spark checker module.

Implementation

The new implementation relies heavily on the existing spark implementation. Most of the codebase remains the same (IPNI and miner info lookup, helper functions, etc.), with some changes to the core modules of spark, namely the Tasker and the Spark module itself.

The Tasker module is now used to fetch retrieval tasks for a specified round (or the current round if the round is not specified).

The Spark module is renamed to SpotChecker. Similarly to how Spark works, this module requests retrieval tasks from the Tasker module, which fetches tasks for the specified round (or the current round if the round number is not explicitly specified), and performs retrieval checks on those tasks.

The difference between Spark and SpotChecker retrieval is that Spark only tries to retrieve the root block of the file, while SpotChecker tries to retrieve the whole file (which can be up to 32GB in size).

Program flow

The main difference between the Spark and SpotChecker program flow is that Spark is running continuously while SpotChecker exits once the execution of all retrieval tasks is finished. Another difference is that for spot checks the Tasker randomly samples tasks, while the Tasker found in Spark filters for those tasks that are closest to the station id. And the final difference is that Spark only retrieves the root block, while the spot check tries to perform a full download of the file.

Below you can find a program flow diagram describing execution of Spark Spot Check.

[Image: program flow diagram]
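For clarity, a rough sketch of that flow in JavaScript; the helper names below are illustrative placeholders, not the actual Spark/SpotChecker APIs:

// Illustrative outline of the SpotChecker flow described above.
async function runSpotChecker ({ roundNumber, maxTasks } = {}) {
  // Tasker: fetch the retrieval tasks for the given round (or the current one).
  const tasks = await fetchRetrievalTasks(roundNumber) // assumed helper
  const selected = maxTasks ? tasks.slice(0, maxTasks) : tasks

  const results = []
  for (const task of selected) {
    // Unlike Spark, attempt to retrieve the whole file, not just the root block.
    const measurement = await retrieveFullFile(task) // assumed helper wrapping the retrieval
    results.push({ cid: task.cid, minerId: task.minerId, ...measurement })
  }

  // SpotChecker exits once all retrieval tasks are done; print the results.
  console.log(JSON.stringify(results, null, 2))
}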

@juliangruber
Member

Tasker module is now used to fetch retrieval tasks for a specified round (or current round if the round is not specified).

Nice that both are supported 👍

while SpotChecker tries to retrieve the whole file (which can be up to 32GB in size).

I thought SpotChecker would perform a range request? Sorry if I'm missing some current decisions.

The main difference between the Spark and SpotChecker program flow is that Spark is running continuously while SpotChecker exits once the execution of all retrieval tasks is finished. Another difference is that for spot checks the Tasker randomly samples tasks, while the Tasker found in Spark filters for those tasks that are closest to the station id.

Would it make sense for the spot checker to perform all the tasks instead of randomly selecting them?

@pyropy

pyropy commented Feb 7, 2025

I thought SpotChecker would perform a range request? Sorry if I'm missing some current decisions.

I opted to do full retrieval in the first iteration for simplicity, as I am not familiar with graphsync capabilities (and we still somewhat depend on graphsync). I will probably try range requests again as part of the spot check, maybe even in this current iteration.

Would it make sense for the spot checker to perform all the tasks instead of randomly selecting them?

I think it would. At the very least, we could do the spot check on all tasks by default and have an option to do some number of checks randomly.

@juliangruber
Member

juliangruber commented Feb 7, 2025

Let's keep it as simple as possible for the first iteration and just do full retrieval of all tasks. If we are concerned about the spot check taking too long, what about adding an option maxTasks=n?

@pyropy

pyropy commented Feb 7, 2025

Let's keep it as simple as possible for the first iteration and just do full retrieval of all tasks. If we are concerned with the spot check taking too long, what about adding an option maxTasks=n?

I think that's a better idea; we'll go with that.

@pyropy

pyropy commented Feb 10, 2025

I opted to do full retrieval in the first iteration for simplicity, as I am not familiar with graphsync capabilities (and we still somewhat depend on graphsync). I will probably try range requests again as part of the spot check, maybe even in this current iteration.

I have since changed the spot checks to perform a byte-range check by default. The reason for doing so is that we avoid keeping the whole piece file in memory (which can be almost 32GB in size). We can further improve this by getting the piece size and randomising the range we want to fetch while keeping its size to a sane amount (right now we have capped the range at 200MB, starting at byte 0 and ending at the 200MB mark).
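A minimal sketch of that randomised-range idea, assuming we already know the piece size; the names here are illustrative:

// Pick a random byte range of at most 200 MB within a piece.
// `pieceSize` is assumed to come from the piece/deal info we already fetch.
const MAX_RANGE_SIZE = 200 * 1024 * 1024 // 200 MB cap

function pickRandomRange (pieceSize, maxRangeSize = MAX_RANGE_SIZE) {
  const rangeSize = Math.min(maxRangeSize, pieceSize)
  // Choose a random start so the whole range still fits inside the piece.
  const maxStart = pieceSize - rangeSize
  const start = Math.floor(Math.random() * (maxStart + 1))
  return { start, end: start + rangeSize - 1 } // inclusive byte offsets
}

// Example: a 32 GiB piece yields a random 200 MB window somewhere inside it.
console.log(pickRandomRange(32 * 1024 ** 3))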

@pyropy

pyropy commented Feb 10, 2025

Currently we'll only print results of spot checks without actually evaluating the result. Should there be some mechanism that would actually evaluate results of spot checks so that storage provider RSR scores actually get affected by it? Maybe introducing some other score would also be an option.

@juliangruber
Member

Currently we'll only print results of spot checks without actually evaluating the result. Should there be some mechanism that would actually evaluate results of spot checks so that storage provider RSR scores actually get affected by it? Maybe introducing some other score would also be an option.

Since we trust the spot-checking node, I believe evaluation isn't possible. Also, we just perform one retrieval per deal, so the RSR per SP would most likely be either 100% or 0%.

@bajtos
Member Author

bajtos commented Feb 11, 2025

Let's keep the first iteration simple. As far as I am concerned, printing the results of spot checks is good enough 👍🏻

If you want to make this more useful, format the output to make it easy to process further - e.g. to copy & paste into Slack or GitHub discussions.
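For example, a small sketch that turns the JSON results into a Markdown table ready to paste into Slack or a GitHub discussion; the result fields used here are illustrative, not the actual output schema:

// Hypothetical result shape: { minerId, cid, success, ttfbMs } per spot check.
function formatAsMarkdownTable (results) {
  const header = '| Miner | CID | Success | TTFB (ms) |\n|---|---|---|---|'
  const rows = results.map(r =>
    `| ${r.minerId} | ${r.cid} | ${r.success ? 'yes' : 'no'} | ${r.ttfbMs ?? 'n/a'} |`
  )
  return [header, ...rows].join('\n')
}

// Usage: load the results.json file produced by the spot checker and print the table.
// import fs from 'node:fs/promises'
// const results = JSON.parse(await fs.readFile('results.json', 'utf8'))
// console.log(formatAsMarkdownTable(results))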

@bajtos bajtos moved this to 🏗 in progress in Space Meridian Feb 14, 2025